‘Similarity’ Approximate String Matching library is now on GitHub

Guerrilla Analytics Challenge

In a Guerrilla Analytics environment, available tooling is often limited. There is either not enough budget, time or IT flexibility to get all the tools you want.

On many jobs, I find myself using Microsoft SQL Server as the project RDBMS. Out of the box, SQL Server does not yet have a fuzzy match capability. You need to install additional tools such as SSIS to avail of fuzzy matching. Even then, SSIS is a GUI-driven application which contradicts a key Guerrilla Analytics Principle. In a Guerrilla Analytics environment, you would much rather have fuzzy match capabilities available in SQL code. This is where the following Similarity library comes in handy.

Introducing Similarity

Similarity is a wrapper around the SimMetrics string matching library, created at the University of Sheffield and funded by an Interdisciplinary Research Collaboration (IRC) sponsored by the EPSRC, grant number GR/N15764/01.

SimMetrics includes approximate string comparison algorithms such as:

  • Levenshtein
  • Jaro
  • Jaro-Winkler
  • Needleman
  • and many more

The Similarity wrapper makes these SimMetrics algorithms available in-line in SQL Server so you can call them from SQL code.
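
As a rough illustration, here is the kind of query this enables. The function name and signature (dbo.JaroWinkler) and the table and column names are assumptions for the sketch, not the wrapper’s documented API:

    -- Hypothetical fuzzy match between two customer lists using a wrapped
    -- SimMetrics similarity function exposed as a SQL scalar function.
    -- dbo.JaroWinkler and the table/column names are assumptions.
    SELECT a.customer_name,
           b.customer_name AS candidate_match,
           dbo.JaroWinkler(a.customer_name, b.customer_name) AS similarity_score
    FROM   dbo.customers_a AS a
    JOIN   dbo.customers_b AS b
      ON   dbo.JaroWinkler(a.customer_name, b.customer_name) >= 0.90;

Because the comparison is an ordinary scalar function call, it can sit anywhere in your SQL: in joins, CASE expressions or ranking logic.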

The approach for creating this wrapper was inspired by this blog post. I’ve added to the original code to produce a primitive Windows end-to-end build process that creates a SimMetrics C# DLL library and loads it into a Microsoft SQL Server database.
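
For orientation, the loading step relies on SQL Server’s standard CLR integration. A minimal sketch of registering such a DLL is below; the assembly, namespace, class and method names are placeholders, not the names the build actually uses:

    -- Enable CLR integration (requires appropriate server permissions).
    EXEC sp_configure 'clr enabled', 1;
    RECONFIGURE;
    GO

    -- Register the compiled C# DLL and expose one of its methods
    -- as a T-SQL scalar function. All names below are placeholders.
    CREATE ASSEMBLY SimilarityLib
    FROM 'C:\lib\Similarity.dll'
    WITH PERMISSION_SET = SAFE;
    GO

    CREATE FUNCTION dbo.JaroWinkler (@s NVARCHAR(4000), @t NVARCHAR(4000))
    RETURNS FLOAT
    AS EXTERNAL NAME SimilarityLib.[Similarity.Functions].JaroWinkler;
    GO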

Go check it out

More information, installation instructions and the latest version are available on GitHub. Your contributions and comments are welcome!

Data Science Workflows – A Reality Check

Data Science projects aren’t a nice clean cycle of well defined stages. More often, they are a slog towards delivery with repeated setbacks. Most steps are highly iterative between your Data Science team and IT or your Data Science team and the business. These setbacks are due to disruptions. Recognising this and identifying the cause of these disruptions is the first step in mitigating their impact on your delivery with Guerrilla Analytics.

The Situation

Doing Data Science work in consulting (both internal and external) is complicated, and for reasons that have nothing to do with machine learning algorithms, statistics and math, or model sophistication. The causes of this complexity are far more mundane.

  • Project requirements change often, especially as data understanding improves.
  • Data is poorly understood, contains flaws you have yet to discover, and IT struggles to create the required data extracts for you.
  • Your team and the client’s team will have a variety of skills and experience.
  • The technology available may not be ideal, because of licensing costs and the client’s IT landscape.

The discussion of Data Science workflows does not sufficiently represent this reality. Most workflow representations are derived from the Cross-Industry Standard Process for Data Mining (CRISP-DM) [1].

[Figure: the CRISP-DM process diagram]

Others report variations on CRISP-DM such as the blog post referenced below [2].

[Figure: the Data Science workflow overview from [2]]

It’s all about disruptions

These workflow representations correctly capture the high level stages of Data Science, specifically:

  • defining the problem,
  • acquiring data,
  • preparing it,
  • doing some analysis and
  • reporting results

However, a more realistic representation must acknowledge that at pretty much every stage of Data Science, a variety of setbacks or new knowledge can return you to any of the previous stages. You can think of these setbacks and new knowledge as disruptions. They are disruptions because they necessitate modifying or redoing work instead of progressing directly to your goal of delivery. Here are some examples.

  • After doing some early analyses, a data profiling exercise reveals that some of your data extract has been truncated. It takes you significant time to confirm that you did not corrupt the file yourself when loading it, and then you have to go all the way back to the source for another data extract (a simple profiling check of the kind sketched after this list helps catch this early).
  • On creating a report, a business user highlights an unusual trend in your numbers. On investigation, you find a small bug in your code that, when repaired, changes the contents of your report, so the report has to be re-issued.
  • On presenting some updates to a client, you agree together that there is no value in the current approach and a different one must be taken. No new data is required, but you must now shape the data differently to apply a different kind of algorithm and analysis.
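
Staying with the truncation example, such a profiling check does not need to be elaborate. Here is a minimal sketch in T-SQL; the table name, column name and field-length limit are all hypothetical:

    -- Minimal profiling sketch for spotting truncated text fields.
    -- Table name, column name and the 255-character limit are hypothetical.
    SELECT COUNT(*)                AS row_count,
           MAX(LEN(customer_name)) AS max_name_length,
           SUM(CASE WHEN LEN(customer_name) = 255 THEN 1 ELSE 0 END)
                                   AS names_at_field_limit
    FROM   raw.customers_extract;
    -- A pile-up of values at exactly the field limit is a common sign that
    -- the extract or the load has silently truncated the data.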

The list goes on. The point is that Data Science on anything beyond a toy example is a highly iterative process in which, at every stage, your techniques and approach need to be easy to modify and re-run so that your analyses and code are robust to all of these disruptions.

The Guerrilla Analytics Workflow

Here is what I term the Guerrilla Analytics workflow. You can think of it like the game of Snakes and Ladders where any unlucky move sends you back down the board.

[Figure: the Guerrilla Analytics workflow]

The Guerrilla Analytics workflow considers Data Science as the following stages from source data through to delivery. I’ve also added some examples of typical disruptions at each of these stages.

Extract: taking data from a source system, the web or front-end system reports. Example disruptions:
  • incorrect data format extracted
  • truncated data
  • changing requirements mean different data is required

Receive: storing extracted data in the analytics environment and recording appropriate tracking information. Example disruptions:
  • lost data
  • a file system mess of old data, modified data and raw data
  • multiple copies of data files

Load: transferring data from the receipt location into an analytics environment. Example disruptions:
  • truncation of data
  • no clear link between the data source and the loaded datasets

Analytics: the data preparation, reshaping, modelling and visualization needed to solve the business problem. Example disruptions:
  • changing requirements
  • incorrect choice of analysis or model
  • dropping or overwriting records and columns so that numbers cannot be explained

Work Products and Reporting: the ad-hoc analyses and formal project deliverables. Example disruptions:
  • changing requirements
  • incorrect or damaged data
  • code bugs
  • incorrect or unsuccessful analysis

This is just a sample of the disruptions that I have experienced in my projects. I’m sure you have more to add too and it would be great to hear them.
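
Several of these disruptions, particularly truncation and losing the link between a source and its loaded dataset, can be caught with a simple reconciliation step at load time. A minimal sketch, assuming a hypothetical receipt log table that records the expected row count for each received file:

    -- Compare the row count recorded at receipt against what actually
    -- landed in the analytics environment. etl.receipt_log,
    -- raw.customers_2015_02 and the dataset name are all hypothetical.
    SELECT 'customers_2015_02' AS dataset,
           (SELECT expected_rows
              FROM etl.receipt_log
             WHERE dataset_name = 'customers_2015_02') AS expected_rows,
           (SELECT COUNT(*)
              FROM raw.customers_2015_02)              AS loaded_rows;

Any mismatch is then flagged before analysis starts, rather than discovered in a deliverable.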

Further Reading

You can learn about disruptions and the practice tips for making your Data Science robust to disruptions in my book Guerrilla Analytics: A Practical Approach to Working with Data.

References

[1] Wikipedia, “Cross-Industry Standard Process for Data Mining”, https://en.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining, accessed 2015-02-14.

[2] Communications of the ACM Blog, “Data Science Workflow: Overview and Challenges”, http://cacm.acm.org/blogs/blog-cacm/169199-data-science-workflow-overview-and-challenges/fulltext

Building Guerrilla Analytics Teams

I recently had the opportunity to present a webinar on ‘Building Guerrilla Analytics Teams’ as part of the BrightTalk ‘Business Intelligence and Analytics’ series. You can access the full recorded webinar and slides here, and the slides are embedded below.

Some really interesting questions came up at the end of the session. I’ve listed them here and will pick them up in subsequent blog posts.

  • How do you build a business case to resource and set up a data science team?
  • What is the number one tip for someone putting together a completely new data science team?
  • What role is most important when setting up a data science team?
  • What are the typical challenges faced when setting up a Guerrilla Analytics team?

You can learn more about building a Guerrilla Analytics capability in my book Guerrilla Analytics: A Practical Approach to Working with Data, which has chapters devoted to getting the right people in place, giving them the right technology and controlling everything with a minimal, lightweight process.

3 Lessons I Learned From Writing a Data Science Book – ‘Guerrilla Analytics – a practical approach to working with data’

One of the biggest challenges with writing a significant piece like a book chapter or entire book is to estimate how long it will take and plan accordingly. My best reference was my PhD which was still significantly shorter than the book’s target 90,000 words. This blog post is about the book writing process as I experienced it. I hope it helps other authors setting out on such an endeavour.

Since ‘Guerrilla Analytics: A Practical Approach to Working with Data’ is about operational aspects of agile data science, I recorded some data on the book writing process itself. Specifically, every time I finished a writing session, I recorded the number of words I’d written on that date.

My 3 Lessons

  • Progress tapers off. You’ll get more work done in the first half of your project. Don’t expect this rate of progress to be sustained all the way to your deadline.
  • Be realistic about how much you can write in a session. I found it difficult to write more than 1,500 words. Anything more was the exception for me. Track your progress and re-plan accordingly.
  • Weekends are better than weekdays. Obvious maybe! Expect to set aside your free time on weekends to get your project over the line. It is difficult to get significant amounts of work done on weekdays.

Progress tapers off

Here is my progress towards my goal of 90,000 words over an eight-month period. The plot shows the words written per session and the total word count.

[Figure: writing log progress — words per session and cumulative word count]

I began writing in late September and finished in June the following year. The line shows my total words written and the bars show the number of words written in individual writing sessions. Two things stand out:

  • Progress is faster in the first half of the project. This is because it is easier to get all your ideas ‘onto paper’ early in the writing. Once you have about three quarters of your manuscript complete, you need to be more careful about consistency of language and flow of content, and this slows you down.
  • Time off work is really productive. There are two clear bursts of productivity, shown by the dense groups of grey bars where a large number of words was written in many successive sessions. The two periods are Halloween, when I took a week off work, and Christmas, when I worked for a week from my family home.

How much did I write in a typical session?

Here’s how much I wrote in each writing session.

[Figure: words written per session]

I typically wrote about 1,000 words, with the odd session where I wrote over 3,000. This matters when you plan your project: if you’re anything like me, writing more than 1,000 words in a session will be the exception. If you only write on weekends, two sessions of roughly 1,000 words each give you about 2,000 words per week, or around 104,000 words over 52 weeks, and that quickly drops below 100,000 once you allow for holidays and other disruptions.

Are you thinking about writing something and have questions? Feel free to get in touch and best of luck!

Guerrilla Analytics – the book! Book contract signed for Autumn 2014

Great news! I will be publishing a book on Guerrilla Analytics with Morgan Kaufmann in Autumn 2014. After lots of proposal crafting and contract negotiation, the contracts have finally been signed and I can begin work. The book will be about 90,000 words on Guerrilla Analytics, covering topics such as:

  • what is data analytics and where does guerrilla analytics fit within that?
  • the principles of guerrilla analytics
  • worked examples at each stage of the data analytics workflow, from data extraction and receipt through to delivery of work products. All of these examples will be supported by practice tips, case studies and war stories. This will be a real practitioner’s book that will help you survive real analytics projects in fast-paced, dynamic environments

You’ll find this book useful if you are:

  • a Senior Manager and want to know that you have the right team and technology in place to deliver reproducible, tested analytics that stand up to audit and scrutiny and can be handed over easily when resources roll off your project
  • an analytics Manager with several direct reports. You want your team to be independent and agile without having to micromanage their work. You want to keep things simple so that everybody on the team can maintain data provenance and understand one another’s work without repeated, inefficient hand-overs and explanations
  • a data analyst who wants to do high-quality work and interact in a team without being burdened with unnecessary process and team rules.

I’m looking forward to getting started! Stay tuned for more updates and some snippets of the book as it evolves.

Guerrilla Analytics talk at Enterprise Data World, San Diego 2013

@edwardacurry and I gave a talk at Enterprise Data World 2013 in sunny San Diego. The slides are below. In this longer talk we were able to take the audience through some worked examples to illustrate how Guerrilla Analytics is applied in practice. Feedback was positive, and there was plenty of empathy from audience members whose teams are struggling with the challenges that Guerrilla Analytics addresses.

Speaker Spotlight at Enterprise Data World 2013

My interview in preparation for Enterprise Data World 2013 has just been published. The interview is pretty succinct – some opinions on recent trends and the influence of Big Data on the work I do. Looking forward to San Diego and Enterprise Data World!

Guerrilla Analytics at the Business Intelligence Congress, Orlando December 2013

@edcurry and I recently presented on Guerrilla Analytics at the Business Intelligence Congress in Orlando, Florida. The slides are here. These are some early thoughts on Guerrilla Analytics, what it is and the principles involved.