Similarity version 1.1.0 – Approximate String Matching for SQL Server

A while back I announced an early release of similarity on GitHub in a blog post. Similarity wraps SQL Server functions around the SimMetrics approximate string matching library, making the library’s functions available in SQL Server. Version 1.1.0 has now been released and is available on GitHub here.

Advantages of Similarity

  • This library gives you approximate string matching for free. You don’t have to resort to expensive SQL Server additions like SSIS.
  • The functions are inline in your SQL code. You don’t have to pipe your data through external tools. This is great for a Guerrilla Analytics environment where you prefer to do everything through code that can be version controlled.
  • There is a wide variety of functions to choose from. Some functions are more appropriate for certain types of matches in certain problem domains, e.g. comparing URLs or comparing common names.
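
If you have not used approximate string matching before, here is a rough idea of what one of these measures does. This is a minimal Python sketch of a normalised Levenshtein (edit distance) score, written only for illustration. It is not the SimMetrics or Similarity code, and the function names `levenshtein_distance` and `levenshtein_similarity` are my own, not the names of the SQL Server functions.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character edits needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the working row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Scale the distance to a score from 0.0 (nothing shared) to 1.0 (identical)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein_distance(a, b) / longest


if __name__ == "__main__":
    print(levenshtein_similarity("Jon Smith", "John Smith"))
```

Here "Jon Smith" and "John Smith" differ by a single inserted character across ten characters, so the score comes out at about 0.9. Other measures weight differences in other ways, which is why some suit names better and others suit URLs.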

What’s new

Version 1.1.0 sees several improvements aimed at making the library easier to install and use and making it easier for others to contribute.

  • The entire project is now driven by an Apache Ant build file. This covers similarity as well as the original C# SimMetrics code.
  • The project uses semantic versioning.
  • There is a small set of SQL install scripts if you just want the functions.
  • The project now follows git-flow development conventions to make it easy to contribute.


Data Scientists Need a Better Operating Model

A Data Science Operating Model


In my job, I interview many potential Data Scientists and Data Analysts. I have also managed people with a wide range of experience from interns to seasoned PhDs with degrees in fields including Computer Science, Chemistry, Physics, Mathematics, Engineering and the Humanities. Just last week I had several conversations with prospective Data Scientists who are early in their careers and wondering what projects they should try to get on, what technologies they should learn and what additional courses they should study.

In many cases, the reason Data Scientists struggle on projects has nothing to do with the technical complexity of problems or any lack of Data Science skills. They have all of that from their study and training, and they are motivated people who are passionate about their field.

In fact, what makes Data Science difficult for many is the complexity of operating in a Data Science project environment. Specifically, a Data Scientist has to operate in an environment that looks like the following.

  • Dynamics of data: data will change over the course of most projects. It will be refreshed, added to, replaced and repaired. Manual data sources are a common way of interfacing with other team members outside the Data Science team. Since much Data Science involves bringing together disparate data sources in novel ways, it is rare for all of this data to arrive at the same time and to schedule. So Data Scientists have to cope with trying to design and implement their work on top of a base of data that is always in flux.
  • Dynamics of requirements: Data Science is exploratory. You really don’t know what’s in the data until you have worked with it. Typically several algorithms and analyses have to be tried out. The insights from these activities often lead to the project taking a new direction and new analyses being framed for these new requirements.
  • Dynamics of people: it is rare to work in isolation. A Data Scientist will typically interact with IT, warehousing, developers, business SMEs, third party data providers and, of course, their teammates and their customer. This means that other people are providing inputs to their data, other people are writing code and creating data sets they depend on and other people are presenting results they may have contributed to. When other team members leave or take vacation, the Data Scientist may be expected to take over their work.
  • Constraints on time and resources: despite the dynamics above, the Data Scientist will be expected to add value and deliver successfully in limited time and with limited resources. You don’t always get the ideal technology stack or one that you are familiar with. You don’t always get all the skill sets you need on a project. And you don’t always get all the data for a perfect analysis.

If a Data Scientist does not have methods for coping with these dynamics and constraints then they will struggle to perform. Ultimately, they will rarely see the more advanced analytics where they can really add value.

  • They become mired in forensics of their own work and their team’s work
  • Time is wasted investigating and explaining inconsistencies
  • Deliverables must be rewritten because the original cannot be reproduced or cannot be explained
  • A team descends into reactively producing analyses rather than leading the project from their data and their deliverables
  • Results are plain wrong because of the chaos that arises from project dynamics and constraints

A Data Science Operating Model with Guerrilla Analytics

Guerrilla Analytics and its 7 Principles provide a tried and tested operating model for Data Scientists. It has been used in many high pressure, dynamic and constrained project environments to deliver analyses that are reproducible, auditable and explainable.

This Guerrilla Analytics operating model breaks Data Science activities into the following components, highlighting the challenges faced in each component and offering guidelines on how to overcome these challenges.

  • Data Extraction: how data is extracted and transported by a team in a traceable manner
  • Data Receipt: how data should be received and logged by a team
  • Data Load: how to load multiple versions of data into an analytics environment without breaking data provenance
  • Coding: how data should be manipulated in ways that promote flexibility, testability, audit and agility. How to structure code and how to mix multiple tools and programming languages without being overwhelmed.
  • Work products and Reports: how to produce multiple versions of agile work products and project milestone reports so they can be tracked easily with a customer or fellow team members
  • Building consolidated analytics: how to identify and control consolidated understanding, business rules and data sets that emerge over the course of a project to promote efficiency and consistency and to avoid re-inventing the wheel
  • Testing: how to test analytics code and data sets in a fast paced environment
  • Workflows: simple workflows for peer review and quality control

Operating models may not produce beautiful visualizations or involve high end statistics and machine learning. However, they do allow Data Scientists to hit the ground running. They provide Data Scientists with the tools they need to survive real world project environments. This in turn improves the Data Scientist’s coordination with team members, their efficiency and their credibility, and ultimately increases their opportunities to add value.

We expect methodology from traditional laboratory scientists. Let’s expect the same from Data Scientists.

The Guerrilla Analytics Principles


There is now a page giving an overview of the 7 Guerrilla Analytics Principles.

I designed the principles to help avoid the chaos introduced by the dynamics, complexity and constraints of data projects. You will find the principles helpful if you work in Data Science, Data Mining, Statistical Analysis, Machine Learning or any field that uses these techniques.

The Guerrilla Analytics Principles have been applied successfully to many high profile and high pressure projects in domains including Financial Services, Identity and Access Management, Audit, Fraud, Customer Analytics and Forensics.

You can read more about the Guerrilla Analytics Principles in my book Guerrilla Analytics: A Practical Approach to Working with Data. Here you will find almost 100 practice tips from across the Data Science life cycle showing you how to implement these principles in real-world situations.

Do you have your own data science experiences and principles? Let me know by getting in touch!