‘Similarity’ Approximate String Matching library is now on GitHub


Guerrilla Analytics Challenge

In a Guerrilla Analytics environment, available tooling is often limited. There is rarely enough budget, time, or IT flexibility to get all the tools you want.

On many jobs, I find myself using Microsoft SQL Server as the project RDBMS. Out of the box, SQL Server does not yet have a fuzzy match capability. You need to install additional tools such as SSIS to avail of fuzzy matching. Even then, SSIS is a GUI-driven application, which contradicts a key Guerrilla Analytics principle. In a Guerrilla Analytics environment, you would much rather have fuzzy match capabilities available in SQL code. This is where the Similarity library comes in handy.

Introducing Similarity

Similarity is a wrapper around the SimMetrics string matching library created by Sheffield University and funded by an IRC sponsored by EPSRC, grant number GR/N15764/01.

SimMetrics includes approximate string comparison algorithms such as:

  • Levenshtein
  • Jaro
  • Jaro-Winkler
  • Needleman-Wunsch
  • and many more
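
To give a feel for what these algorithms compute, here is a minimal Python sketch of Levenshtein edit distance, turned into a similarity score between 0 and 1. It is an illustration only, not the SimMetrics code. The function names are my own, and normalising by the length of the longer string is an assumption for the example rather than a statement of how SimMetrics scales its scores.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    # Keep the shorter string as the inner loop to limit memory use.
    if len(a) < len(b):
        a, b = b, a

    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Score in [0, 1]; 1.0 means identical.
    Assumption: normalise by the length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein_distance(a, b) / longest
```

With this sketch, "Smith" and "Smyth" are one substitution apart, so they score 0.8. That is the kind of near-match an exact SQL join would miss.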

The Similarity wrapper makes these SimMetrics algorithms available in-line in SQL Server so you can call them from SQL code.

The approach for creating this wrapper was inspired by this blog post. I’ve added to the original code to produce a primitive Windows end-to-end build process that creates a SimMetrics C# DLL library and loads it into a Microsoft SQL Server database.

Go check it out

You can find more information, installation instructions, and the latest version on GitHub. Your contributions and comments are welcome!