Similarity version 1.1.0 – Approximate String Matching for SQL Server

A while back I announced an early release of Similarity on GitHub in a blog post. Similarity wraps SQL Server functions around the SimMetrics approximate string matching library, making the library’s functions available in SQL Server. Version 1.1.0 has now been released and is available on GitHub.

Advantages of Similarity

  • this library gives you approximate string matching for free. You don’t have to resort to expensive SQL Server additions like SSIS.
  • the functions are inline in your SQL code. You don’t have to pipe your data through external tools. This is great for a Guerrilla Analytics environment where you prefer to do everything through code that can be version controlled.
  • there is a wide variety of functions to choose from. Some functions are more appropriate for certain types of matches in certain problem domains, e.g. comparing URLs or comparing common names.

What’s new

Version 1.1.0 brings several improvements that make the library easier to install and use, and easier for others to contribute to.

  • the entire project is now driven by an Apache Ant build file. This covers similarity as well as the original C# SimMetrics code.
  • the project uses semantic versioning.
  • there is a small set of SQL install scripts if you just want the functions.
  • the project now follows git-flow development conventions to make it easy to contribute.


‘Similarity’ Approximate String Matching library is now on GitHub


Guerrilla Analytics Challenge

In a Guerrilla Analytics environment, available tooling is often limited. There is rarely enough budget, time or IT flexibility to get all the tools you want.

On many jobs, I find myself using Microsoft SQL Server as the project RDBMS. Out of the box, SQL Server does not yet have a fuzzy match capability. You need to install additional tools such as SSIS to avail of fuzzy matching. Even then, SSIS is a GUI-driven application, which contradicts a key Guerrilla Analytics principle. In a Guerrilla Analytics environment, you would much rather have fuzzy match capabilities available in SQL code. This is where the Similarity library comes in handy.

Introducing Similarity

Similarity is a wrapper around the SimMetrics string matching library created by Sheffield University and funded by an IRC sponsored by EPSRC, grant number GR/N15764/01.

SimMetrics includes approximate string comparison algorithms such as:

  • Levenshtein
  • Jaro
  • Jaro-Winkler
  • Needleman-Wunsch
  • and many more

The Similarity wrapper makes these SimMetrics algorithms available in-line in SQL Server so you can call them from SQL code.
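
To give a feel for what an approximate string comparison computes, here is a minimal Python sketch of Levenshtein edit distance, the first algorithm in the list above. It is an illustration only and not the SimMetrics or Similarity code. The function names are hypothetical, and dividing by the longer string's length to get a 0 to 1 score is an assumption made for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep rows short

    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Scale the distance to a 0..1 score (assumed normalisation, 1.0 = identical)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(a, b) / longest


if __name__ == "__main__":
    for left, right in [("Smith", "Smyth"), ("kitten", "sitting")]:
        print(left, right, levenshtein(left, right),
              round(levenshtein_similarity(left, right), 3))
```

A score like this is what makes fuzzy joins possible: two name columns can be compared pairwise and only the pairs above a chosen threshold kept.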

The approach for creating this wrapper was inspired by this blog post. I’ve added to the original code to produce a primitive Windows end-to-end build process that creates a SimMetrics C# DLL library and loads it into a Microsoft SQL Server database.

Go check it out

More information, installation instructions and the latest version are on GitHub. Your contributions and comments are welcome!