‘Similarity’ Approximate String Matching library is now on GitHub
Guerrilla Analytics Challenge
In a Guerrilla Analytics environment, available tooling is often limited. There is rarely enough budget, time or IT flexibility to get all the tools you want.
On many jobs, I find myself using Microsoft SQL Server as the project RDBMS. Out of the box, SQL Server does not yet have a fuzzy match capability. You need to install additional tools such as SQL Server Integration Services (SSIS) to avail of fuzzy matching. Even then, SSIS is a GUI-driven application, which contradicts a key Guerrilla Analytics principle. In a Guerrilla Analytics environment, you would much rather have fuzzy match capabilities available directly in SQL code. This is where the Similarity library comes in handy.
Similarity is a wrapper around the SimMetrics library. SimMetrics includes approximate string comparison algorithms such as:
- and many more
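To give a feel for what an approximate string comparison algorithm does, here is a minimal Python sketch of one well-known measure, Levenshtein edit distance, scaled to a score between 0 and 1. This is an illustration only: it is not the SimMetrics code, the function names `levenshtein` and `similarity` are made up for this example, and the normalisation by the longer string's length is an assumption rather than the library's documented behaviour.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # the distance is symmetric; keep the inner row short
    # previous[j] holds the distance between the prefix of a seen so far and b[:j]
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into an empty string costs i deletions
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Scale the distance to 0..1, where 1.0 means the strings are identical.

    Dividing by the longer length is an assumption made for this sketch.
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(a, b) / longest
```

A score like this is what makes fuzzy matching usable in a query: instead of testing two names for equality, you compare their similarity against a threshold that suits your data.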
The approach for creating this wrapper was inspired by this blog post. I’ve added to the original code to produce a basic end-to-end Windows build process that compiles the SimMetrics C# library into a DLL and loads it into a Microsoft SQL Server database.
Go check it out
You can find more information, installation instructions and the latest version on GitHub. Your contributions and comments are welcome!