Data Science is ‘defensive’ if it can withstand the disruptions of changing data and requirements while still producing repeatable, explainable insights. Put another way, Defensive Data Science maintains data provenance. Fortunately, the Guerrilla Analytics Principles make it easy to do defensive Data Science. This blog post describes how.
The danger of bias hasn’t been given enough consideration in Data Science. Bias is anything that would cause us to skew our conclusions and not treat results and evidence objectively. Bias is sometimes unavoidable, sometimes accidental and, unfortunately, sometimes deliberate. While bias is well recognised as a danger in mainstream science, I think Data Science could benefit from improving in this area. In this post I categorise the types of bias encountered in typical Data Science work. I have gathered these from recent blog posts and a discussion in my PhD thesis. I also show how to reduce bias using some of the principles you can learn about in Guerrilla Analytics: A Practical Approach to Working with Data.
I recently read a Harvard Business Review (HBR) article, “You need an algorithm, not a Data Scientist”. Other articles present similar arguments. I disagree. Data Scientists and automation (data products, algorithms, production code, whatever) are complementary functions. What you actually need is a Data Scientist and then an algorithm.
McKinsey recently published an excellent guide to Machine Learning for Executives. In this post I categorise the key points that stood out from the perspective of establishing machine learning in an organisation. The key takeaway for me was that without leadership from the C-suite, machine learning will be limited to being a small part of existing operational processes.
Several topical questions were recently asked on Data Science Central. This post addresses the question “What best practices do you recommend, when starting and working on enterprise analytics projects?” I have worked as a Data Scientist for 8 years now, after completing a PhD on “Design of Experiments for Tuning Optimisation Algorithms”. So I have a formal background in rigorous experiment design for Data Science and have also managed some pretty complex and fast-paced projects in sectors including Financial Services, IT, Insurance, Government and Audit.
A while back I announced an early release of similarity on GitHub in a blog post. Similarity wraps SQL Server functions around the SimMetrics approximate string matching library, making the library’s functions available in SQL Server. Version 1.1.0 has now been released and is available on GitHub. Version 1.1.0 sees several improvements aimed at making the library easier to install and use and making it easier for others to contribute.
In many cases, the struggles Data Scientists face on projects have nothing to do with the technical complexity of problems or any lack of Data Science skills. They have all of that from their study and training, and they are motivated people who are passionate about their field. What makes Data Science difficult for many is the complexity of operating in a Data Science project environment.
I designed the principles to help avoid the chaos introduced by the dynamics, complexity and constraints of data projects. You will find the principles helpful if you work in Data Science, Data Mining, Statistical Analysis, Machine Learning or any field that uses these techniques.
The Guerrilla Analytics Principles have been applied successfully to many high-profile and high-pressure projects in domains including Financial Services, Identity and Access Management, Audit, Fraud, Customer Analytics and Forensics.
In a Guerrilla Analytics environment, available tooling is often limited. There is rarely enough budget, time or IT flexibility to get all the tools you want.
On many jobs, I find myself using Microsoft SQL Server as the project RDBMS. Out of the box, SQL Server does not yet have a fuzzy match capability. You need to install additional tools such as SSIS to avail of fuzzy matching. Even then, SSIS is a GUI-driven application which contradicts a key Guerrilla Analytics Principle. In a Guerrilla Analytics environment, you would much rather have fuzzy match capabilities available in SQL code. This is where the following Similarity library comes in handy.
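To make the idea of a fuzzy match concrete, here is a minimal sketch in Python of one common approximate string matching measure, a normalised Levenshtein (edit distance) similarity. This is only an illustration of the general technique. It is not the Similarity library's API or the SimMetrics implementation, and the function names `levenshtein` and `similarity_score` are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions
    or substitutions needed to turn string a into string b."""
    # Keep the shorter string in the inner loop to use less memory.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def similarity_score(a: str, b: str) -> float:
    """Scale edit distance to a score between 0.0 (nothing in common)
    and 1.0 (identical). Hypothetical helper for illustration only."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(a, b) / longest


if __name__ == "__main__":
    print(similarity_score("Jon Smith", "John Smith"))
```

A score close to 1.0 suggests two values probably refer to the same thing even though they are not an exact match, which is the kind of comparison you would like to be able to make directly in SQL code.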
Here are the slides from a talk I gave today to the Information Technology Department at the National University of Ireland, Galway. Thanks to Michael Madden for the opportunity to speak. The talk was about how Guerrilla Analytics principles and practice tips help you do Data Science in circumstances that are very dynamic, constrained and […]