Data Science involves applying the scientific method to the discovery of opportunities and efficiencies in business data. An essential part of the scientific method is reproducibility. Reproducible Data Science is essential for scientific credibility, but it also improves your Data Science efficiency in three key ways. If you start to apply the 7 Principles of Guerrilla Analytics, your teams will quickly achieve reproducibility and benefit from these efficiencies.
Faster iterations: You can iterate more quickly if your outputs are tracked and the code version and source data behind those outputs can be retrieved. This is the only way to keep on top of what is an inherently complex process of changing data, changing understanding, changing requirements and changing combinations of inputs. With reproducibility under control, you are not only working at speed with your customer but also earning their trust, because you are always on top of your evolving numbers and KPIs.
Faster reviews: Your team can review and hand over work more easily when everything works out of the box. There is no struggle to rebuild environments. There are no repeated conversations about which code was throwaway and which code is essential to the science being reviewed. There is no need for team forensics just to understand what exactly your team sent out the door.
Faster to production: A reproducible data processing pipeline, version-controlled algorithm code and environments-as-code all reduce the friction of moving algorithms from a Data Science team into a production development team. Day 1: point the Data Science code at the production environment. Day 2: begin refining and refactoring without breaking functionality.
If you are curious about how reproducible Data Science was achieved and maintained in teams of up to 15 analysts on large and fast-paced projects, then please have a look at the book “Guerrilla Analytics: A Practical Approach to Working with Data”.
Data Science is a varied mix of activities. It typically includes database design, algorithm coding, interactive analytics (for example, in IPython) and visualization coding, as well as reporting and sharing of results. All of this is highly iterative because of the exploratory nature of Data Science, where requirements and data change often.
Without some kind of control, Data Science projects quickly become impossible to manage. Code is fragmented across the languages and technologies involved. Key numbers and analyses cannot be reproduced. This is exacerbated when a team grows and a project is running at pace. The team drowns in the complexity of its own creations.
Data Science is ‘defensive’ if it can withstand the disruptions of changing data and requirements while still producing repeatable, explainable insights. Put another way, Defensive Data Science maintains data provenance. Fortunately, the Guerrilla Analytics Principles make defensive Data Science easy. This blog post describes how.
Why Do Defensive Data Science?
There are many reasons to strive for ‘Defensive Data Science’. In teams of up to 12 data scientists that I have run for very demanding stakeholders, maintaining data provenance increased team efficiency and made Data Science easier to manage and maintain. The advantages include:
Reduction in time wasted tracking data sources, data modifications and analysis outputs. Without some form of data provenance in place, a team wastes time trying to find out where data came from, how data was modified by continuously evolving code, and which of many versions of an analysis were delivered to a customer.
Fewer errors. If you have data provenance in place, it is much easier to track everything that a team is doing. You know which version of your code, which version of your data, which version of your 3rd party libraries and which version of your business understanding came together to make any number that your team produced. You can stand over your analyses with confidence.
Easier sharing of work. If all the inputs and outputs of your Data Science team’s work are easy to identify then sharing of work, collaboration, handovers and on-boarding of new team members are less of a toll on the team.
How should a team do Defensive Data Science?
Fortunately, many of the challenges of defensive Data Science have already been addressed in Software Engineering. There are decades of tools and techniques that can easily be adapted to the needs of defensive Data Science. Follow these steps when moving your work into a more defensive setup.
Semantic project structures. Firstly, get your project structure right. By following simple conventions on project structure, a team can know where key project artefacts are located with minimal overhead of documentation. The team knows where to put things and doesn’t spend time trying to figure this out. Some of the most important artefacts are incoming data and its versions, code and its versions, 3rd party libraries and their versions, and team outputs and their versions. Guerrilla Analytics advocates simple flat project structures, giving a unique ID to incoming data and outgoing work products.
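As a minimal sketch of such a convention, the snippet below lays out a flat project skeleton where every data receipt and work product gets its own uniquely identified folder. The folder names and ID scheme here are illustrative, not the book's exact convention.

```python
from pathlib import Path

# Illustrative flat layout (names are hypothetical): one folder per
# data receipt and per work product, each carrying a unique ID.
LAYOUT = [
    "data/d001_customer_extract",   # raw data drop, never modified in place
    "data/d002_transactions_v2",    # a later version of a data source
    "wp/wp001_churn_summary",       # a work product delivered to the customer
    "wp/wp002_revenue_kpis",
    "src",                          # version-controlled analysis code
    "doc",                          # data dictionaries, business rules
]

def create_project(root: str) -> Path:
    """Create the skeleton so every artefact has one obvious home."""
    base = Path(root)
    for folder in LAYOUT:
        (base / folder).mkdir(parents=True, exist_ok=True)
    return base

base = create_project("example_project")
for p in sorted(base.rglob("*")):
    if p.is_dir():
        print(p.relative_to(base).as_posix())
```

Because every artefact's ID appears in exactly one folder name, "where is the data behind wp001?" becomes a filesystem lookup rather than a conversation.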
Clear data steps. Popular Data Science languages such as SQL and Python are quite free-form, permitting many routes to solving a given problem. This is their strength, but it is also a difficulty when it comes to managing a team solving a diverse range of problems. You do not want to discourage this exploratory behaviour. However, by breaking down data extraction, transformation and reshaping into clearly defined, modular ‘data steps’, it becomes much easier to automate, test and modify data flows. New steps can easily be introduced, irrelevant steps removed and ‘component’ tests added where needed. You get the best of both worlds: exploration and reproducibility.
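The idea can be sketched in a few lines of Python: each step is a small function from records to records, so steps can be tested, reordered or removed independently. The record fields and step names below are invented for illustration.

```python
# Each 'data step' takes a list of dict records and returns a new list,
# never mutating its input, so any step can be tested in isolation.

def step_parse_amounts(rows):
    """Convert the raw 'amount' string into a float."""
    return [{**r, "amount": float(r["amount"])} for r in rows]

def step_drop_refunds(rows):
    """Remove negative-amount rows (refunds) from the analysis."""
    return [r for r in rows if r["amount"] >= 0]

def step_add_band(rows):
    """Derive a simple size band used by later reporting steps."""
    return [{**r, "band": "large" if r["amount"] >= 100 else "small"}
            for r in rows]

PIPELINE = [step_parse_amounts, step_drop_refunds, step_add_band]

def run_steps(rows, steps=PIPELINE):
    """Apply each data step in order to produce the final dataset."""
    for step in steps:
        rows = step(rows)
    return rows

raw = [
    {"id": 1, "amount": "250.0"},
    {"id": 2, "amount": "-40.0"},   # a refund, dropped mid-pipeline
    {"id": 3, "amount": "20.0"},
]
clean = run_steps(raw)
print(clean)
```

Adding a new transformation is a one-line change to `PIPELINE`, and a ‘component’ test is just an assertion against a single step function.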
Automation with pipelines. With an easily understood project structure in place and clear modular code, the next most important aspect of defensive Data Science is probably automation. Automation with pipelines means using tools (custom or 3rd party) that facilitate executing code ‘in a single click’. Since Data Science is highly iterative and also complex, working without some form of automation risks not discovering stale data bugs and broken data flows. Having easy automation available encourages ‘continuous integration’ behaviours, running all code and tests often.
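A ‘single click’ runner does not need to be a heavyweight CI system. The sketch below, with invented task names, simply executes every build and test function in order, failing fast and reporting timings, so stale data and broken flows surface on every run.

```python
import time

# Minimal one-click runner (illustrative, not a real CI tool):
# run every task in order, fail fast, and report per-task timing.

def build_load():
    """Stand-in for loading a raw dataset."""
    return list(range(5))

def build_transform():
    """Stand-in for a data transformation step."""
    return [x * 2 for x in build_load()]

def test_transform():
    """A 'component' test run on every single-click execution."""
    assert build_transform() == [0, 2, 4, 6, 8]

TASKS = [build_load, build_transform, test_transform]

def run_all(tasks=TASKS):
    """Execute all tasks in order; any failing assertion halts the run."""
    for task in tasks:
        start = time.perf_counter()
        task()
        print(f"{task.__name__}: ok ({time.perf_counter() - start:.3f}s)")

run_all()
```

Because tests live in the same task list as builds, re-running everything after a data refresh is the default behaviour rather than a deliberate extra effort.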
Version control. While you can survive without version control tools, the ability to track every team member’s changes, develop code ‘branches’ in parallel, tag released code and roll back to earlier versions is essential for a team that needs to adapt to unforeseen changes in data and requirements. Version control for Data Science differs in some ways from version control for software development.
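One Data Science-specific use of version control is stamping every work product with the code version that produced it. The sketch below assumes the analysis runs inside a git repository and falls back to 'unknown' otherwise; the output name is hypothetical.

```python
import json
import subprocess
from datetime import datetime, timezone

def current_commit() -> str:
    """Return the current git commit hash, or 'unknown' outside a repo."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return out.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return "unknown"

def provenance_record(output_name: str) -> dict:
    """Bundle an output's name with the code version and timestamp."""
    return {
        "output": output_name,
        "code_version": current_commit(),
        "generated_at": datetime.now(timezone.utc).isoformat(),
    }

record = provenance_record("wp001_churn_summary.csv")
print(json.dumps(record, indent=2))
```

Saving this record alongside each delivered work product means any number a customer queries can be traced straight back to the exact code that produced it.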
Documentation and workflow tracking. The previous steps go a long way towards enabling a team to produce defensive Data Science. While the overhead of documentation must be kept to a minimum in a very dynamic environment, there are some basic activities that benefit from documentation. Having a wiki makes it easy to version control changes to evolving team knowledge. Typical uses of a wiki for defensive Data Science include documenting data dictionaries and business rules. Furthermore, a simple workflow tracker for keeping tabs on what the team is doing and logging received data will make it much easier to maintain data provenance across the team’s activities.
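Logging received data can be as lightweight as an append-only CSV file. The sketch below uses invented field names and example entries; a wiki page or tracker ticket per data delivery serves the same purpose.

```python
import csv
from pathlib import Path

# A minimal append-only data-receipt log (illustrative): one row per
# data delivery, so the team can always trace a dataset's origin.

LOG_PATH = Path("data_log.csv")
FIELDS = ["data_id", "received", "source", "description"]

def log_receipt(data_id, received, source, description, path=LOG_PATH):
    """Append one delivery record, writing a header on first use."""
    new_file = not path.exists()
    with path.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if new_file:
            writer.writeheader()
        writer.writerow({
            "data_id": data_id,
            "received": received,
            "source": source,
            "description": description,
        })

log_receipt("d003", "2015-06-01", "finance team", "Q2 transactions extract")
print(LOG_PATH.read_text())
```

The `data_id` here is the same unique ID used in the project folder structure, which is what ties the log, the filesystem and the team’s outputs together into one provenance trail.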
Defensive Data Science need not be an administrative burden on a team. The benefits are many, and a few simple principles go a long way towards making a team easier to manage and more adaptable to dynamic project environments.
Do you have anything you wish to add? Please get in touch or have a look at Guerrilla Analytics. I’d be happy to discuss.