Reproducible Data Science: faster iterations, reviews and production
Data Science involves applying the scientific method to the discovery of opportunities and efficiencies in business data. An essential part of the scientific method is reproducibility. Reproducible Data Science is essential for scientific credibility but also improves your Data Science efficiency in 3 keys ways. If you start to apply the 7 Principles of Guerrilla Analytics your teams will quickly achieve reproducibility and benefit from these efficiencies.
- Faster iterations: You can iterate quicker if your outputs are tracked, and the code version and source data for those outputs can be retrieved. This is the only way to keep track of what is an inherently complex process of changing data, changing understanding, changing requirements and combinations of inputs. With reproducibility under control, not only are you working at speed with your customer but you also gain their credibility because you are always on top of your evolving numbers and KPIs.
- Faster reviews: Your team can review and hand over work more easily when everything works out of the box. There is no struggle rebuilding environments. There are no repeated conversations over what code was throw away and what code is essential to the science being reviewed. There is no need to do team forensics just to understand what exactly your team sent out the door.
- Faster to production: a reproducible data processing pipeline, version controlled algorithm code and environments as code all reduce the friction in moving algorithms from a Data Science team into a production development team. Day 1: point data science code at production environment. Day 2: begin refining and refactoring without breaking functionality.
If you are curious about how reproducible Data Science was achieved and maintained in teams of up to 15 analysts on large and fast paced projects then please have a look at the book “Guerrilla Analytics: a practical approach to working with data” (USA) (UK).