Reproducible Data Science: faster iterations, reviews and production

Data Science involves applying the scientific method to the discovery of opportunities and efficiencies in business data. An essential part of the scientific method is reproducibility. Reproducible Data Science is necessary for scientific credibility, but it also improves your Data Science efficiency in three key ways. If you start to apply the 7 Principles of Guerrilla Analytics, your teams will quickly achieve reproducibility and benefit from these efficiencies.

[su_spacer size="40"]

  1. Faster iterations: You can iterate more quickly when your outputs are tracked and the code version and source data behind each output can be retrieved. This is the only way to stay on top of an inherently complex process of changing data, changing understanding, changing requirements and combinations of inputs. With reproducibility under control, you not only work at speed with your customer but also gain their credibility, because you are always on top of your evolving numbers and KPIs.
    [su_spacer size="20"]
  2. Faster reviews: Your team can review and hand over work more easily when everything works out of the box. There is no struggle to rebuild environments. There are no repeated conversations about which code was throwaway and which code is essential to the science being reviewed. There is no need for team forensics just to understand exactly what your team sent out the door.
    [su_spacer size="20"]
  3. Faster to production: A reproducible data processing pipeline, version-controlled algorithm code and environments as code all reduce the friction of moving algorithms from a Data Science team into a production development team. Day 1: point the data science code at the production environment. Day 2: begin refining and refactoring without breaking functionality.
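The output tracking behind faster iterations can be sketched in a few lines. The snippet below is a minimal illustration, not a prescribed implementation: it assumes a git-managed project, and the `write_manifest` helper is a hypothetical name. It records the code commit and a hash of each input file alongside an output, so any deliverable can later be traced back to its exact code version and source data:

```python
import hashlib
import json
import subprocess
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path: str) -> str:
    """Hash a data file so the exact input version can be verified later."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()


def git_commit() -> str:
    """Record the code version (current git commit) that produced an output."""
    return subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()


def write_manifest(input_paths, manifest_path):
    """Write a small provenance manifest next to a deliverable.

    Hypothetical helper: captures timestamp, code version and input hashes
    so the output can be reproduced from the same code and data.
    """
    manifest = {
        "created": datetime.now(timezone.utc).isoformat(),
        "code_version": git_commit(),
        "inputs": {p: sha256_of(p) for p in input_paths},
    }
    Path(manifest_path).write_text(json.dumps(manifest, indent=2))
    return manifest
```

Calling `write_manifest(["sales.csv"], "report_manifest.json")` after producing a report leaves a record that answers, months later, exactly which code and data produced the numbers you sent out the door.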

[su_spacer size="40"]

If you are curious about how reproducible Data Science was achieved and maintained in teams of up to 15 analysts on large, fast-paced projects, please have a look at the book “Guerrilla Analytics: a practical approach to working with data” (USA) (UK).

A Definition of Data Science for Business

Vague mixes of skill sets. A focus on activities and technology. Bizarre Venn diagrams. It seems there is huge confusion over what Data Science is. Is it Big Data? Isn’t it statistics? Is it something else entirely? This confusion causes untold problems. It leads to vendor and recruiter hype. It leads to inflated career expectations among those who work with data. It leads to the rebranding of solid, established and much-needed fields like Analytics, Business Intelligence and Statistics.

Wouldn’t it be better if you could clearly state what you do as a Data Scientist? You probably agree your work life would be easier if your colleagues and customers could understand what you do.

  • A biologist wouldn’t say they are a biologist because they work with petri dishes; they are a biologist because they run experiments to understand life. However, some Data Science definitions focus on the use of tools like Hadoop.
  • A physicist wouldn’t say they are a physicist because they run simulations of their models; they are a physicist because they seek to understand matter. However, some Data Science definitions focus on activities like modelling, data cleaning and visualization.
  • All these sciences use statistics to design their experiments and test their hypotheses. Yet some Data Science definitions focus on overlaps of statistics with computer science and unicorns.

[su_spacer]

A Definition of Data Science

The secret to defining data science is to focus on the science. Here is a simple definition of Data Science:

Data Science is the application of the scientific method to find opportunities and efficiencies in business data

There are a few things to note about this definition:

  • it’s technology agnostic. It’s not about Big Data, Hadoop or whatever the next technology breakthrough might be.
  • it’s applied to finding opportunities and efficiencies in data. It’s not the study of data – that’s statistics.
  • it’s not about activities that may be part of the lifecycle of working with data.
  • most importantly, it uses the scientific method, “systematic observation, measurement, and experiment, and the formulation, testing, and modification of hypotheses” [1].

The application of the scientific method is central to data science and something I want to come back to in a more detailed post.