Data Science projects aren’t a nice clean cycle of well-defined stages. More often, they are a slog towards delivery with repeated setbacks. Most steps involve a lot of iteration between your Data Science team and IT, or between your Data Science team and the business. These setbacks are due to disruptions. Recognising this and identifying the cause of these disruptions is the first step in mitigating their impact on your delivery with Guerrilla Analytics.
Doing Data Science work in consulting (both internal and external) is complicated. The reasons have nothing to do with machine learning algorithms, statistics and math, or model sophistication. The causes of this complexity are far more mundane:
- Project requirements change often, especially as data understanding improves.
- Data is poorly understood and contains flaws you have yet to discover, and IT may struggle to create the data extracts you need.
- Your team and the client’s team will have a variety of skills and experience.
- The technology available may not be ideal, because of licensing costs and the client’s IT landscape.
The discussion of Data Science workflows does not sufficiently represent this reality. Most workflow representations are derived from the Cross-Industry Standard Process for Data Mining (CRISP-DM).
Others report variations on CRISP-DM, such as the Communications of the ACM blog post referenced below.
It’s all about disruptions
These workflow representations correctly capture the high-level stages of Data Science, specifically:
- defining the problem,
- acquiring data,
- preparing it,
- doing some analysis, and
- reporting results.
However, a more realistic representation must acknowledge that at pretty much every stage of Data Science, a variety of setbacks or new knowledge can return you to any of the previous stages. You can think of these setbacks and new knowledge as disruptions. They are disruptions because they force you to modify or redo work instead of progressing directly to your goal of delivery. Here are some examples.
- After doing some early analyses, a data profiling exercise reveals that some of your data extract has been truncated. It takes you significant time to check that you did not corrupt the file yourself when loading it. Now you have to go all the way back to source and get another data extract.
- On creating a report, a business user highlights an unusual trend in your numbers. On investigation, you find a small bug in your code. Fixing it changes the contents of your report, which must then be re-issued.
- On presenting some updates to a client, you and the client agree there is no value in the current approach and that a different one must be taken. No new data is required, but you must now shape the data differently to apply a different kind of algorithm and analysis.
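The first example above, checking whether an extract was truncated and whether you corrupted it yourself, can be made much quicker with two cheap checks. What follows is a minimal sketch, not a prescribed method. It assumes the extract is a CSV file with a header row, that you recorded a SHA-256 digest of the file when you received it, and that the data provider told you how many rows to expect. The function names (`file_sha256`, `count_data_rows`, `check_extract`) are hypothetical and chosen only for illustration.

```python
import csv
import hashlib
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def count_data_rows(path: Path) -> int:
    """Count CSV records, excluding the header row (assumed to be present)."""
    with path.open(newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header row
        return sum(1 for _ in reader)


def check_extract(path: Path, digest_at_receipt: str, expected_rows: int) -> list[str]:
    """Return a list of problems found; an empty list means both checks passed."""
    problems = []
    # A changed digest means the file was modified after you received it.
    if file_sha256(path) != digest_at_receipt:
        problems.append("file changed since receipt")
    # A row count mismatch points at truncation at source or in transfer.
    actual = count_data_rows(path)
    if actual != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {actual}")
    return problems
```

If the digest still matches but the row count is short, the file was not damaged after you received it, so the truncation happened before it reached you and the conversation with the data provider can start sooner.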
The list goes on. The point is that Data Science on anything beyond a toy example is a highly iterative process. At every stage, your techniques and approach need to be easy to modify and re-run so that your analyses and code are robust to all of those disruptions.
The Guerrilla Analytics Workflow
Here is what I term the Guerrilla Analytics workflow. You can think of it like the game of Snakes and Ladders where any unlucky move sends you back down the board.
The Guerrilla Analytics workflow considers Data Science as the following stages from source data through to delivery. I’ve also added some examples of typical disruptions at each of these stages.
Data Science workflow stages and typical disruptions:
Extract: taking data from a source system, the web, or front-end system reports
- incorrect data format extracted
- truncated data
- changing requirements mean different data is required
Receive: storing extracted data in the analytics environment and recording appropriate tracking information
- lost data
- file system mess of old data, modified data and raw data
- multiple copies of data files
Load: transferring data from receipt location into an analytics environment
- truncation of data
- no clear link between data source and loaded datasets
Analytics: the data preparation, reshaping, modelling and visualization needed to solve the business problem
- changing requirements
- incorrect choice of analysis or model
- dropping or overwriting records and columns so numbers cannot be explained
Work Products and Reporting: the ad-hoc analyses and formal project deliverables
- changing requirements
- incorrect or damaged data
- code bugs
- incorrect or unsuccessful analysis
This is just a sample of the disruptions that I have experienced in my projects. I’m sure you have more to add, and it would be great to hear them.
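As an illustration of guarding against one of the Analytics disruptions listed above (dropping records so that numbers cannot be explained), here is a small sketch that logs record counts before and after every filtering step. It is one possible approach under simple assumptions: records are held as a list of dictionaries in memory, and the `RecordAudit` class and its methods are hypothetical names invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class RecordAudit:
    """Tracks how many records enter and leave each named filtering step."""
    steps: list[tuple[str, int, int]] = field(default_factory=list)

    def apply(self, name, records, keep):
        """Keep the records for which keep(record) is true, logging the counts."""
        kept = [record for record in records if keep(record)]
        self.steps.append((name, len(records), len(kept)))
        return kept

    def report(self):
        """Return one line per step describing how many records were dropped."""
        return [
            f"{name}: {before} in, {after} out, {before - after} dropped"
            for name, before, after in self.steps
        ]


# Usage with made-up sample records
audit = RecordAudit()
rows = [
    {"id": 1, "amount": 10},
    {"id": 2, "amount": None},
    {"id": 3, "amount": -5},
]
rows = audit.apply("remove missing amounts", rows, lambda r: r["amount"] is not None)
rows = audit.apply("remove negative amounts", rows, lambda r: r["amount"] >= 0)
print("\n".join(audit.report()))
```

When a business user later asks why a total is lower than expected, the report shows which step removed records and how many, instead of leaving you to reconstruct that from the code.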
You can learn about disruptions and the practice tips for making your Data Science robust to disruptions in my book Guerrilla Analytics: A Practical Approach to Working with Data.
Wikipedia, https://en.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining, accessed 2015-02-14
Communications of the ACM Blog, http://cacm.acm.org/blogs/blog-cacm/169199-data-science-workflow-overview-and-challenges/fulltext