Guerrilla Analytics: Tactics for Coping with Data Science Reality


Here are the slides from a talk I gave today to the Information Technology Department at the National University of Ireland, Galway. Thanks to Michael Madden for the opportunity to speak.

The talk was about how Guerrilla Analytics principles and practice tips help you do Data Science in circumstances that are very dynamic and constrained, yet still require traceability of what you do.

There were plenty of questions afterwards, which is always encouraging. I’ll try to address them in subsequent blog posts, so please do follow me @enda_ridge for all the latest posts.

Here are some of the questions from today.

  • what are the key skills to focus on if you want to work in data analytics / data science?
  • is programming ability a prerequisite for doing data science? This question came up before at Newcastle University.
  • do the guerrilla analytics principles map to research projects?
  • do the guerrilla analytics principles map to ‘big data’ projects?

Since NUI Galway is a bilingual university, you can find my broken Gaelic version below!

As Gaeilge

Seo h-iad na sleamhnáin ó léacht a bhí agam inniú sa Roinn Teicneolaíocht Fáisnéise in Ollscoil na h-Éireann, Gaillimh. Buíochas le Michael Madden as an deis labhairt.

Bhain an léacht le cén chaoi is féidir leis na  prionsabail agus noda Guerrilla Analytics cabhair leat agus tú ag déanamh Data Science i ndálaí atá dinimic, srianta ach fós tá sé riachtanach go bhfuil inrianaitheacht ann.

Bhí mórán ceisteanna tar éis an léacht agus is maith an rud é. Freagróidh mé iad i mblag eile agus bígí cinnte mé a leanacht ag @enda_ridge don scéal is déanaí.

Seo h-iad roinnt de na ceisteanna.

  • céard iad na scilleanna is tábhachtaí agus tú ag iarraidh obair mar data scientist?
  • an gá duit bheith in ann ríomhchlárú le h-aghaidh obair mar data scientist?
  • an bhfuil baint ann idir na prionsabail agus tionscadail taighde?
  • an bhfuil baint ann idir na prionsabail agus ‘Big Data’?

Data Science Workflows – A Reality Check


Data Science projects aren’t a nice clean cycle of well-defined stages. More often, they are a slog towards delivery with repeated setbacks, and most steps involve a lot of back and forth between your Data Science team and IT, or between your Data Science team and the business. These setbacks are due to disruptions. Recognising this and identifying the cause of these disruptions is the first step in mitigating their impact on your delivery with Guerrilla Analytics.

The Situation

Doing Data Science work in consulting (both internal and external) is complicated. This is for a number of reasons that have nothing to do with machine learning algorithms, statistics and math, or model sophistication. The cause of this complexity is far more mundane.

  • Project requirements change often, especially as data understanding improves.
  • Data is poorly understood and contains flaws you have yet to discover, and IT struggle to create the data extracts you need.
  • Your team and the client’s team will have a variety of skills and experience.
  • The technology available may not be ideal because of licensing costs and the client’s IT landscape.

The discussion of Data Science workflows does not sufficiently represent this reality. Most workflow representations are derived from the Cross-Industry Standard Process for Data Mining (CRISP-DM) [1].

[Figure: CRISP-DM process diagram]

Others report variations on CRISP-DM such as the blog post referenced below [2].

[Figure: workflow overview from the blog post in [2]]

It’s all about disruptions

These workflow representations correctly capture the high-level stages of Data Science, specifically:

  • defining the problem,
  • acquiring data,
  • preparing it,
  • doing some analysis and
  • reporting results

However, a more realistic representation must acknowledge that at pretty much every stage of Data Science, a variety of setbacks or new knowledge can return you to any of the previous stages. You can think of these setbacks and new knowledge as disruptions. They are disruptions because they force you to modify or redo work instead of progressing directly to your goal of delivery. Here are some examples.

  • After doing some early analyses, a data profiling exercise reveals that some of your data extract has been truncated. It takes you significant time to check that you did not corrupt the file yourself when loading it. Now you have to go all the way back to source and get another data extract.
  • On creating a report, a business user highlights an unusual trend in your numbers. On investigation, you find a small bug in your code that, when repaired, changes the contents of your report, which must then be re-issued.
  • On presenting some updates to a client, you agree together that there is no value in the current approach and that a different one must be taken. No new data is required, but you must now shape the data differently to apply a different kind of algorithm and analysis.

The list goes on. The point here is that Data Science on anything beyond a toy example is going to be a highly iterative process where at every stage, your techniques and approach need to be easily modified and re-run so that your analyses and code are robust to all of those disruptions.
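To make the first example above concrete, here is a minimal Python sketch of a check you could run when an extract arrives, before any analysis starts. It assumes the data provider can tell you the expected row count and, ideally, a SHA-256 checksum of the file; the function names (file_sha256, count_data_rows, check_extract) are made up for illustration and are not from any particular library. It also assumes a delimited text file with one record per line.

```python
import hashlib
from pathlib import Path
from typing import List, Optional


def file_sha256(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def count_data_rows(path: Path, has_header: bool = True) -> int:
    """Count lines in a delimited text file, excluding the header line.

    Assumption: one record per line (no embedded newlines inside fields).
    """
    with path.open("rb") as handle:
        lines = sum(1 for _ in handle)
    return max(lines - 1, 0) if has_header else lines


def check_extract(
    path: Path,
    expected_rows: int,
    expected_sha256: Optional[str] = None,
) -> List[str]:
    """Compare a received extract against what the provider said they sent.

    Returns a list of problems; an empty list means the checks passed.
    """
    problems = []
    rows = count_data_rows(path)
    if rows != expected_rows:
        problems.append(f"row count {rows} != expected {expected_rows}")
    if expected_sha256 is not None and file_sha256(path) != expected_sha256:
        problems.append("checksum mismatch")
    return problems
```

Running the same check again after loading helps separate the two possibilities in that example: if the received file already fails the check, the truncation happened at source or in transit rather than in your own load step.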

The Guerrilla Analytics Workflow

Here is what I term the Guerrilla Analytics workflow. You can think of it like the game of Snakes and Ladders where any unlucky move sends you back down the board.

[Figure: the Guerrilla Analytics workflow]

The Guerrilla Analytics workflow considers Data Science as the following stages from source data through to delivery. I’ve also added some examples of typical disruptions at each of these stages.

Data Science Workflow stages and example disruptions
Extract: taking data from a source system, the web, front end system reports
  • incorrect data format extracted
  • truncated data
  • changing requirements mean different data is required
Receive: storing extracted data in the analytics environment and recording appropriate tracking information
  • lost data
  • file system mess of old data, modified data and raw data
  • multiple copies of data files
Load: transferring data from receipt location into an analytics environment
  • truncation of data
  • no clear link between data source and loaded datasets
Analytics: the data preparation, reshaping, modelling and visualization needed to solve the business problem
  • changing requirements
  • incorrect choice of analysis or model
  • dropping or overwriting records and columns so numbers cannot be explained
Work Products and Reporting: the ad-hoc analyses and formal project deliverables
  • changing requirements
  • incorrect or damaged data
  • code bugs
  • incorrect or unsuccessful analysis

This is just a sample of the disruptions that I have experienced in my projects. I’m sure you have more to add too and it would be great to hear them.
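Several of the Receive and Load disruptions listed above (multiple copies of data files, no clear link between data source and loaded datasets) come down to missing tracking information. As a rough sketch only, here is one way to keep a simple data log in Python. The CSV layout, the field names, the D0001-style identifiers and the log_receipt function are all assumptions made for this illustration, not a prescribed format.

```python
import csv
import hashlib
from datetime import datetime, timezone
from pathlib import Path
from typing import Dict, List, Tuple

# Hypothetical log layout: one row per file received.
LOG_FIELDS = ["data_uid", "received_utc", "source", "file_name", "sha256"]


def read_log(log_path: Path) -> List[Dict[str, str]]:
    """Return all existing log rows, or an empty list if there is no log yet."""
    if not log_path.exists():
        return []
    with log_path.open(newline="") as handle:
        return list(csv.DictReader(handle))


def log_receipt(
    log_path: Path, data_file: Path, source: str
) -> Tuple[Dict[str, str], List[str]]:
    """Append a row for a received file and report earlier identical files.

    Returns the new log entry and the data UIDs of any previously logged
    files with the same content hash (likely duplicate copies).
    """
    existing = read_log(log_path)
    # Reads the whole file into memory; fine for a sketch, chunk for big files.
    digest = hashlib.sha256(data_file.read_bytes()).hexdigest()
    duplicates = [row["data_uid"] for row in existing if row["sha256"] == digest]
    entry = {
        "data_uid": f"D{len(existing) + 1:04d}",
        "received_utc": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "source": source,
        "file_name": data_file.name,
        "sha256": digest,
    }
    is_new_log = not log_path.exists() or log_path.stat().st_size == 0
    with log_path.open("a", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=LOG_FIELDS)
        if is_new_log:
            writer.writeheader()
        writer.writerow(entry)
    return entry, duplicates
```

If the data UID is then carried into the name of the loaded dataset, each dataset in the analytics environment can be traced back to the file and source it came from.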

Further Reading

You can learn about disruptions and the practice tips for making your Data Science robust to disruptions in my book Guerrilla Analytics: A Practical Approach to Working with Data.

References

[1] Wikipedia, Cross Industry Standard Process for Data Mining, https://en.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining, accessed 2015-02-14

[2] Communications of the ACM Blog, Data Science Workflow: Overview and Challenges, http://cacm.acm.org/blogs/blog-cacm/169199-data-science-workflow-overview-and-challenges/fulltext