I was invited to speak at Predictive Analytics World 2015 in London on October 28th 2015.
My talk covered how the 7 Guerrilla Analytics Principles are the foundation for doing Agile Data Science. With a Data Science Operating Model that follows these principles, your team always knows where its data came from, who changed it and why, and can explain any of the highly iterative explorations and analyses its customers require.
You can find the slides below and at Slideshare. As always, feedback and questions are welcome. Enjoy!
Are you a data scientist working on a project with constantly changing requirements, flawed changing data and other disruptions? Guerrilla Analytics can help.
The key to a high performing Guerrilla Analytics team is its ability to recognise common data preparation patterns and quickly implement them in flexible, defensive data sets.
After this webinar, you’ll be able to get your team off the ground fast and begin demonstrating value to your stakeholders.
You will learn about:
* Guerrilla Analytics: a brief introduction to what it is and why you need it for your agile data science ambitions
* Data Science Patterns: what they are and how they enable agile data science
* Case study: a walkthrough of some common patterns in use in real projects
I recently gave a webinar on Data Science Patterns. The slides are here.
Data Science Patterns, as with Software Engineering Patterns, are ‘common solutions to recurring problems’. I was inspired to put this webinar together based on a few things.
- I build Data Science teams. Repeatedly, I find teams working inconsistently in terms of the data preparation approaches, structures and conventions they use. Patterns help resolve this problem. Without patterns, you end up with code maintenance challenges, difficulty in supporting junior team members and all-round team inefficiency due to a completely ad-hoc approach to data preparation.
- I read a recent paper, ‘Tidy Data’ by Hadley Wickham, in the Journal of Statistical Software http://vita.had.co.nz/papers/tidy-data.pdf. This paper gives an excellent, clear description of what ‘tidy data’ is – the data format used by most Data Science algorithms and visualizations. While there isn’t anything new here if you have a computer science background, Wickham’s paper is an easy read and has some really clear worked examples.
- My book, Guerrilla Analytics (here for USA or here for UK), has an entire appendix on data manipulation patterns and I wanted to share some of that thinking with the Data Science community.
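To make the ‘tidy data’ idea a little more concrete, here is a minimal sketch of one common data preparation pattern: reshaping a wide table into a tidy one with one observation per row. The `melt` helper, the column names and the sample values are my own illustrative assumptions, not taken from Wickham’s paper or from the book.

```python
from typing import Any, Dict, List


def melt(rows: List[Dict[str, Any]], id_field: str,
         var_name: str, value_name: str) -> List[Dict[str, Any]]:
    """Reshape wide rows into tidy (long) rows: one observation per row."""
    tidy = []
    for row in rows:
        for column, value in row.items():
            if column == id_field:
                continue  # the identifier is repeated on every tidy row
            tidy.append({id_field: row[id_field],
                         var_name: column,
                         value_name: value})
    return tidy


# Hypothetical wide data: one column per treatment.
wide = [
    {"person": "A", "treatment_a": 5, "treatment_b": 2},
    {"person": "B", "treatment_a": 16, "treatment_b": 11},
]

# Tidy form: each row is a single (person, treatment, result) observation.
tidy = melt(wide, id_field="person", var_name="treatment", value_name="result")
for record in tidy:
    print(record)
```

Once a team agrees on a small helper like this, everyone prepares data for an algorithm or a chart in the same way, which is the point of a pattern.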
I hope you enjoy the webinar and find it useful. You can hear the recording here. Do get in touch with your thoughts and comments as I think Data Science patterns is a huge area of potential improvement in the maturity of Data Science as a field.
In my job, I interview many potential Data Scientists and Data Analysts. I have also managed people with a wide range of experience from interns to seasoned PhDs with degrees in fields including Computer Science, Chemistry, Physics, Mathematics, Engineering and the Humanities. Just last week I had several conversations with prospective Data Scientists who are early in their careers and wondering what projects they should try to get on, what technologies they should learn and what additional courses they should study.
In many cases, where Data Scientists struggle on projects has nothing to do with the technical complexity of problems or any lack of Data Science skills – they have all of that from their study and training and are quite motivated people who are passionate about their field.
In fact, what makes Data Science difficult for many is the complexity of operating in a Data Science project environment. Specifically, a Data Scientist has to operate in an environment that looks like the following.
- Dynamics of data: data will change over the course of most projects. It will be refreshed, added to, replaced and repaired. Manual data sources are a common way of interfacing with other team members outside the Data Science team. Since much Data Science involves bringing together disparate data sources in novel ways, it is rare for all of this data to arrive at the same time and to schedule. So Data Scientists have to cope with trying to design and implement their work on top of a base of data that is always in flux.
- Dynamics of requirements: Data Science is exploratory. You really don’t know what’s in the data until you have worked with it. Typically several algorithms and analyses have to be tried out. The insights from these activities often lead to the project taking a new direction and new analyses being framed for these new requirements.
- Dynamics of people: it is rare to work in isolation. A Data Scientist will typically interact with IT, warehousing, developers, business SMEs, third party data providers and, of course, their team mates and their customer. This means that other people are providing inputs to their data, other people are writing code and creating data sets they depend on and other people are presenting results they may have contributed to. When team members leave or take vacation, the Data Scientist may be expected to take over their work.
- Constraints on time and resources: despite the dynamics above, the Data Scientist will be expected to add value and deliver successfully in limited time and with limited resources. You don’t always get the ideal technology stack or one that you are familiar with. You don’t always get all the skill sets you need on a project. And you don’t always get all the data for a perfect analysis.
If a Data Scientist does not have methods for coping with these dynamics and constraints then they will struggle to perform. Ultimately, they will rarely reach the more advanced analytics where they can really add value. Instead:
- They become mired in forensics of their own work and their team’s work
- Time is wasted investigating and explaining inconsistencies
- Deliverables must be rewritten because the original cannot be reproduced or cannot be explained
- A team descends into reactively producing analyses rather than leading the project from their data and their deliverables
- Results are plain wrong because of the chaos that arises from project dynamics and constraints
Guerrilla Analytics and its 7 Principles provide a tried and tested operating model for Data Scientists. It has been used in many high pressure, dynamic and constrained project environments to deliver analyses that are reproducible, auditable and explainable.
This Guerrilla Analytics operating model breaks Data Science activities into the following components, highlighting the challenges faced in each component and offering guidelines on how to overcome these challenges.
- Data Extraction: how data is extracted and transported by a team in a traceable manner
- Data Receipt: how data should be received and logged by a team
- Data Load: how to load multiple versions of data into an analytics environment without breaking data provenance
- Coding: how data should be manipulated in ways that promote flexibility, testability, audit and agility. How to structure code and how to mix multiple tools and programming languages without being overwhelmed.
- Work products and Reports: how to produce multiple versions of agile work products and project milestone reports so they can be tracked easily with a customer or fellow team members
- Building consolidated analytics: how to identify and control consolidated understanding, business rules and data sets that emerge over the course of a project to promote efficiency and consistency and to avoid re-inventing the wheel
- Testing: how to test analytics code and data sets in a fast paced environment
- Workflows: simple workflows for peer review and quality control
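As a rough illustration of the Data Load and Testing components above, the sketch below loads each data delivery into its own versioned table and records it in a load log, so an earlier version is never overwritten. Everything here is an assumption made for illustration: the `load_delivery` function, the `load_log` table, the `source_vNNN` naming convention and the sample data are hypothetical and are not the book’s prescribed implementation. It also assumes table and column names come from a trusted source.

```python
import csv
import hashlib
import io
import sqlite3


def load_delivery(conn: sqlite3.Connection, source: str,
                  version: int, csv_text: str) -> str:
    """Load one delivery into its own versioned table and log its provenance."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS load_log "
        "(source TEXT, version INTEGER, table_name TEXT, "
        "sha256 TEXT, row_count INTEGER)"
    )
    records = list(csv.reader(io.StringIO(csv_text)))
    header, rows = records[0], records[1:]

    # Hypothetical convention: one table per delivery, e.g. customers_v001.
    table = f"{source}_v{version:03d}"
    columns = ", ".join(f'"{name}" TEXT' for name in header)
    # No IF NOT EXISTS here: reloading an existing version raises an error
    # instead of silently overwriting data that analyses may depend on.
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    placeholders = ", ".join("?" for _ in header)
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)

    # The hash ties the loaded table back to the exact data received.
    digest = hashlib.sha256(csv_text.encode("utf-8")).hexdigest()
    conn.execute("INSERT INTO load_log VALUES (?, ?, ?, ?, ?)",
                 (source, version, table, digest, len(rows)))
    return table


conn = sqlite3.connect(":memory:")
v1 = load_delivery(conn, "customers", 1, "id,name\n1,Ann\n2,Bob\n")
v2 = load_delivery(conn, "customers", 2, "id,name\n1,Ann\n2,Bob\n3,Cy\n")

# A simple data test: the loaded row count must match the logged row count.
for table_name, logged in conn.execute(
        "SELECT table_name, row_count FROM load_log").fetchall():
    loaded = conn.execute(f'SELECT COUNT(*) FROM "{table_name}"').fetchone()[0]
    assert loaded == logged, f"{table_name}: {loaded} rows, expected {logged}"
```

Because both versions stay in the environment side by side, any work product can state which version of the data it was built on, and a refresh of the data does not invalidate earlier results.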
Operating models may not produce beautiful visualizations or involve high end statistics and machine learning. However, they do allow Data Scientists to hit the ground running. They provide Data Scientists with the tools they need to survive real world project environments. This in turn improves the Data Scientist’s coordination with team members, their efficiency and their credibility, and ultimately increases the opportunities to add value.
We expect methodology from traditional laboratory scientists. Let’s expect the same from Data Scientists.