In my job, I interview many potential Data Scientists and Data Analysts. I have also managed people with a wide range of experience from interns to seasoned PhDs with degrees in fields including Computer Science, Chemistry, Physics, Mathematics, Engineering and the Humanities. Just last week I had several conversations with prospective Data Scientists who are early in their careers and wondering what projects they should try to get on, what technologies they should learn and what additional courses they should study.
In many cases, where Data Scientists struggle on projects has nothing to do with the technical complexity of problems or any lack of Data Science skills – they have all of that from their study and training and are quite motivated people who are passionate about their field.
In fact, what makes Data Science difficult for many is the complexity of operating in a Data Science project environment. Specifically, a Data Scientist has to operate in an environment that looks like the following.
- Dynamics of data: data will change over the course of most projects. It will be refreshed, added to, replaced and repaired. Manual data sources are a common way of interfacing with other team members outside the Data Science team. Since much Data Science involves bringing together disparate data sources in novel ways, it is rare for all of this data to arrive at the same time and to schedule. So Data Scientists have to cope with trying to design and implement their work on top of a base of data that is always in flux.
- Dynamics of requirements: Data Science is exploratory. You really don’t know what’s in the data until you have worked with it. Typically several algorithms and analyses have to be tried out. The insights from these activities often lead to the project taking a new direction and new analyses being framed for these new requirements.
- Dynamics of people: it is rare to work in isolation. A Data Scientist will typically interact with IT, warehousing, developers, business SMEs, third party data providers and, of course, their team mates and their customer. This means that other people are providing inputs to their data, other people are writing code and creating data sets they depend on and other people are presenting results they may have contributed to. When other team members leave or take vacation, they may be expected to take over work.
- Constraints on time and resources: despite the dynamics above, the Data Scientist will be expected to add value and deliver successfully in limited time and with limited resources. You don’t always get the ideal technology stack or one that you are familiar with. You don’t always get all the skill sets you need on a project. And you don’t always get all the data for a perfect analysis.
If a Data Scientist does not have methods for coping with these dynamics and constraints then they will struggle to perform. Ultimately, they will rarely see the more advanced analytics where they can really add value.
- They become mired in forensics of their own work and their team’s work
- Time is wasted investigation and explaining inconsistencies
- Deliverables must be rewritten because the original cannot be reproduced or cannot be explained
- A team descends into reactively producing analyses rather than leading the project from their data and their deliverables
- Results are plain wrong because of the chaos that arises from project dynamics and constraints
Guerrilla Analytics and its 7 Principles provide a tried and tested operating model for Data Scientists. It has been used in many high pressure, dynamic and constrained project environments to deliver analyses that are reproducible, auditable and explainable.
This Guerrilla Analytics operating model breaks Data Science activities into the following components, highlighting the challenges faced in each component and offering guidelines on how to overcome these challenges.
- Data Extraction: how data is extracted and transported by a team in a traceable manner
- Data Receipt: how data should be received and logged by a team
- Data Load: how to load multiple versions of data into an analytics environment without breaking data provenance
- Coding: how data should be manipulated in ways that promote flexibility, testability, audit and agility. How to structure code and how to mix multiple tools and programming languages without being overwhelmed.
- Work products and Reports: how to produce multiple versions of agile work products and project milestone reports so they can be tracked easily with a customer or fellow team members
- Building consolidated analytics: how to identify and control consolidated understanding, business rules and data sets that emerge over the course of a project to promote efficiency and consistency and to avoid re-inventing the wheel
- Testing: how to test analytics code and data sets in a fast paced environment
- Workflows: simple workflows for peer review and quality control
Operating models may not produce beautiful visualizations or involve high end statistics and machine learning. However they do allow Data Scientists to hit the ground running. They provide Data Scientists with the tools they need to survive real world project environments. This is turn improves the Data Scientist’s coordination with team members, their efficiency, their credibility, and ultimately increases the opportunities to add value.
We expect methodology from traditional laboratory scientists. Let’s expect the same from Data Scientists.