These are the 7 Guerrilla Analytics Principles.
I designed the principles to help avoid the chaos introduced by the dynamics, complexity and constraints of data projects. You will find the principles helpful if you work in Data Science, Data Mining, Statistical Analysis, Machine Learning or any field that uses these techniques.
The Guerrilla Analytics Principles have been applied successfully to many high profile and high pressure projects in domains including Financial Services, Identity and Access Management, Audit, Fraud, Customer Analytics and Forensics.
You can read more about the Guerrilla Analytics Principles in my book Guerrilla Analytics: A Practical Approach to Working with Data. There you will find almost 100 practice tips from across the Data Science life cycle, showing you how to implement the 7 principles in real-world projects.
- Principle 1: Space is cheap, confusion is expensive
- Principle 2: Prefer simple, visual project structures and conventions
- Principle 3: Prefer automation with program code
- Principle 4: Maintain a link between data on the file system, data in the analytics environment, and data in work products
- Principle 5: Version control changes to data and analytics code
- Principle 6: Consolidate team knowledge in version-controlled builds
- Principle 7: Prefer analytics code that runs from start to finish
Principle 1: Space is cheap, confusion is expensive
Data takes up storage space and costs money. Yet it is surprising how often teams scrimp on trivial amounts of storage. Teams can lose hours of project time because they did not back up raw data, cannot trace the source of a work product, or have overwritten an older version of a work product.
Overwriting in particular causes confusion when multiple versions are “in the wild” with the customer and cannot be reproduced.
When faced with storage considerations for data and work products, always think about the impact on loss of data provenance. Ultimately, not being able to reproduce work or explain evolving data understanding is more expensive for the team than trivial amounts of storage space.
Principle 2: Prefer simple, visual project structures and conventions
If project folder structures and analytics conventions are complex, then busy data scientists will not have time to understand and follow them.
New team members struggle to roll onto the project, existing team members cannot “find things,” and everyone becomes overwhelmed when deciding where to “put things.”
The more visual and simple a project structure and convention is, the easier it is to understand and follow.
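As a minimal sketch of such a convention (the folder names and numbering below are illustrative assumptions, not a prescription), a project can get a long way with one folder per data delivery, one folder per work product, and a shared code folder:

```python
from pathlib import Path

# Hypothetical minimal project layout: one folder per data delivery,
# one folder per work product, plus shared analytics code.
PROJECT_ROOT = Path("project_x")

folders = [
    "data/20240105_crm_extract",       # each data delivery in its own dated folder
    "data/20240112_web_logs",
    "wp/001_customer_churn_summary",   # each work product gets a unique number
    "wp/002_fraud_flag_review",
    "code",                            # shared, version-controlled analytics code
]

for folder in folders:
    (PROJECT_ROOT / folder).mkdir(parents=True, exist_ok=True)
```

A new team member can see at a glance where data lands and where finished work products live, without having to read a lengthy conventions document.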
Principle 3: Prefer automation with program code
Program code explicitly describes which data is being used, how it is changed, and the correctness checks that are done. Many analytics tools, by contrast, use a graphical interface rather than program code to manipulate data.
Use of graphical tools has implications for data provenance. Think about data manipulated in a spreadsheet. A typical workflow could involve copying data to a new worksheet, dragging a formula down a column to apply it to some cells, and then pivoting the final worksheet. These manual operations of copying and dragging are not captured anywhere in the spreadsheet and so are difficult to reproduce.
As far as possible, analytics work should be done using program code for ease of understanding, automation and reproduction.
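As an illustration, here is a hedged sketch of the same spreadsheet workflow expressed in pandas; the column names and the “formula” are invented for the example, but every step is written down and can be rerun:

```python
import pandas as pd

# Hypothetical sales data standing in for the original worksheet.
sales = pd.DataFrame({
    "region": ["North", "North", "South", "South"],
    "month": ["Jan", "Feb", "Jan", "Feb"],
    "units": [120, 95, 80, 130],
    "unit_price": [9.99, 9.99, 11.50, 11.50],
})

# "Copy to a new worksheet" becomes an explicit copy of the DataFrame.
working = sales.copy()

# "Drag a formula down a column" becomes a vectorised calculation.
working["revenue"] = working["units"] * working["unit_price"]

# "Pivot the final worksheet" becomes a pivot_table call.
summary = working.pivot_table(index="region", columns="month",
                              values="revenue", aggfunc="sum")
print(summary)
```

Because each manipulation is an explicit line of code, the provenance of the final summary can be reviewed and reproduced exactly.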
Principle 4: Maintain a link between data on the file system, data in the analytics environment, and data in work products
When aiming to preserve data provenance, it is helpful to think of data flowing through your team and your analytics environment. Imagine a data point arrives with your team (perhaps accompanied by several million other data points). The data is stored on a file system before being imported into the Data Manipulation Environment. The data is then manipulated in several ways such as cleaning, transformation, and combination with other data points. The data then leaves the team as one of the team’s work products. At each of these touch points, the data is changed by one or more team members and one or more processes.
Problems arise when a work product’s data is incorrect or an analysis is questioned by the customer. Errors could have been introduced anywhere between raw data and final work product.
Strive to maintain the traceability of data on its journey from provider, through analytics to final work product.
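One lightweight way to keep that traceability is a data log that records, for each raw file, a checksum and the name it takes on inside the analytics environment, so that work products can cite it unambiguously. The sketch below is an assumed format for such a log, not a prescribed one; the paths and table names are hypothetical.

```python
import csv
import hashlib
from datetime import date
from pathlib import Path

def log_data_receipt(raw_file: str, table_name: str,
                     log_path: str = "data_log.csv") -> None:
    """Record where a raw file landed and the table it became in the
    analytics environment, so work products can reference it later."""
    digest = hashlib.sha256(Path(raw_file).read_bytes()).hexdigest()[:12]
    write_header = not Path(log_path).exists()
    with open(log_path, "a", newline="") as f:
        writer = csv.writer(f)
        if write_header:
            writer.writerow(["received", "raw_file", "sha256", "table_name"])
        writer.writerow([date.today().isoformat(), raw_file, digest, table_name])

# Hypothetical usage: the CRM extract on the file system maps to a named
# table in the analytics environment; work products then cite that name.
# log_data_receipt("data/20240105_crm_extract/customers.csv",
#                  "crm_customers_20240105")
```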
Principle 5: Version control changes to data and analytics code
The provenance of analytics work is easily broken. For example, a change in raw data, code or algorithm inputs would cause an analytics work product to change. This causes confusion in a highly iterative environment where the customer may have received earlier versions of work products or older versions need to be revisited.
Control changes to factors affecting data provenance by putting in place a version control system for data and analytics code.
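As an illustrative sketch (assuming the analytics code lives in a git repository), a work product script can also stamp its output with the exact code version that produced it; the output file name here is hypothetical.

```python
import subprocess

def current_code_version() -> str:
    """Return the short git commit hash of the analytics code, if available."""
    try:
        return subprocess.check_output(
            ["git", "rev-parse", "--short", "HEAD"], text=True
        ).strip()
    except (subprocess.CalledProcessError, FileNotFoundError):
        return "unversioned"

# Stamp the run so the work product can be tied back to the exact code version.
with open("run_metadata.txt", "w") as f:
    f.write(f"code_version: {current_code_version()}\n")
```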
Principle 6: Consolidate team knowledge in version-controlled builds
Understanding of the data evolves during a project. Team members will discover artefacts like duplicates. The customer may specify particular business rules.
Builds are convenience data structures that provide a Guerrilla Analytics team with centralized, tested data sets capturing this knowledge. With builds in place, work products are consistent because they draw on the same agreed rules, and the team is more efficient because nobody re-invents the wheel.
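A minimal sketch of a build in Python (the cleaning steps, business rule, and file paths below are invented for illustration): a build script applies the team’s agreed rules to raw data, runs basic tests, and writes a versioned data set that everyone works from.

```python
import pandas as pd

BUILD_VERSION = "v003"  # bumped whenever rules or inputs change

def build_customers(raw: pd.DataFrame) -> pd.DataFrame:
    """Apply the team's agreed cleaning and business rules."""
    customers = raw.copy()
    customers["email"] = customers["email"].str.lower().str.strip()
    customers = customers.drop_duplicates(subset="customer_id")    # known artefact
    customers = customers[customers["status"] != "TEST_ACCOUNT"]   # customer rule
    return customers

def test_build(customers: pd.DataFrame) -> None:
    """Basic checks capturing what the team has learned about the data."""
    assert customers["customer_id"].is_unique
    assert customers["email"].notna().all()

if __name__ == "__main__":
    raw = pd.read_csv("data/20240105_crm_extract/customers.csv")
    customers = build_customers(raw)
    test_build(customers)
    customers.to_csv(f"builds/customers_{BUILD_VERSION}.csv", index=False)
```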
Principle 7: Prefer analytics code that runs from start to finish
Analytics code by its nature is exploratory and iterative. A data scientist will summarize and profile data and apply many transformations to it as they seek to understand the data.
If not managed carefully, this type of work can lead to poor coding practices. Code snippets written to profile data and test hypotheses linger in final work product code. These snippets clutter code files and very often break code execution. Reproducing a work product then becomes a time-consuming manual execution of code that has to step around the dead snippets, and reviews take longer than necessary.
Deliver analytics code that executes from start to finish to eliminate these problems while still accommodating the need for data exploration and iterative model development.
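As a sketch of what “runs from start to finish” can look like (the structure, column names, and paths are illustrative assumptions): exploratory profiling stays out of the main path, and the work product is produced by a single entry point that executes every step in order.

```python
import pandas as pd

def load(path: str) -> pd.DataFrame:
    return pd.read_csv(path)

def clean(df: pd.DataFrame) -> pd.DataFrame:
    return df.dropna(subset=["customer_id"]).drop_duplicates()

def transform(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df["tenure_years"] = df["tenure_days"] / 365.25
    return df

def produce_work_product(df: pd.DataFrame, out_path: str) -> None:
    df.to_csv(out_path, index=False)

def main() -> None:
    # Every step needed to reproduce the work product, in order, with no
    # profiling or hypothesis-testing snippets left in the execution path.
    df = load("data/20240105_crm_extract/customers.csv")
    df = clean(df)
    df = transform(df)
    produce_work_product(df, "wp/003_tenure_summary/tenure.csv")

if __name__ == "__main__":
    main()
```

Exploratory snippets can still live in separate scratch scripts or notebooks; they simply never form part of the delivery path.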