Building a Data Science Capability: Inspirational Keynote at Data Leaders Summit Europe

I’ve just delivered the inspirational keynote at Data Leaders Summit Europe, 2018. Lots of great engagement and feedback. In particular, it seems people liked a clear definition of what data science actually is and the practical steps (and missteps) I took in building a capability at Sainsbury’s.

13 Steps to Better Data Science: A Joel Test of Data Science Maturity

Data Science teams have different levels of maturity in terms of their ways of working. In the worst case, every team member works as an individual. Results are poorly explained and impossible to reproduce. In the best case, teams reach full scientific reproducibility with simple conventions and little overhead. This leads to efficiency and confidence in results and minimal friction in productionising models. It is important to be able to measure a team’s maturity so that you can improve your ways of working and so you can attract and retain great talent. This series of questions is a Joel Test of Data Science Maturity. As with Joel’s original test for software development, all questions are a simple Yes/No and a score below 10 is cause for concern. Depressingly, many teams seem to struggle around a 3.

Data Science – A Definition And How To Get Started

Confusion, hype, failure to start. Data Science has huge potential to change an organisation. But many organisations become mired in the associated cultural, technological and people change. Data Science is delivered as an interesting report rather than a driver of change. Data Science identifies algorithms that are run in the safety of a lab but never make it into production.

This is my keynote talk from the Polish Data Science with Business Conference.

10 Data Science Capabilities Your Team Needs (and the Tools to Support Them)

People want some guidance in what can be a very overwhelming and fast moving field. Managers want to know what to buy and where to invest training. Junior data scientists, students, dev ops engineers and system administrators want to know what to learn. I will focus on the important capabilities for a Data Science team and the tools I have found useful for enabling those capabilities.

I’ve recently had several people ask me about tools for Data Science both online and at conference talks. This post lists some of the tools I use and the capabilities they provide.

When writing Guerrilla Analytics: A Practical Approach to Working with Data I deliberately avoided mention of tools. People can be dogmatic about tools and I thought this would be a distraction from the book’s core message around principles for doing effective Data Science in dynamic real-world projects.

That said, people do want some guidance in what can be a very overwhelming and fast moving field, so the sections below cover the capabilities I consider important and the tools I have found useful for enabling them.

1. Version control with Git and git-flow

Capability: Typically you will go through many iterations of your code and the work products your code produces. It quickly becomes impossible to track changes and reproduce earlier work without some code version control tool. This is only exacerbated when your team size is >1.

Tool: Git is a great version control system and the effort to learn its command line interface is a very worthwhile investment.

Git is incredibly flexible. However this can lead to confusion and inconsistency in how it is applied. Git-flow is a set of scripts that automate much of what you will need to do in Git subject to a particular convention that happens to be very helpful for Data Science.

2. Wrangling and persisting data with PostgreSQL

Capability: Even if your data is small enough to fit in memory, reproducing work will involve running all those scripts into memory before you can pick up where you left off. Other team members have to do the same. This is painful and inefficient. You therefore need to persist your work (raw data, intermediate datasets and work products).

Tool: A database gives you a way to persist your workings and intermediate datasets as well as share with team members. Pick a database that is performant and flexible. I use PostgreSQL. It has an amazing set of features and this flexibility is what you want when doing Data Science.
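
As a minimal sketch of persisting an intermediate dataset so that a teammate can pick it up without re-running earlier scripts, the example below uses Python's built-in sqlite3 module purely as a stand-in for PostgreSQL. The table name, columns and sample rows are hypothetical and only there for illustration.

```python
import sqlite3


def persist_rows(conn, table, rows):
    """Store an intermediate dataset so later scripts can start from it."""
    # Table and column names are hypothetical; adapt them to your own data.
    conn.execute(f"DROP TABLE IF EXISTS {table}")
    conn.execute(f"CREATE TABLE {table} (customer_id INTEGER, total_spend REAL)")
    conn.executemany(f"INSERT INTO {table} VALUES (?, ?)", rows)
    conn.commit()


def load_rows(conn, table):
    """Read the persisted dataset back in a stable order."""
    query = f"SELECT customer_id, total_spend FROM {table} ORDER BY customer_id"
    return conn.execute(query).fetchall()


# An in-memory database keeps the sketch self-contained; a shared team
# database would use a server connection instead.
conn = sqlite3.connect(":memory:")
persist_rows(conn, "customer_spend", [(2, 40.0), (1, 12.5)])
print(load_rows(conn, "customer_spend"))
```

The point is the pattern rather than the engine: once the intermediate table exists, anyone on the team can start from it instead of rebuilding it.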

3. Wrangling and visualizing data with Pandas, Matplotlib and Seaborn

Capability: Getting your head around your data and preparing it for a variety of algorithms is probably the most time-consuming and important part of the Data Science life cycle. Some preparations are easier done outside of many databases e.g. some natural language processing. Visualizing the data is really important here too.

Tool: Pick a programming language that has great data reshaping and visualization capabilities. If you work in Python, Pandas is a powerful set of data structures and algorithms for wrangling. Seaborn and Matplotlib are good places to start for visualization. And don’t waste time trying to get all these things to work together. Just use Continuum’s excellent distribution Anaconda.

Read: Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython

4. Documentation with Markdown

Capability: Data Science is useless without communication (to your customer and within your team). You could just write a report as a Word document. There’s nothing wrong with that and it’s a format your business customers will expect. However, it would be great to have documentation that is easy to version control and can be kept close to your project code.

Tool: Markdown is a nice platform-neutral way to document your project. Because it’s plain text it’s easy to version control (see above). And if your report isn’t too complicated you can convert it to Word from Markdown. Win.

5. Fast file manipulation, cleaning, and summarising at the command line

Capability: You get hundreds of data files. You get huge files in strange formats with broken delimiters. You want to chop these up, patch them together, change their encodings, unravel XML etc etc. No, trying to open the file in a text editor or spreadsheet is not the answer.

Tool: This is best done at a powerful command line. Linux is worth learning.

Read: Data Science at the Command Line: Facing the Future with Time-Tested Tools

6. Story telling with Jupyter Notebooks

Capability: Data Science is difficult to communicate. It’s often a slightly meandering journey with dead ends, back-tracking, unexpected insights leading to new research avenues etc. When updating your customer, you need to walk them through some of this journey using narratives interleaved with graphics and tabular data. Code files won’t do. Duplicating into Powerpoint is a lot of extra work for a quick interim update.

Tool: Jupyter allows all of the above in presentation quality. The close interleaving of analysis and documentation helps other team members join a project. And it reduces duplication when you decide it’s time to stop coding and start updating your customer.

7. Build automation with Luigi

Capability: Eventually, your understanding and your code start to consolidate. There are some core datasets. They go through some agreed preparatory steps. There are some reports and algorithm datasets that you want to lock down and reproduce several times during the project. Manually running all those code files is a pain.

Tool: Build automation tools allow you to automate tasks such as executing code files, creating documentation, importing and exporting data etc. I’ve used command line scripts (see above) and software build tools like Ant for this automation. More sophisticated tools like Luigi are now reaching a level of maturity where you could consider them for your team too.

8. Workflow tracking with JIRA

Capability: What the hell is everybody doing? Where did that data come from? Where is the conversation with the system SME that led to that business rule? Where is the deliverable from 2 weeks ago and who sent it to which customer?

Tool: Workflow tracking tools like JIRA help answer all the above questions. Look for a tool that is customizable as Data Science doesn’t need all the detail of a large scale software development project. Do make sure you track where your data is coming from and what deliverables are going out the door (see Guerrilla Analytics).

9. Packaging it all up with Vagrant

Capability: The diverse nature of Data Science activities leads to a correspondingly diverse set of tools as you’ve seen above. When you get things working, you would rather not break them and you would rather not force every team member to go through the same painful installations and configurations and risk inconsistency.

Tool: Vagrant and other ‘dev ops’ tools allow you to define your tech setups and their configuration in program code. What does that mean? It means that you can build your entire technology stack and configure it by running some code. It also means that the installation of all your tools and their configuration can be version controlled. As your technology stack evolves, update your code and issue a new release to your team. If you trash your technology or need to move to other servers, everything you need to reproduce your environment has been captured and you should be back up and running in minutes.

Read: Vagrant: Up and Running

10. Putting it all together – Operations with Guerrilla Analytics

I’ve covered a lot here. How do you put this all together without choking a team in conventions, rules, tools etc? How do you reduce Data Science chaos and continue to deliver iteratively and at pace? That’s where Guerrilla Analytics can help.

Have a read around this blog, check out the book and get in touch with any questions!

10 Data Science Capabilities (and supporting tools)

Data Scientists and data science managers want some guidance on supporting tools choice for their data science capabilities. Managers want to know what to buy and where to invest training. Junior data scientists, students, dev ops engineers and system administrators want to know what to learn. I will focus on the important capabilities for a Data Science team and the tools I have found useful for enabling those capabilities.

Irish Language Data Science lecture at Engineers Ireland

I gave the following lecture to Engineers Ireland which is the professional body for Engineers. It’s about “Data Science and the benefits for engineering” and is entirely in Irish.

It was an interesting exercise to brush up on my Gaelic and also to see the wealth of resources that now exist for using Irish with modern technical vocabulary. If you are curious or are trying to get your Gaelic up to scratch then please get in touch!

I may be the first person to coin Data Science in Gaeilge!

In terms of content, it covers what data and data science look like and how traditional engineering problems might benefit from the application of data science.

The full video is linked below.

And here are the slides 2016-04 Engineering Ireland_04_Gaeilge.

Data Science Patterns: Preparing Data for Agile Data Science

Are you a data scientist working on a project with constantly changing requirements, flawed changing data and other disruptions? Guerrilla Analytics can help.

The key to a high performing Guerrilla Analytics team is its ability to recognise common data preparation patterns and quickly implement them in flexible, defensive data sets.

After this webinar, you’ll be able to get your team off the ground fast and begin demonstrating value to your stakeholders.

You will learn about:
* Guerrilla Analytics: a brief introduction to what it is and why you need it for your agile data science ambitions
* Data Science Patterns: what they are and how they enable agile data science
* Case study: a walk through of some common patterns in use in real projects

I recently gave a webinar on Data Science Patterns. The slides are here.

Data Science Patterns, as with Software Engineering Patterns, are ‘common solutions to recurring problems’. I was inspired to put this webinar together based on a few things.

  • I build Data Science teams. Repeatedly, you find teams working inconsistently in terms of the data preparation approaches, structures and conventions they use. Patterns help resolve this problem. Without patterns, you end up with code maintenance challenges, difficulty in supporting junior team members and all round team inefficiency due to having a completely ad-hoc approach to data preparation.
  • I read a recent paper ‘Tidy Data’ by Hadley Wickham in the Journal of Statistical Software http://vita.had.co.nz/papers/tidy-data.pdf. This paper gives an excellent clear description of what ‘tidy data’ is – the data format used by most Data Science algorithms and visualizations. While there isn’t anything new here if you have a computer science background, Wickham’s paper is an easy read and has some really clear worked examples.
  • My book, Guerrilla Analytics (here for USA or here for UK), has an entire appendix on data manipulation patterns and I wanted to share some of that thinking with the Data Science community.

I hope you enjoy the webinar and find it useful. You can hear the recording here. Do get in touch with your thoughts and comments as I think Data Science patterns is a huge area of potential improvement in the maturity of Data Science as a field.

Guerrilla Analytics for Defensive Data Science

Data Science is ‘defensive’ if it can withstand the disruptions of changing data and requirements while still producing repeatable, explainable insights. Put another way, Defensive Data Science maintains data provenance. Fortunately, the Guerrilla Analytics Principles make it easy to do defensive Data Science [1]. This blog post describes how.

Data Science is a varied mix of activities. It typically includes database design, algorithm coding, interactive analytics (for example with IPython) and visualization coding as well as reporting and sharing of results. All of this is highly iterative because of the exploratory nature of Data Science where requirements and data change often.

Without some kind of control, Data Science projects quickly become impossible to manage. Code is fragmented across the languages and technologies involved. Key numbers and analyses cannot be reproduced. This is exacerbated when a team grows and a project is running at pace. The team drowns in the complexity of its own creations.

Why Do Defensive Data Science?

There are many reasons to strive for ‘Defensive Data Science’. When I have run teams of up to 12 data scientists for very demanding stakeholders, maintaining data provenance has increased team efficiency and made Data Science easier to manage and maintain. Advantages include:

  • Reduction in time wasted in tracking data sources, data modifications and analysis outputs. Without some kind of data provenance in place, a team wastes time trying to find out where data came from, how data was modified by continuously evolving code and which of many versions of analysis were delivered to a customer.
  • Fewer errors. If you have data provenance in place, it is much easier to track everything that a team is doing. You know which version of your code, which version of your data, which version of your 3rd party libraries and which version of your business understanding came together to make any number that your team produced. You can stand over your analyses with confidence.
  • Easier sharing of work. If all the inputs and outputs of your Data Science team’s work are easy to identify then sharing of work, collaboration, handovers and on-boarding of new team members are less of a toll on the team.

How should a team do Defensive Data Science?

Fortunately, many of the challenges of defensive Data Science have already been addressed in Software Engineering. There are decades of tools and techniques that can be easily adapted to the needs of defensive Data Science. Follow these steps when moving your work into a more defensive setup.

  1. Semantic project structures. Firstly, get your project structure right. By following simple conventions on project structure, it is easy for a team to know where key project artefacts are located with minimal overhead of documentation. The team know where to put stuff and don’t spend time trying to figure this out. Some of the most important artefacts are incoming data and its versions, code and its versions, 3rd party libraries and their versions and team outputs and their versions. Guerrilla Analytics advocates simple flat project structures, giving a unique ID to incoming data and outgoing work products.
  2. Clear data steps. Popular Data Science languages such as SQL and Python are quite free form, permitting many routes to solving a given problem. This is their strength but it is also a difficulty when it comes to managing a team solving a diverse range of problems. A given problem can be solved in several different ways. You do not want to discourage this exploratory behaviour. However, by breaking down data extraction, transformation and reshaping into clearly defined modular ‘data steps’ it becomes much easier to automate, test and modify data flows. New steps can be easily introduced, irrelevant steps removed and ‘component’ tests introduced where needed. You can have the best of both worlds – exploration and reproducibility.
  3. Automation with pipelines. With an easily understood project structure in place and clear modular code, the next most important aspect of defensive Data Science is probably automation. Automation with pipelines means using tools (custom or 3rd party) that facilitate executing code ‘in a single click’. Since Data Science is highly iterative and also complex, working without some form of automation risks not discovering stale data bugs and broken data flows. Having easy automation available encourages ‘continuous integration’ behaviours, running all code and tests often.
  4. Version control. While you can survive without version control tools, the ability to track all changes by a team member, develop code ‘branches’ in parallel, tag released code and roll back to earlier versions of code is essential for a team that needs to adapt to unforeseen changes in data and requirements. Version control for Data Science has some differences to version control for software development [2].
  5. Documentation and workflow tracking. The previous steps go a long way towards enabling a team to produce defensive Data Science. While the overhead of documentation must be kept to a minimum in a very dynamic environment, there are some basic activities that benefit from documentation. Having a wiki makes it easy to version control changes to evolving team knowledge. Typical uses of a wiki for defensive Data Science include documenting data dictionaries and business rules. Furthermore, a simple workflow tracker for keeping tabs on what the team is doing and logging received data will make it much easier to maintain data provenance across the team’s activities.
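
To make the ideas of ‘clear data steps’ and ‘automation with pipelines’ a little more concrete, here is a minimal sketch in plain Python. The step functions, field names and sample rows are hypothetical, and a real team would more likely run SQL files or use a dedicated pipeline tool; the point is only that each step is small, testable on its own, and run in a fixed order by one call.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
Step = Callable[[List[Row]], List[Row]]


def drop_missing_amounts(rows: List[Row]) -> List[Row]:
    """Data step: remove rows that have no amount recorded."""
    return [r for r in rows if r.get("amount") is not None]


def amounts_to_float(rows: List[Row]) -> List[Row]:
    """Data step: convert the amount field from text to a number."""
    return [{**r, "amount": float(r["amount"])} for r in rows]


def run_pipeline(rows: List[Row], steps: List[Step]) -> List[Row]:
    """Run every step in order, logging row counts so breaks are visible."""
    for step in steps:
        rows = step(rows)
        print(f"{step.__name__}: {len(rows)} rows")
    return rows


# Hypothetical raw data with one broken record.
raw = [
    {"id": 1, "amount": "10.5"},
    {"id": 2, "amount": None},
    {"id": 3, "amount": "4"},
]
clean = run_pipeline(raw, [drop_missing_amounts, amounts_to_float])
```

Because each step takes rows in and gives rows out, a new step can be added to the list, or an irrelevant one removed, without touching the others.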

Defensive Data Science need not be an administrative burden on a team. There are many benefits and a few simple principles go a long way to making a team easier to manage and more adaptable to dynamic project environments.

Do you have anything you wish to add? Please get in touch or have a look at Guerrilla Analytics. I’d be happy to discuss.

Further Reading

[1] Defensive Data Science: What we can learn from software engineers, http://www.predictiveanalyticsworld.com/patimes/defensive-data-science-what-we-can-learn-from-software-engineers0811153/

[2] Guerrilla Analytics: A Practical Approach to Working with Data, Enda Ridge 2014

Best Practices When Starting And Working On A Data Science Project

Several topical questions were recently asked on Data Science Central. This post addresses the question “What best practices do you recommend, when starting and working on enterprise analytics projects?” I have worked as a Data Scientist for 8 years now. This was after completing a PhD on “Design of Experiments for Tuning Optimisation Algorithms”. So I have a formal background in rigorous experiment design for Data Science and have also managed some pretty complex and fast paced projects in sectors including Financial Services, IT, Insurance, Government and Audit.

This post summarises my thoughts on best practice, which are heavily based on practical experience as described in my book “Guerrilla Analytics: A Practical Approach to Working with Data”. The book contains almost 100 best practice tips for doing Data Science in dynamic projects where reproducibility, explainability and team efficiency are critical.

Here is a summary of the best practices for working on enterprise analytics projects.

  • Soft Skills Best Practice
    • Consult
      • Understand the business problem
      • Understand the stakeholders
      • Understand your STARS situation
    • Communicate
      • Explain what you are doing and why
      • Explain the caveats in interpreting what you are doing
      • Always focus on the business problem
      • Continuously validate the above
    • Budget and Plan
      • Clearly set out your approach, milestones and deliverables
      • Measure progress and adjust when going off track or moving in a new direction
  • Technical Skills Best Practice Using the 7 Guerrilla Analytics Principles
    • Operations
      1. Keep Everything (Principle 1: Space is cheap, confusion is expensive)
      2. Keep It Simple (Principle 2: Prefer simple, visual project structures and conventions)
      3. Automate (Principle 3: Prefer automation with program code)
      4. Maintain Data Provenance (Principle 4: Maintain a link between data on the file system, data in the analytics environment, and data in work products)
      5. Version Control Data and Code (Principle 5: Version control changes to data and analytics code)
      6. Consolidate (Principle 6: Consolidate team knowledge in version-controlled builds)
      7. Think like a developer (Principle 7: Prefer analytics code that runs from start to finish)
    • Testing
      • Test data with the 5 Cs of Guerrilla Analytics Data Quality
      • Test code. Take a risk based approach to small, medium and large tests to improve confidence in the correctness of data manipulations, data cleaning and the application of business rules.
      • Test models. Always review the standard tests that accompany a model or algorithm. Run models against new data to make sure they perform.

The rest of this post will look at these best practice guidelines in more detail.

Soft Skills Best Practice

Consult

Whether your job title is ‘consultant’ or not, the fact is you are probably acting as a consultant to some degree. See Peter Block’s excellent book “Flawless Consulting: A Guide to Getting Your Expertise Used” on the Data Science Reading List for more information. Recognise that you are in a position where you need to influence stakeholders to use your insights and take action. That means you need to:

  • Understand the business problem. Ask questions, take notes, play back your understanding until you have the best understanding possible of what the real problem is, its drivers, its blockers and what a successful outcome looks like for all stakeholders. Too much data science fails because it produces the right answer to the wrong problem.
  • Understand the stakeholders. Who is asking you to solve this problem? Why? Who is sponsoring the project and what are their concerns, drivers, targets?
  • Understand your STARS situation. Are you in a situation that is a Start-up, Turnaround, Accelerated growth, Realignment, Sustaining success? Each of these requires a different approach from fast-action heroics at one end of the spectrum to carefully planned maintenance and improvement at the other. You can read more about these ideas in the excellent book “The First 90 Days: Critical Success Strategies for New Leaders at All Levels”.

Communicate

For your work to be successful, you need to be able to communicate what it is you have done and why your audience should care. This applies regardless of the level you are operating at.

This may be disappointing for a budding Data Scientist but your most sophisticated and clever work will only be appreciated in the context of your ability to consult as above. A ‘decrease in Type 1 error’, a ‘better Gini’, a responsive beautiful visualization are only of value when cast in terms that address the business problem in ways your stakeholders care about and can understand. Your manager who trusts your abilities may not need to know the minutiae of your workings but rather that you have taken a sensible approach and that you are clear on its limitations and any inherent risks/caveats. When communicating,

  • be able to summarise your key insights on a page
  • keep the technical details to an appendix. Your objective is rarely to impress with technicalities. Instead it is to deliver insight that leads to action.
  • be able to visualize your insights with a story that engages your audience

Budget and Plan

Data Science projects are notoriously difficult to budget and plan because they are typically of an exploratory nature. Many can run indefinitely when poorly managed teams exhaustively mine data sets. There are almost unlimited ways to cut data and present it. This does not let you off the hook, however. Your stakeholders’ time is limited, as is your time with business SMEs (they have day jobs). Your own time is limited. When running your project,

  • set goals with timelines
  • measure and track progress and adjust when necessary
  • avoid the temptation to do something because it’s cool or fun before validating it with your stakeholders

Win-Vector gives a really excellent post on how to set expectations in Data Science projects.

Technical Best Practice with Guerrilla Analytics

Of course, Data Science is a technical discipline. There are 7 best practice guidelines called the Guerrilla Analytics Principles that will help keep everything running smoothly despite the very dynamic situations you are faced with. These principles apply across the entire analytics lifecycle from data extraction through to reporting.

Operations

Keep Everything (Principle 1: Space is cheap, confusion is expensive)

Having a record of everything makes it easier when Data Science work needs to explain what was done, how understanding evolved, and why there may be errors or caveats around interpretation of results. Best practice tips include:

  • Keep all the data you receive, even older broken versions.
  • Keep all modifications to the data.
  • Keep all communications about the data (meetings, notes, dictionaries).
  • Keep all work products you create, even when they are superseded or replaced because they were wrong. This avoids confusion and conflicting results in what is a highly iterative environment.

Keep It Simple (Principle 2: Prefer simple, visual project structures and conventions)

Simple project and team conventions are easier to remember and therefore easier to follow. The more your team can look at a project structure and understand the purpose of the structure then the less time they waste and the lower the risk of inconsistencies. Some examples of best practice include:

  • Have one place for all data the team receives.
  • Take a consistent approach to loading data into the analytics environment.
  • Have one place for all work products the team produces.
  • Keep supporting materials (emails, documentation etc) near the data they support.

Automate (Principle 3: Prefer automation with program code)

It’s inevitable that your project will involve multiple code files, perhaps from multiple tools, that must be repeatedly run in the correct order by multiple team members. Some of my projects have had several hundred SQL files for preparing data. When you develop a machine learning or statistical model, it is typically an iterative process that involves deriving new variables, re-running algorithms and creating outputs and visualizations to test the model performance.

  • Write code that is modular so parts of an analysis can be re-run and tested
  • Write code that can be run from the command line to facilitate simple automation.
  • Use a script for automating your code or use something a little more sophisticated like a build tool.
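
As one possible shape for such an automation script, the sketch below runs every Python file in a folder in the order given by a numeric filename prefix. The prefix convention (for example 010_load.py, 020_clean.py) and the folder name in the usage comment are assumptions made for illustration, not a convention taken from this post.

```python
import subprocess
import sys
from pathlib import Path


def ordered_scripts(folder):
    """Return .py files sorted by numeric prefix, e.g. 010_load.py first."""
    scripts = [
        p for p in Path(folder).glob("*.py")
        if p.name.split("_")[0].isdigit()  # skip files without a numeric prefix
    ]
    return sorted(scripts, key=lambda p: int(p.name.split("_")[0]))


def run_all(folder):
    """Run each script from start to finish, stopping at the first failure."""
    for script in ordered_scripts(folder):
        print(f"running {script.name}")
        subprocess.run([sys.executable, str(script)], check=True)


# Usage (hypothetical folder name):
# run_all("code")
```

Using check=True means a broken step halts the run straight away rather than letting later steps work on stale data.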

Maintain Data Provenance (Principle 4: Maintain a link between data on the file system, data in the analytics environment, and data in work products)

Data Provenance simply means knowing where your data came from, how it was changed, who changed it, what was delivered and to whom it was delivered. If you can maintain this link from team inputs through to team outputs then your data science project will be much easier to manage, risks of inconsistency and incorrectness will be mitigated and team efficiency will be promoted. Some tips that help maintain data provenance include:

  • Give all received data its own unique identifier
  • Give all work products issued by the team their own unique identifier
  • Keep simple logs to record data receipt and work product creation
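
A log like this does not need special tooling. The sketch below keeps a data receipt log as a list of records, hands out sequential identifiers and saves the log as a CSV file. The `D001` identifier format and the column names are assumptions made for illustration.

```python
import csv
from datetime import date

LOG_FIELDS = ["id", "received_on", "source", "description"]


def next_id(prefix, existing_ids):
    """Return the next identifier in sequence, e.g. D001, D002, and so on."""
    numbers = [int(i[len(prefix):]) for i in existing_ids if i.startswith(prefix)]
    return f"{prefix}{max(numbers, default=0) + 1:03d}"


def log_receipt(log, source, description, received_on=None):
    """Append one data receipt to the log (a list of dicts) and return its new id."""
    data_id = next_id("D", [row["id"] for row in log])
    log.append({
        "id": data_id,
        "received_on": (received_on or date.today()).isoformat(),
        "source": source,
        "description": description,
    })
    return data_id


def save_log(log, path):
    """Write the log as a CSV file that anyone on the team can open."""
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=LOG_FIELDS)
        writer.writeheader()
        writer.writerows(log)
```

The same `next_id` helper can number work products by passing a different prefix, such as `"WP"`.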

Version Control Data and Code (Principle 5: Version control changes to data and analytics code)

With so much going on in the project and with the typically iterative nature of Data Science, it becomes imperative to use version control. You will need to go back to previous versions of code, branch a copy of current work to test a theory and undo mistakes. At a minimum you should:

  • version control data received
  • version control program code used to produce analytics
  • version control work products that are issued by the team

With this in place, any work product can be identified as say ‘version 3 of the work product, using version 3 of my code which draws on version 2 of the data received’.

Consolidate (Principle 6: Consolidate team knowledge in version-controlled builds)

It is inevitable that data cleaning rules, business rules and lessons learned will emerge over the course of a project. If each team member applies these individually then the team is not performing efficiently and there is a risk of inconsistency. Take even a simple ranking operation. Do we mean a dense rank (1, 2, 2, 3) or something else (1, 2, 2, 4 or 1, 3, 3, 4, etc.)?

  • Identify the latest true version of data received and publish that centrally to the team
  • Identify common data manipulations, data cleaning and business rules. Implement them centrally and publish them to the team
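
To make the ranking example concrete, here is a minimal sketch of three ranking rules that give different answers on the same tied data. The function names are illustrative; the point is that the team should agree on one rule and publish it centrally rather than each member writing their own.

```python
def dense_rank(values):
    """Ties share a rank and no ranks are skipped: 1, 2, 2, 3."""
    order = {v: i + 1 for i, v in enumerate(sorted(set(values)))}
    return [order[v] for v in values]


def competition_rank(values):
    """Ties share the lowest rank and the following rank is skipped: 1, 2, 2, 4."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]


def modified_competition_rank(values):
    """Ties share the highest rank, leaving the gap before them: 1, 3, 3, 4."""
    ordered = sorted(values)
    return [len(ordered) - ordered[::-1].index(v) for v in values]


scores = [10, 20, 20, 30]
print(dense_rank(scores))                 # [1, 2, 2, 3]
print(competition_rank(scores))           # [1, 2, 2, 4]
print(modified_competition_rank(scores))  # [1, 3, 3, 4]
```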

Think like a developer (Principle 7: Prefer analytics code that runs from start to finish)

Data Science is not software development although both use program code. Data Science typically involves a lot of profiling of data to understand its properties. Models and rules are trialled to see if they perform well on the data. Cleaning rules are continuously discovered as understanding of the data grows.

This inevitably leads to many ‘code snippets’ that are necessary to developing an understanding of the data but not required for the final work product. These code snippets usually break over the course of the project, clog up work product code and confuse team members and reviewers when it comes to reproducing a result.

Writing analytics code that executes end-to-end eliminates these types of bugs and ‘code noise’. Team efficiency is improved, reproducibility of results is guaranteed and risk of data loss is eliminated since deleted data sets can easily be restored with a quick code execution.

Testing

So much of testing is overlooked in Data Science. Its importance led me to devote 4 chapters to Data Science Testing in “Guerrilla Analytics: A Practical Approach to Working with Data”.

At a minimum, high performing teams are doing some amount of testing. The testing falls into three areas. But before discussing those, pay attention to the overarching testing best practices.

  • Take a risk-based approach. There is not time to test everything exhaustively. Make sure that the most critical aspects of the data science work are being tested first.
  • Test early and often. Do not be tempted to put off testing until later in your project. Many of the models you build and data transformations you code may have to change if flaws are discovered in the data or in code.
  • Automate testing. To facilitate testing often, it helps to have some form of automation of test scripts so that tests can be easily repeated and you immediately know when something has gone wrong.

Test Data with the 5 Cs of Data Quality

The excellent Bad Data Handbook lists 4 data test categories which I extend to 5 for “Guerrilla Analytics: A Practical Approach to Working with Data”. The 5 Cs of Guerrilla Analytics data quality are:

  1. Completeness: Do you actually have all the data you expect to have?
  2. Correctness: Does the data actually reflect the business rules and domain knowledge you expect it to reflect?
  3. Consistency: Are refreshes of the data consistent and is the data consistent when it is viewed over some time period?
  4. Coherence: Does the data “fit together” in terms of its expected relationships?
  5. aCcountability: Can you trace the data to tell where the data came from, who delivered it, where it is stored in the Data Manipulation Environment (DME), and other information useful for its traceability?
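
Each of the 5 Cs can be turned into a small automated check. The sketch below shows one possible check per category on a list of order records. The field names (`order_id`, `customer_id`, `amount`, `data_id`) and the specific rules are assumptions for illustration; real checks depend on your data and business rules.

```python
def check_completeness(rows, expected_count):
    """Completeness: did we receive the number of records we were told to expect?"""
    return len(rows) == expected_count


def check_correctness(rows):
    """Correctness: a sample business rule, here that order amounts are never negative."""
    return all(row["amount"] >= 0 for row in rows)


def check_consistency(previous_rows, refreshed_rows):
    """Consistency: a refresh should not lose records that were delivered before."""
    previous_ids = {row["order_id"] for row in previous_rows}
    refreshed_ids = {row["order_id"] for row in refreshed_rows}
    return previous_ids <= refreshed_ids


def check_coherence(orders, customers):
    """Coherence: every order should point at a customer that exists."""
    known = {c["customer_id"] for c in customers}
    return all(o["customer_id"] in known for o in orders)


def check_accountability(rows):
    """aCcountability: every record carries the id of the delivery it came from."""
    return all(bool(row.get("data_id")) for row in rows)
```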

Test Code

Testing of analytics code typically covers the data preparation and manipulation code that leads up to a visualization or application of an algorithm. Testing is a well established field in software engineering but not so in Data Science. “How Google Tests Software” is a great introduction and identifies 3 types of tests. In the context of data science, these are:

  • Small: tests that wrap around small units of code, usually units that contain some particularly complex data manipulation or cleaning rules. Some examples include application of regular expressions, calculations of running totals, identification of duplicates.
  • Medium: tests that wrap around a ‘component’ of multiple units. This might include the joining up and end-to-end cleaning of a particular data segment such as ‘customer’ or ‘products’.
  • Large: tests that wrap around the entire end-to-end project. For example, does the output of the machine learning model still contain the expected number of customers or were customers accidentally dropped somewhere between raw data and algorithm output?

In addition, some best practice tips for facilitating testing include:

  • Write modular code, typically having one code file per data step. This makes it easier to perform small and medium tests and makes it easier for several team members to work simultaneously in the code base without blocking one another.
  • Use a common structure for test code. All test code should have a setup, test and tear-down structure. This makes it easier to automate and debug tests.
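
As an illustration of the setup, test and tear-down structure, here is a small test written with Python's built-in `unittest` module. It wraps a hypothetical `find_duplicates` function, which stands in for the kind of duplicate identification a small test would cover.

```python
import unittest


def find_duplicates(records, key):
    """Return the key values that appear more than once, in first-seen order."""
    seen, duplicates = set(), []
    for record in records:
        value = record[key]
        if value in seen and value not in duplicates:
            duplicates.append(value)
        seen.add(value)
    return duplicates


class TestFindDuplicates(unittest.TestCase):
    def setUp(self):
        # Setup: build a small, known data set before every test.
        self.records = [
            {"customer_id": "C1", "name": "Ann"},
            {"customer_id": "C2", "name": "Bob"},
            {"customer_id": "C1", "name": "Ann B."},
        ]

    def test_reports_repeated_ids(self):
        # Test: the repeated customer id is reported exactly once.
        self.assertEqual(find_duplicates(self.records, "customer_id"), ["C1"])

    def test_clean_data_has_no_duplicates(self):
        self.assertEqual(find_duplicates(self.records[:2], "customer_id"), [])

    def tearDown(self):
        # Tear-down: release the test data so tests cannot affect one another.
        self.records = None
```

Tests in this shape can be run from the command line with `python -m unittest`, which makes them easy to automate.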

Test Models

When testing statistical models and machine learning outputs, things are understandably a little more domain specific. At a minimum, best practice involves:

  • Running model tests. For example, tests of regression include heteroscedasticity, normality, correlation, and leverage.
  • Using cross-validation. Build and test models on a variety of partitions of the data to help minimise overfitting
  • Testing model predictions. See how the model performs on data it has not encountered before.
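
The cross-validation idea can be sketched in a few lines of plain Python. The example below splits the rows into k folds, fits on all but one fold and measures the squared error on the held-out fold. The `fit_mean` baseline is a deliberately trivial stand-in for a real model, and in practice a machine learning library would usually provide this machinery.

```python
import random


def k_fold_indices(n_rows, k, seed=0):
    """Shuffle row indices and split them into k near-equal folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(xs, ys, fit, predict, k=5, seed=0):
    """Return the mean squared error measured on each held-out fold."""
    errors = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        # Fit only on rows outside the held-out fold.
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        squared = [(predict(model, xs[i]) - ys[i]) ** 2 for i in held_out]
        errors.append(sum(squared) / len(squared))
    return errors


def fit_mean(xs, ys):
    """Trivial baseline 'model': remember the mean of the training targets."""
    return sum(ys) / len(ys)


def predict_mean(model, x):
    """Predict the remembered mean regardless of the input."""
    return model
```

A large gap between the error on training data and the errors returned by `cross_validate` is a common sign of overfitting.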

Summary

There is a lot to be mindful of in terms of best practices when doing Data Science. The key insight from “Guerrilla Analytics: A Practical Approach to Working with Data” is that 7 Guerrilla Analytics Principles can mitigate the operational risks inherent in Data Science projects. You can read more about these risks, the Guerrilla Analytics Principles and see almost 100 examples of their practical application in the book “Guerrilla Analytics: A Practical Approach to Working with Data”.

Data Scientists Need a Better Operating Model

A Data Science Operating Model

Motivation

In my job, I interview many potential Data Scientists and Data Analysts. I have also managed people with a wide range of experience from interns to seasoned PhDs with degrees in fields including Computer Science, Chemistry, Physics, Mathematics, Engineering and the Humanities. Just last week I had several conversations with prospective Data Scientists who are early in their careers and wondering what projects they should try to get on, what technologies they should learn and what additional courses they should study.

In many cases, where Data Scientists struggle on projects has nothing to do with the technical complexity of problems or any lack of Data Science skills – they have all of that from their study and training and are quite motivated people who are passionate about their field.

In fact, what makes Data Science difficult for many is the complexity of operating in a Data Science project environment. Specifically, a Data Scientist has to operate in an environment that looks like the following.

  • Dynamics of data: data will change over the course of most projects. It will be refreshed, added to, replaced and repaired. Manual data sources are a common way of interfacing with other team members outside the Data Science team. Since much Data Science involves bringing together disparate data sources in novel ways, it is rare for all of this data to arrive at the same time and to schedule. So Data Scientists have to cope with trying to design and implement their work on top of a base of data that is always in flux.
  • Dynamics of requirements: Data Science is exploratory. You really don’t know what’s in the data until you have worked with it. Typically several algorithms and analyses have to be tried out. The insights from these activities often lead to the project taking a new direction and new analyses being framed for these new requirements.
  • Dynamics of people: it is rare to work in isolation. A Data Scientist will typically interact with IT, warehousing, developers, business SMEs, third party data providers and, of course, their team mates and their customer. This means that other people are providing inputs to their data, other people are writing code and creating data sets they depend on and other people are presenting results they may have contributed to. When other team members leave or take vacation, they may be expected to take over work.
  • Constraints on time and resources: despite the dynamics above, the Data Scientist will be expected to add value and deliver successfully in limited time and with limited resources. You don’t always get the ideal technology stack or one that you are familiar with. You don’t always get all the skill sets you need on a project. And you don’t always get all the data for a perfect analysis.

If a Data Scientist does not have methods for coping with these dynamics and constraints then they will struggle to perform. Ultimately, they will rarely see the more advanced analytics where they can really add value.

  • They become mired in forensics of their own work and their team’s work
  • Time is wasted investigating and explaining inconsistencies
  • Deliverables must be rewritten because the original cannot be reproduced or cannot be explained
  • A team descends into reactively producing analyses rather than leading the project from their data and their deliverables
  • Results are plain wrong because of the chaos that arises from project dynamics and constraints

A Data Science Operating Model with Guerrilla Analytics

Guerrilla Analytics and its 7 Principles provide a tried and tested operating model for Data Scientists. It has been used in many high pressure, dynamic and constrained project environments to deliver analyses that are reproducible, auditable and explainable.

This Guerrilla Analytics operating model breaks Data Science activities into the following components, highlighting the challenges faced in each component and offering guidelines on how to overcome these challenges.

  • Data Extraction: how data is extracted and transported by a team in a traceable manner
  • Data Receipt: how data should be received and logged by a team
  • Data Load: how to load multiple versions of data into an analytics environment without breaking data provenance
  • Coding: how data should be manipulated in ways that promote flexibility, testability, audit and agility. How to structure code and how to mix multiple tools and programming languages without being overwhelmed.
  • Work products and Reports: how to produce multiple versions of agile work products and project milestone reports so they can be tracked easily with a customer or fellow team members
  • Building consolidated analytics: how to identify and control consolidated understanding, business rules and data sets that emerge over the course of a project to promote efficiency and consistency and to avoid re-inventing the wheel
  • Testing: how to test analytics code and data sets in a fast paced environment
  • Workflows: simple workflows for peer review and quality control

Operating models may not produce beautiful visualizations or involve high end statistics and machine learning. However, they do allow Data Scientists to hit the ground running. They provide Data Scientists with the tools they need to survive real world project environments. This in turn improves the Data Scientist’s coordination with team members, their efficiency, their credibility, and ultimately increases the opportunities to add value.

We expect methodology from traditional laboratory scientists. Let’s expect the same from Data Scientists.