Building a Data Science Capability: Inspirational Keynote at Data Leaders Summit Europe

I’ve just delivered the inspirational keynote at Data Leaders Summit Europe, 2018. Lots of great engagement and feedback. In particular, it seems people liked a clear definition of what data science actually is and the practical steps (and missteps) I took in building a capability at Sainsbury’s.

You can find all the slides here on Slideshare. As always, if you have questions or want to discuss then please get in touch!

Distinguishing Data Analytics from Data Science. 5 implications for your organisation

People often struggle to distinguish Data Analytics from Data Science. They are two related but distinct disciplines, both important to a business. This post distinguishes Data Analytics from Data Science and lists the implications of that distinction for your organisation.

Forget about data for a minute

Think about a traditional scientist and what they do. You wouldn’t define them by their use of a petri dish, or a microscope, or any other tool. They are defined by following the scientific method. They aim to understand the world by producing mathematical models from observed data and collecting further data to validate those models. Think about a nuclear physicist. They use mathematics and computer simulation to model the behaviour of subatomic particles. If these models are good, they predict the behaviour of those subatomic particles well in the general case. For a model to be good, scientists must be able to reproduce it and it must allow us to reason about the real world. Amazing science has been done for centuries with pen and paper and simple apparatus but always by following the scientific method.

Distinguishing Data Analytics from Data Science

Data Science is a science like any other. It is irrelevant what apparatus is used so stop defining Data Science by Machine Learning, Venn Diagrams or programming languages. A Data Scientist models a business (customers, products, processes, web sites, stores, machinery) by gathering suitable data and evaluating models. If the Data Science is successful, the models will generalise well and allow a business to make predictions and optimisations about their customers, products, processes etc. Data Scientists are effectively creating data generating processes (experiments and models) to test hypotheses.

Data Analysts look at existing data to report patterns, summaries, populations, trends etc. Yes, Data Scientists look at existing data before creating an experiment. Yes, they should do Analytics on the outputs of their experiments to understand what is going on. But fundamentally, Analytics is tactical Business Intelligence. Data Science is, well, science!

Implications for your organisation

Implication 1: Analytics needs complete data, Science does not

I often get some strange looks when data engineers and architects promise Data Science ‘all the data’ and I say they don’t need it. Data Scientists need samples of data. Yes, those samples need to be of sufficient size to build good generalisable models. Yes, the data should not be biased or biases should be clear. But using ‘all the data’ is probably a bad thing when building models. Analytics, being a type of tactical reporting, needs complete data because it is generating business KPIs. A profit KPI plus or minus 10% isn’t very helpful. A statistical definition of a customer won’t help you scale your website – you need to know hits and logins.

Implication 2: Mature Analytics becomes reporting, mature Data Science becomes algorithms

If Analytics starts to identify common requests, new KPIs of interest, common sub-populations of interest to the business etc then these should be productionised in Reporting. There is no point having a team of Analysts repeatedly writing the same queries. Put them in a dashboard so the business can self-serve.

Data Science, by contrast, is creating those data consuming and data generating models. If models need to be available for decision making then those models should be productionised in algorithms.

Implication 3: Data Science results can be used by Analytics but Analytics results should rarely feed into Data Science

A business will want to report on the decisions made by the models productionised as algorithms. It therefore makes sense that tactical queries from Analytics could be run against algorithm output data.

Analytics, however, does not typically produce results that feed into Data Science work. Analytics can help inform the Data Scientist about the domain. After all, Analysts know the typical tactical reports and the typical KPIs the business uses. However, when it comes to model variables, the Data Scientist needs to figure those out for themselves and evaluate those variables as part of their experimental process.

Implication 4: Analytics has fewer dependencies than Data Science

Analytics can be run tactically off a data warehouse with few other dependencies beyond the right tools. Sure, without a reporting function to productionise Analytics, an organisation will never consolidate its tactical queries and will be forever in a state of panicked queries. But the Analytics will still get done and the business will get their information.

If Data Science models are not turned into engineered algorithms then an organisation is simply wasting its time and money on curiosities. Data Science benefits from the same warehouse of data as Analytics. But it also needs a mechanism to bring in other new data sources. It needs an engineering team to turn its models into algorithms. An engineering team needs a testing and support team to make sure things keep working. And any change in automation and decision making needs change management to make everybody comfortable with the work the algorithm will do.

Implication 5: Data Analytics to Data Science is a big career leap

The hype around Data Science has unhelpfully led to analytics professionals rebranding themselves. Awesome Analytics professionals are customer facing, know an organisation’s data inside out, can wrangle and manipulate data quickly and produce relevant and accurate business KPIs with little business steer. They tell a compelling story with visualizations. Their code will never be productionised. None of this helps them be awesome Data Scientists.

A significant part of a scientist’s work involves distilling a business objective into hypotheses. Experiments need to be designed to choose the right model and evaluate the robustness of that model. Experiments need to be assessed for significance, bias, confounding, blocking, correlations etc. And when a good model is found, a knowledge of software engineering for productionisation is required. What is the computational complexity? What data should be logged for that scientific reproducibility? What data needs to be filtered out to avoid model degradation and biased results? These are questions an Analytics professional shouldn’t need to ask.

Taking heart

Distinguishing Data Analytics from Data Science is not a competition. Both functions are clearly important for an organisation. You need the capability to mine your data for patterns and summaries that are not yet available in reporting. You need the capability to rigorously create models that can help automate decision making. Data Science may currently be more hyped than Analytics but perhaps a reckoning is coming. Models and even the model tuning process are becoming increasingly commoditised. Organisations will hopefully see sense and stop rewarding Data Scientists simply for knowing the APIs of a concoction of evolving programming libraries and will instead focus on the production of models that are understood and that they can have confidence in. That will only come from those Data Scientists who understand the scientific method and how to apply it.

13 Steps to Better Data Science: A Joel Test of Data Science Maturity

Data Science teams have different levels of maturity in terms of their ways of working. In the worst case, every team member works as an individual. Results are poorly explained and impossible to reproduce. In the best case, teams reach full scientific reproducibility with simple conventions and little overhead. This leads to efficiency and confidence in results and minimal friction in productionising models. It is important to be able to measure a team’s maturity so that you can improve your ways of working and so you can attract and retain great talent. This series of questions is a Joel Test of Data Science Maturity. As with Joel’s original test for software development, all questions are a simple Yes/No and a score below 10 is cause for concern. Depressingly, many teams seem to struggle around a 3.

A Joel Test of Data Science Maturity

  1. Are results reproducible?
  2. Do you use source control?
  3. Do you create a data pipeline that you can rebuild with one command?
  4. Do you manage delivery to a schedule?
  5. Do you capture your objectives (scientific hypotheses)?
  6. Do you rebuild pipelines frequently?
  7. Do you track bugs in your models and your pipeline code?
  8. Do you analyse the robustness of your models?
  9. Do you translate model performance to commercial KPIs?
  10. Do new candidates write code at interview?
  11. Do you have access to scalable compute and storage?
  12. Can Data Scientists install libraries and packages without intervention by IT?
  13. Can Data Scientists deploy their models with minimal dependencies on engineering and infrastructure?

 

1. Are results reproducible?

A core aspect of traditional science is that results be reproducible. This is essential when building models that aim to improve our understanding of the world. It is no different for Data Science. And it turns out that reproducibility promotes efficiency. Teams no longer waste time wondering which data and code led to a particular result, and why results might have changed as understanding of the problem improved.

2. Do you use source control?

Building algorithms and data pipelines is complex. Source control lets you track changes to your code, roll back poor changes and try out new ideas without breaking working code.

3. Do you create a data pipeline that you can rebuild with one command?

A version controlled data pipeline allows you to centralise and consolidate your understanding of the data (business and cleaning rules) and your definition of the features that feed into an algorithm. If you can rebuild this pipeline with one command then you can quickly iterate as your understanding of the problem evolves and as you inevitably discover issues with the data.
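To make that concrete, here is a minimal sketch of what a one-command rebuild might look like in a Python and Pandas setup. The file names, cleaning rules and features are purely illustrative assumptions, not a prescription:

```python
# A minimal "rebuild everything with one command" sketch in Python/Pandas.
# File names, cleaning rules and features are illustrative placeholders.
# Run with:  python rebuild_pipeline.py
import pandas as pd


def load_raw(path="data/raw/transactions.csv"):
    # Raw data stays read-only; every fix happens downstream in code.
    return pd.read_csv(path)


def clean(df):
    # Business and cleaning rules live in one place so they are applied consistently.
    df = df.dropna(subset=["customer_id"])
    df["amount"] = df["amount"].clip(lower=0)
    return df


def build_features(df):
    # Feature definitions are rebuilt from scratch on every run.
    return (df.groupby("customer_id")
              .agg(total_spend=("amount", "sum"),
                   n_transactions=("amount", "count"))
              .reset_index())


if __name__ == "__main__":
    features = build_features(clean(load_raw()))
    features.to_csv("data/derived/customer_features.csv", index=False)
    print(f"Rebuilt {len(features)} customer feature rows")
```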

4. Do you manage delivery to a schedule?

Data science needs a schedule to keep it focused. As projects are often open ended and exploratory, you need to have clear checkpoints where you can make a call that perhaps ‘this data is not fit for purpose’ or ‘there is no value in further iterations of model refinement’. Teams that do not deliver to any schedule tend to drift into perfection being the enemy of done.

5. Do you capture your objectives?

Every data science problem is really an optimisation problem and you cannot optimise without an objective. Although it can sometimes feel painful or appear ‘picky’, it is essential that the objective of a project and a model are clearly defined. Increase profit? Increase volume? Increase both, with some balance? Get clear and agree it with your customer.

6. Do you rebuild pipelines often?

As with traditional software, rebuilding often can highlight integration bugs. In the context of data science, the integration points are effectively the data flows through a pipeline. If you do not rebuild often it is possible to introduce cyclic references into your data preparation, lose the logic for the creation of a feature, and hit other nasty bugs that cause you to lose that essential reproducibility.

7. Do you track bugs in your model and in your pipeline code?

Data science model development is complex. It has many dependencies. Customer feedback and domain knowledge are incredibly valuable. Make sure you are tracking feedback so mistakes are not repeated and so your models are always improving.

8. Do you analyse the robustness of your models?

No model will work in all scenarios and poorly performing models are dangerous. It is important to analyse and understand the conditions under which your model will work and under which it will degrade. This is robustness analysis. Are model outputs biased? Does a model require 6 months of training data or 2 weeks? Does a model only perform once it has seen 5 customer journeys? A mature data science team has confidence pushing its models into production because this type of testing has been done in advance.
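As a hedged sketch of one such check, here is how you might compare holdout performance across different training window lengths. The data, features and model below are synthetic placeholders used only to show the shape of the analysis:

```python
# A sketch of one robustness check: how does holdout performance change as the
# training window shrinks? Data, features and model are synthetic placeholders.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
n = 5_000
df = pd.DataFrame({
    "date": pd.date_range("2017-01-01", periods=n, freq="h"),
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
})
df["churned"] = (df["x1"] + rng.normal(scale=0.5, size=n) > 0).astype(int)

holdout = df.tail(1_000)      # fixed evaluation period
history = df.iloc[:-1_000]    # everything before the holdout

for weeks in [2, 4, 8, 26]:
    cutoff = history["date"].max() - pd.Timedelta(weeks=weeks)
    train = history[history["date"] >= cutoff]
    model = LogisticRegression().fit(train[["x1", "x2"]], train["churned"])
    auc = roc_auc_score(holdout["churned"],
                        model.predict_proba(holdout[["x1", "x2"]])[:, 1])
    print(f"{weeks:>2} weeks of training data -> holdout AUC {auc:.3f}")
```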

9. Do you translate model performance to commercial KPIs?

Technical performance metrics are important for you as a data scientist. However, to get business buy-in and adoption of your models you need to make them commercially relevant. That means turning predictions into revenue, cost savings, time savings or whatever the business cares about and whatever will justify further funding of your work.
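A minimal sketch of that translation for a hypothetical churn model follows. Every count and monetary figure here is an illustrative assumption, not a benchmark:

```python
# A hedged sketch: turning a churn model's confusion-matrix counts into an
# expected commercial impact. Every figure below is an illustrative assumption.
retained_value = 120.0   # assumed value of saving one churning customer
contact_cost = 5.0       # assumed cost of one retention offer

true_positives = 400     # churners the model correctly targets
false_positives = 900    # loyal customers targeted unnecessarily

expected_benefit = true_positives * (retained_value - contact_cost)
wasted_spend = false_positives * contact_cost
net_impact = expected_benefit - wasted_spend

print(f"Expected net impact of acting on the model: {net_impact:,.0f}")
# Expected net impact of acting on the model: 41,500
```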

10. Do new candidates write code at interview?

Data science is full of hype, bluffers and analytics rebranding itself. You want to filter down to the great candidates who understand the scientific method and can apply it to select and tune models. A technical test that involves using data and writing code is the most effective way to do this.

11. Do you have access to scalable compute and storage?

The complex combination of technologies needed for Data Science often means that organisations struggle to enable their teams with the best technology to do their jobs. If your team does not have access to scalable compute and storage then its success will always be limited. Lack of a central place to store data and workings is a warning sign that Data Science is not taken seriously in an organisation.

12. Can data scientists install libraries and packages without intervention by IT?

If there is one word that summarises the requirements of Data Science it is ‘flexibility’. The nature of the work involves selecting models and tuning them against data. This means being able to quickly install and evaluate lots of model libraries. If a Data Science team needs approval for every library installation and upgrade then its speed of turnaround is going to slow from days to weeks and months.

13. Can Data Scientists deploy their models with minimal dependencies on engineering and infrastructure?

If models cannot be put into use they are of little value beyond curiosities. But deploying a model involves training on reproducible data, monitoring of decisions and performance, and A/B testing of new releases. Delays in deployment mean models go out of date or competitive advantage is lost. The best organisations have platforms that allow model deployment to happen quickly, driven by Data Scientists.
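As an illustrative sketch only (not a recommended architecture), a lightweight deployment might be a small Flask service wrapping a pickled scikit-learn model. The endpoint, feature names and file path are assumptions:

```python
# An illustrative deployment sketch only: a pickled scikit-learn model wrapped
# in a small Flask service. Endpoint, feature names and file path are assumptions.
import joblib
import pandas as pd
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("models/churn_model_v1.joblib")  # trained on reproducible data
FEATURES = ["total_spend", "n_transactions"]

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()
    X = pd.DataFrame([payload], columns=FEATURES)
    score = float(model.predict_proba(X)[0, 1])
    # Log inputs and outputs so decisions can be monitored and reproduced later.
    app.logger.info("prediction request=%s score=%.4f", payload, score)
    return jsonify({"churn_probability": score})

if __name__ == "__main__":
    app.run(port=5000)
```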

So how do you score a 13/13?

How would your team score on a Joel Test of Data Science Maturity? This is where Guerrilla Analytics can help. Guerrilla Analytics provides guiding principles and conventions for promoting data provenance and reproducibility in Data Science and Analytics work. There are guidelines on how to structure projects at every stage of the life cycle and how to consolidate knowledge in flexible data pipelines. You will also learn how to leverage techniques and tools from software engineering such as testing and source control.

Data Science – A Definition And How To Get Started

Confusion, hype, failure to start. Data Science has huge potential to change an organisation. It has real potential to improve the top and bottom lines. But many organisations become mired in the associated cultural, technological and people change. Data Science is delivered as an interesting report rather than a driver of change. Data Science identifies algorithms that are run in the safety of a lab but never make it into production.

Why does this happen and what can be done to avoid it?

I think we all agree that our organisations would be more effective if they made decisions based on data. Our organisations would be more efficient if they embraced a scientific approach to understanding their customers and their products. Our organisations would be vastly improved if they could put their learnings from data science into effect.

This is my keynote talk from the Polish Data Science with Business Conference. It covers:
– a practical definition of Data Science in retail and how retail can benefit
– operational and organisational challenges and conflicts
– Get started! 5 tips for proving your worth to leadership, starting now

Have you any thoughts or experiences on embedding Data Science in enterprises? Please get in touch!

10 Data Science Capabilities Your Team Needs (and the Tools to Support Them)

I’ve recently had several people ask me about tools for Data Science both online and at conference talks. This post lists some of the tools I use and the capabilities they provide.

When writing Guerrilla Analytics: A Practical Approach to Working with Data I deliberately avoided mention of tools. People can be dogmatic about tools and I thought this would be a distraction from the book’s core message around principles for doing effective Data Science in dynamic real-world projects.

That said, people do want some guidance in what can be a very overwhelming and fast moving field. Managers want to know what to buy and where to invest training. Junior data scientists, students, dev ops engineers and system administrators want to know what to learn. I will focus on the important capabilities for a Data Science team and the tools I have found useful for enabling those capabilities.

1. Version control with Git and git-flow

Capability: Typically you will go through many iterations of your code and the work products your code produces. It quickly becomes impossible to track changes and reproduce earlier work without some code version control tool. This is only exacerbated when your team size is >1.

Tool: Git is a great version control system and the effort to learn its command line interface is a very worthwhile investment.

Git is incredibly flexible. However this can lead to confusion and inconsistency in how it is applied. Git-flow is a set of scripts that automate much of what you will need to do in Git subject to a particular convention that happens to be very helpful for Data Science.

2. Wrangling and persisting data with PostgreSQL

Capability: Even if your data is small enough to fit in memory, reproducing work involves re-running all your scripts to load it back into memory before you can pick up where you left off. Other team members have to do the same. This is painful and inefficient. You therefore need to persist your work (raw data, intermediate datasets and work products).

Tool: A database gives you a way to persist your workings and intermediate datasets as well as share them with team members. Pick a database that is performant and flexible. I use PostgreSQL. It has an amazing set of features and this flexibility is what you want when doing Data Science.
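As a minimal sketch of what that persistence looks like in practice, assuming Pandas, SQLAlchemy and a PostgreSQL driver such as psycopg2 are available (the connection string, file and table names are illustrative):

```python
# A minimal persistence sketch, assuming Pandas, SQLAlchemy and a PostgreSQL
# driver (e.g. psycopg2) are installed. Connection string, file and table
# names are illustrative.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://analyst:secret@localhost:5432/project_db")

raw = pd.read_csv("data/raw/transactions.csv")
cleaned = raw.dropna(subset=["customer_id"])

# Persist the intermediate dataset once; the team queries it from here.
cleaned.to_sql("transactions_cleaned", engine, if_exists="replace", index=False)

# Later (or another team member): pick up where you left off without
# re-running the cleaning scripts.
resumed = pd.read_sql("SELECT * FROM transactions_cleaned", engine)
```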

3. Wrangling and visualizing data with Pandas, Matplotlib and Seaborn

Capability: Getting your head around your data and preparing it for a variety of algorithms is probably the most time-consuming and important part of the Data Science life cycle. Some preparations are more easily done outside of a database, e.g. some natural language processing. Visualizing the data is really important here too.

Tool: Pick a programming language that has great data reshaping and visualization capabilities. If you work in Python, Pandas is a powerful set of data structures and algorithms for wrangling. Seaborn and Matplotlib are good places to start for visualization. And don’t waste time trying to get all these things to work together. Just use Continuum’s excellent distribution Anaconda.

Read: Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython
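Here is a small, illustrative wrangling-and-plotting sketch with Pandas and Seaborn. The CSV, column names and chart are assumptions rather than a prescribed workflow:

```python
# A small, illustrative wrangling-and-plotting sketch. The CSV, column names
# and chart are assumptions rather than a prescribed workflow.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

orders = pd.read_csv("data/raw/orders.csv", parse_dates=["order_date"])

# Reshape: total spend per customer segment per month.
monthly = (orders
           .assign(month=orders["order_date"].dt.to_period("M").dt.to_timestamp())
           .groupby(["customer_segment", "month"], as_index=False)["amount"].sum())

# Visualise the trend by segment before making any modelling decisions.
sns.lineplot(data=monthly, x="month", y="amount", hue="customer_segment")
plt.title("Monthly spend by customer segment")
plt.tight_layout()
plt.show()
```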

4. Documentation with Markdown

Capability: Data Science is useless without communication (to your customer and within your team). You could just write a report as a Word document. There’s nothing wrong with that and it’s a format your business customers will expect. However, it would be great to have documentation that is easy to version control and can be kept close to your project code.

Tool: Markdown is a nice platform-neutral way to document your project. Because it’s plain text it’s easy to version control (see above). And if your report isn’t too complicated you can convert it to Word from Markdown. Win.
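If you want that Word conversion automated, here is a tiny sketch, assuming the pandoc command line tool is installed (file names are illustrative):

```python
# A tiny sketch of automating the Markdown-to-Word conversion, assuming the
# pandoc command line tool is installed. File names are illustrative.
import subprocess

subprocess.run(
    ["pandoc", "reports/interim_report.md", "-o", "reports/interim_report.docx"],
    check=True,
)
```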

5. Fast file manipulation, cleaning, and summarising at the command line

Capability: You get hundreds of data files. You get huge files in strange formats with broken delimiters. You want to chop these up, patch them together, change their encodings, unravel XML etc etc. No, trying to open the file in a text editor or spreadsheet is not the answer.

Tool: This is best done at a powerful command line. Linux is worth learning.

Read: Data Science at the Command Line: Facing the Future with Time-Tested Tools

6. Story telling with Jupyter Notebooks

Capability: Data Science is difficult to communicate. It’s often a slightly meandering journey with dead ends, back-tracking, unexpected insights leading to new research avenues etc. When updating your customer, you need to walk them through some of this journey using narratives interleaved with graphics and tabular data. Code files won’t do. Duplicating into PowerPoint is a lot of extra work for a quick interim update.

Tool: Jupyter allows all of the above in presentation quality. The close interleaving of analysis and documentation helps other team members join a project. And it reduces duplication when you decide it’s time to stop coding and start updating your customer.

7. Build automation with Luigi

Capability: Eventually, your understanding and your code start to consolidate. There are some core datasets. They go through some agreed preparatory steps. There are some reports and algorithm datasets that you want to lock down and reproduce several times during the project. Manually running all those code files is a pain.

Tool: Build automation tools allow you to automate tasks such as executing code files, creating documentation, importing and exporting data etc. I’ve used command line scripts (see above) and software build tools like Ant for this automation. More sophisticated tools like Luigi are now reaching a level of maturity where you could consider them for your team too.
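To give a flavour of Luigi, here is a minimal two-task pipeline sketch. Task names, file paths and the cleaning step are illustrative assumptions:

```python
# A minimal Luigi sketch of a two-task pipeline. Task names, file paths and
# the cleaning step are illustrative assumptions.
import luigi
import pandas as pd


class CleanTransactions(luigi.Task):
    def output(self):
        return luigi.LocalTarget("data/derived/transactions_cleaned.csv")

    def run(self):
        raw = pd.read_csv("data/raw/transactions.csv")
        cleaned = raw.dropna(subset=["customer_id"])
        with self.output().open("w") as f:
            cleaned.to_csv(f, index=False)


class BuildFeatures(luigi.Task):
    def requires(self):
        return CleanTransactions()

    def output(self):
        return luigi.LocalTarget("data/derived/customer_features.csv")

    def run(self):
        with self.input().open("r") as f:
            cleaned = pd.read_csv(f)
        features = cleaned.groupby("customer_id", as_index=False)["amount"].sum()
        with self.output().open("w") as f:
            features.to_csv(f, index=False)


if __name__ == "__main__":
    # Rebuild everything that is out of date with one command:  python pipeline.py
    luigi.build([BuildFeatures()], local_scheduler=True)
```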

8. Workflow tracking with JIRA

Capability: What the hell is everybody doing? Where did that data come from? Where is the conversation with the system SME that led to that business rule? Where is the deliverable from 2 weeks ago and who sent it to which customer?

Tool: Workflow tracking tools like JIRA help answer all of the above questions. Look for a tool that is customizable as Data Science doesn’t need all the detail of a large scale software development project. Do make sure you track where your data is coming from and what deliverables are going out the door (see Guerrilla Analytics).

9. Packaging it all up with Vagrant

Capability: The diverse nature of Data Science activities leads to a correspondingly diverse set of tools, as you’ve seen above. When you get things working, you would rather not break them and you would rather not force every team member to go through the same painful installations and configurations and risk inconsistency.

Tool: Vagrant and other ‘dev ops’ tools allow you to define your tech setups and their configuration in program code. What does that mean? It means that you can build your entire technology stack and configure it by running some code. It also means that the installation of all your tools and their configuration can be version controlled. As your technology stack evolves, update your code and issue a new release to your team. If you trash your technology or need to move to other servers, everything you need to reproduce your environment has been captured and you should be back up and running in minutes.

Read: Vagrant: Up and Running

10. Putting it all together – Operations with Guerrilla Analytics

I’ve covered a lot here. How do you put this all together without choking a team in conventions, rules, tools etc? How do you reduce Data Science chaos and continue to deliver iteratively and at pace? That’s where Guerrilla Analytics can help.

Have a read around this blog, check out the book and get in touch with any questions!


Irish Language Data Science lecture at Engineers Ireland

I may be the first person to coin a term for Data Science in Gaeilge!

I gave the following lecture to Engineers Ireland, the Irish professional body for engineers. The lecture is about “Data Science and the benefits for engineering” and is entirely in Irish.

It was an interesting exercise to brush up on my Gaelic and also to see the wealth of resources that now exist for using Irish with modern technical vocabulary. If you are curious or are trying to get your Gaelic up to scratch then please get in touch!

In terms of content, it covers what data and data science look like and how traditional engineering problems might benefit from the application of data science.

The full video is linked below.

And here are the slides: 2016-04 Engineering Ireland_04_Gaeilge.