Building a Data Science Capability: Inspirational Keynote at Data Leaders Summit Europe

I’ve just delivered the inspirational keynote at Data Leaders Summit Europe, 2018. Lots of great engagement and feedback. In particular, it seems people liked a clear definition of what data science actually is and the practical steps (and missteps) I took in building a capability at Sainsbury’s.

You can find all the slides here on Slideshare. As always, if you have questions or want to discuss then please get in touch!

Distinguishing Data Analytics from Data Science. 5 implications for your organisation

People often struggle to distinguish Data Analytics from Data Science. These are two related but completely distinct disciplines which are both important to a business. This post distinguishes Data Analytics from Data Science and lists the implications of that distinction for your organisation.

Forget about data for a minute

Think about a traditional scientist and what they do. You wouldn’t define them by their use of a petri dish, or a microscope, or any other tool. They are defined by following the scientific method. They aim to understand the world by producing mathematical models of the world from observed data and collecting further data to validate those models. Think about a nuclear physicist. They use mathematics and computer simulation to model the behaviour of subatomic particles. If these models are good, they predict the behaviour of those subatomic particles well in all general cases. For a model to be good, scientists must be able to reproduce it and it must allow us to reason about the real world. Amazing science has been done for centuries with pen and paper and simple apparatus but always by following the scientific method.

Distinguishing Data Analytics from Data Science

Data Science is a science like any other. It is irrelevant what apparatus is used so stop defining Data Science by Machine Learning, Venn Diagrams or programming languages. A Data Scientist models a business (customers, products, processes, web sites, stores, machinery) by gathering suitable data and evaluating models. If the Data Science is successful, the models will generalise well and allow a business to make predictions and optimisations about their customers, products, processes etc. Data Scientists are effectively creating data generating processes (experiments and models) to test hypotheses.

Data Analysts look at existing data to report patterns, summaries, populations, trends etc. Yes, Data Scientists look at existing data before creating an experiment. Yes, they should do Analytics on the outputs of their experiments to understand what is going on. But fundamentally, Analytics is tactical Business Intelligence. Data Science is, well, science!

Implications for your organisation

Implication 1: Analytics needs complete data, Science does not

I often get some strange looks when data engineers and architects promise Data Science ‘all the data’ and I say they don’t need it. Data Scientists need samples of data. Yes, those samples need to be of sufficient size to build good generalisable models. Yes, the data should not be biased or biases should be clear. But using ‘all the data’ is probably a bad thing when building models. Analytics, being a type of tactical reporting, needs complete data because it is generating business KPIs. A profit KPI plus or minus 10% isn’t very helpful. A statistical definition of a customer won’t help you scale your website – you need to know hits and logins.
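
As a rough illustration of this difference, the sketch below uses made-up order values (the population, sample size and seed are all assumptions for illustration). A random sample gives an estimate with a margin of error, which is usually fine for modelling, whereas a total revenue KPI needs every row.

```python
import random
import statistics

# Hypothetical population of order values; all numbers here are made up.
rng = random.Random(42)
orders = [rng.uniform(5, 100) for _ in range(100_000)]

# Analytics: a KPI such as total revenue needs every row to be exact.
total_revenue = sum(orders)

# Science: a random sample is enough to estimate a typical order value.
sample = rng.sample(orders, 1_000)
sample_mean = statistics.mean(sample)
# Approximate 95% margin of error for the sample mean.
margin = 1.96 * statistics.stdev(sample) / len(sample) ** 0.5

print(f"Exact total revenue (all data): {total_revenue:,.2f}")
print(f"Estimated mean order value (sample): {sample_mean:.2f} +/- {margin:.2f}")
```

The estimate is good enough to reason about a typical customer, but nobody would accept the same plus or minus on a reported profit figure.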

Implication 2: Mature Analytics becomes reporting, mature Data Science becomes algorithms

If Analytics starts to identify common requests, new KPIs of interest, common sub-populations of interest to the business etc then these should be productionised in Reporting. There is no point having a team of Analytics repeatedly writing the same queries. Put them in a dashboard so the business can self-serve.

Data Science, by contrast, is creating those data consuming and data generating models. If models need to be available for decision making then those models should be productionised in algorithms.

Implication 3: Data Science results can be used by Analytics but Analytics results should rarely feed into Data Science

A business will want to report on the decisions made by the models productionised as algorithms. It therefore makes sense that tactical queries from Analytics could be run against algorithm output data.

Analytics, however, does not typically produce results that feed into Data Science work. Analytics can help inform the Data Scientist about the domain. After all, Analytics know the typical tactical reports and the typical KPIs the business uses. However, when it comes to model variables, the Data Scientist needs to figure that out for themselves and evaluate those variables as part of their experimental process.

Implication 4: Analytics has fewer dependencies than Data Science

Analytics can be run tactically off a data warehouse with few other dependencies beyond the right tools. Sure, without a reporting function to productionise Analytics, an organisation will never consolidate its tactical queries and will be forever in a state of panicked queries. But the Analytics will still get done and the business will get their information.

If Data Science models are not turned into engineered algorithms then an organisation is simply wasting its time and money on curiosities. Data Science benefits from the same warehouse of data as Analytics. But it also needs a mechanism to bring in other new data sources. It needs an engineering team to turn its models into algorithms. An engineering team needs a testing and support team to make sure things keep working. And any change in automation and decision making needs change management to make everybody comfortable with the work the algorithm will do.

Implication 5: Data Analytics to Data Science is a big career leap

The hype around Data Science has unhelpfully led to analytics professionals rebranding themselves. Awesome Analytics professionals are customer facing, know an organisation’s data inside out, can wrangle and manipulate data quickly and produce relevant and accurate business KPIs with little business steer. They tell a compelling story with visualizations. Their code will never be productionised. None of this helps them be awesome Data Scientists.

A significant part of a scientist’s work involves distilling a business objective into hypotheses. Experiments need to be designed to choose the right model and evaluate the robustness of that model. Experiments need to be assessed for significance, bias, confounding, blocking, correlations etc. And when a good model is found, a knowledge of software engineering for productionisation is required. What is the computational complexity? What data should be logged for that scientific reproducibility? What data needs to be filtered out to avoid model degradation and biased results? These are questions an Analytics professional shouldn’t need to ask.

Taking heart

Distinguishing Data Analytics from Data Science is not a competition. Both functions are clearly important for an organisation. You need the capability to mine your data for patterns and summaries that are not yet available in reporting. You need the capability to rigorously create models that can help automate decision making. Data Science may currently be more hyped than Analytics but perhaps a reckoning is coming. Models and even the model tuning process are becoming increasingly commoditised. Organisations will hopefully see sense and stop rewarding Data Scientists simply for knowing the APIs of a concoction of evolving programming libraries and will instead focus on the production of models that are understood and that they can have confidence in. That will only come from those Data Scientists who understand the scientific method and how to apply it.

13 Steps to Better Data Science: A Joel Test of Data Science Maturity

Data Science teams have different levels of maturity in terms of their ways of working. In the worst case, every team member works as an individual. Results are poorly explained and impossible to reproduce. In the best case, teams reach full scientific reproducibility with simple conventions and little overhead. This leads to efficiency and confidence in results and minimal friction in productionising models. It is important to be able to measure a team’s maturity so that you can improve your ways of working and so you can attract and retain great talent. This series of questions is a Joel Test of Data Science Maturity. As with Joel’s original test for software development, all questions are a simple Yes/No and a score below 10 is cause for concern. Depressingly, many teams seem to struggle around a 3.

A Joel Test of Data Science Maturity

  1. Are results reproducible?
  2. Do you use source control?
  3. Do you create a data pipeline that you can rebuild with one command?
  4. Do you manage delivery to a schedule?
  5. Do you capture your objectives (scientific hypotheses)?
  6. Do you rebuild pipelines frequently?
  7. Do you track bugs in your models and your pipeline code?
  8. Do you analyse the robustness of your models?
  9. Do you translate model performance to commercial KPIs?
  10. Do new candidates write code at interview?
  11. Do you have access to scalable compute and storage?
  12. Can Data Scientists install libraries and packages without intervention by IT?
  13. Can Data Scientists deploy their models with minimal dependencies on engineering and infrastructure?

 

1. Are results reproducible?

A core aspect of traditional science is that results be reproducible. This is essential when building models of the world that aim to improve our understanding of the world. It is no different for Data Science. And it turns out that reproducibility promotes efficiency. Teams no longer waste time wondering which data led to a particular result, which code led to a particular result and why results might have changed as understanding of the problem improved.

2. Do you use source control?

Building algorithms and data pipelines is complex. Source control lets you track changes to your code, roll back poor changes and try out new ideas without breaking working code.

3. Do you create a data pipeline that you can rebuild with one command?

A version controlled data pipeline allows you to centralise and consolidate your understanding of the data (business and cleaning rules) and your definition of features that feed into an algorithm. If you can rebuild this pipeline with one command then you can quickly iterate as your understanding of the problem evolves and as you inevitably discover issues with the data.
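
A minimal sketch of what "rebuild with one command" might look like is shown below. The step names (load_raw, clean, build_features) and the in-memory rows are hypothetical stand-ins for real sources; the point is only that cleaning rules and feature definitions live in one place and everything reruns from scratch.

```python
"""Minimal pipeline sketch: running this one script rebuilds everything."""

def load_raw():
    # Stand-in for reading source data; the rows are invented for illustration.
    return [
        {"customer": "a", "spend": "10.5"},
        {"customer": "b", "spend": ""},
        {"customer": "a", "spend": "4.5"},
    ]

def clean(rows):
    # Cleaning rule kept in one place: drop rows with a missing spend value.
    return [{**r, "spend": float(r["spend"])} for r in rows if r["spend"]]

def build_features(rows):
    # Feature definition kept in one place: total spend per customer.
    totals = {}
    for r in rows:
        totals[r["customer"]] = totals.get(r["customer"], 0.0) + r["spend"]
    return totals

def rebuild():
    # Every step runs from scratch, in order, each time.
    return build_features(clean(load_raw()))

if __name__ == "__main__":
    print(rebuild())
```

Saved as a single script under source control, this could be rerun with one command whenever a cleaning rule or feature definition changes.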

4. Do you manage delivery to a schedule?

Data science needs a schedule to keep it focused. As projects are often open ended and exploratory, you need to have clear checkpoints where you can make a call that perhaps ‘this data is not fit for purpose’ or ‘there is no value in further iterations of model refinement’. Teams that do not deliver to any schedule tend to drift into perfection being the enemy of done.

5. Do you capture your objectives?

Every data science problem is really an optimisation problem and you cannot optimise without an objective. Although it can sometimes feel painful or appear ‘picky’, it is essential that the objective of a project and a model are clearly defined. Increase profit? Increase volume? Increase both with some balance? Get clear and agree with your customer.
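
One way to make "increase both with some balance" concrete is to write the objective down as a function. The sketch below is illustrative only: the weight, the candidate options and their scores are all assumptions that would need to be agreed with your customer.

```python
def objective(profit, volume, profit_weight=0.7):
    """Weighted balance of profit and volume; the 0.7 weight is an assumption."""
    return profit_weight * profit + (1 - profit_weight) * volume

# Hypothetical candidate options as (profit, volume) scores on a common scale.
candidates = {
    "premium": (90, 20),
    "balanced": (70, 60),
    "discount": (30, 100),
}

best = max(candidates, key=lambda name: objective(*candidates[name]))
print(best)
```

Changing profit_weight changes which option wins, which is exactly why the balance has to be agreed before any modelling starts.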

6. Do you rebuild pipelines often?

Like traditional software, rebuilding often can highlight integration bugs. In the context of data science, integration bugs are effectively bugs in how data flows through a pipeline. If you do not rebuild often it is possible to introduce cyclic references into your data preparation, lose the logic for creation of a feature and suffer other nasty bugs that cause you to lose that essential reproducibility.
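
Cyclic references are one class of bug that a rebuild can surface automatically. The sketch below assumes the pipeline is described as a mapping from each dataset to the datasets it is built from (the dataset names are hypothetical) and uses Python's standard graphlib module to find a build order or report a cycle.

```python
from graphlib import TopologicalSorter, CycleError

# Hypothetical pipeline: each dataset maps to the datasets it is built from.
pipeline = {
    "raw_sales": set(),
    "clean_sales": {"raw_sales"},
    "features": {"clean_sales"},
    "model_input": {"features"},
}

def build_order(dependencies):
    """Return a valid build order, or raise CycleError if the pipeline is cyclic."""
    return list(TopologicalSorter(dependencies).static_order())

print(build_order(pipeline))

# A cyclic reference: clean_sales now also depends on a downstream dataset.
broken = {**pipeline, "clean_sales": {"raw_sales", "features"}}
try:
    build_order(broken)
except CycleError as err:
    print("Cycle detected:", err.args[1])
```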

7. Do you track bugs in your model and in your pipeline code?

Data science model development is complex. It has many dependencies. Customer feedback and domain knowledge are incredibly valuable. Make sure you are tracking feedback so mistakes are not repeated and so your models are always improving.

8. Do you analyse the robustness of your models?

No model will work in all scenarios and poorly performing models are dangerous. It is important to analyse and understand the conditions under which your model will work and under which it will degrade. This is robustness analysis. Are model outputs biased? Does a model require 6 months of training data or 2 weeks? Does a model only perform once it has seen 5 customer journeys? A mature data science team has confidence pushing its models into production because this type of testing has been done in advance.
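
A simple starting point for robustness analysis is to break model performance down by the condition you care about. The sketch below assumes a hypothetical evaluation log that records how many customer journeys the model had seen before each prediction; the records are invented for illustration.

```python
from collections import defaultdict

# Hypothetical evaluation log:
# (customer journeys seen before the prediction, was the prediction correct?)
results = [
    (1, False), (1, False), (1, True),
    (3, False), (3, True), (3, True),
    (5, True), (5, True), (5, True), (5, False),
    (8, True), (8, True), (8, True),
]

def accuracy_by_condition(records):
    """Group evaluation records by condition and compute accuracy for each group."""
    hits, counts = defaultdict(int), defaultdict(int)
    for condition, correct in records:
        counts[condition] += 1
        hits[condition] += int(correct)
    return {c: hits[c] / counts[c] for c in sorted(counts)}

for journeys, acc in accuracy_by_condition(results).items():
    print(f"{journeys} journeys seen: accuracy {acc:.2f}")
```

The same grouping could be applied to training window length, customer segment or any other condition under which you suspect the model degrades.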

9. Do you translate model performance to commercial KPIs?

Technical performance metrics are important for you as a technical data scientist. However, to get business buy-in and adoption of your models you need to be able to make your models commercially relevant. That means turning predictions into revenue or cost savings or time savings or whatever the business cares about and whatever will justify further funding of your work.
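
As a sketch of that translation, the function below turns the counts from a confusion matrix into a net monetary value. The churn scenario, the counts and the monetary values are all made up; real figures would come from the business.

```python
def commercial_value(true_pos, false_pos, false_neg,
                     value_per_hit, cost_per_false_alarm, cost_per_miss):
    """Translate confusion-matrix counts into a net monetary value."""
    return (true_pos * value_per_hit
            - false_pos * cost_per_false_alarm
            - false_neg * cost_per_miss)

# Hypothetical churn model: all counts and monetary values are made up.
net = commercial_value(
    true_pos=120, false_pos=40, false_neg=30,
    value_per_hit=50.0,        # revenue retained per correctly flagged customer
    cost_per_false_alarm=5.0,  # cost of an unnecessary retention offer
    cost_per_miss=50.0,        # revenue lost per churner the model missed
)
print(f"Estimated net value: {net:,.2f}")
```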

10. Do new candidates write code at interview?

Data science is full of hype, bluffers and analytics rebranding itself. You want to filter down to the great candidates who understand the scientific method and can apply it to select and tune models. A technical test that involves using data and writing code is the most effective way to do this.

11. Do you have access to scalable compute and storage?

The complex combination of technologies needed for Data Science often means that organisations struggle to enable their teams with the best technology to do their jobs. If your team does not have access to scalable compute and storage then their success will always be limited. Lack of a central place to store data and workings is a warning sign that Data Science is not taken seriously in an organisation.

12. Can data scientists install libraries and packages without intervention by IT?

If there is one word that summarises the requirements of Data Science it is ‘flexibility’. The nature of the work involves selecting models and tuning them against data. This means being able to quickly install and evaluate lots of model libraries. If a Data Science team needs approval for every library installation and upgrade then its speed of turnaround is going to slow from days to weeks and months.

13. Can Data Scientists deploy their models with minimal dependencies on engineering and infrastructure?

If models cannot be put into use they are of little value beyond curiosities. But deploying a model involves training on reproducible data, monitoring of decisions and performance and A/B testing of new releases. Delays in deployment mean models go out of date or competitive advantage is lost. The best organisations have platforms that allow model deployment to happen quickly, driven by Data Scientists.

So how do you score a 13/13?

How would your team score on a Joel Test of Data Science Maturity? This is where Guerrilla Analytics can help. Guerrilla Analytics provides guiding principles and conventions for promoting data provenance and reproducibility in Data Science and Analytics work. There are guidelines on how to structure projects at every stage of the life cycle and how to consolidate knowledge in flexible data pipelines. You will also learn how to leverage techniques and tools from software engineering such as testing and source control.

Data Science jargon buster – for Data Scientists

Bamboozled. That’s your customers’ reaction to the Data Scientists in your organisation. Data Scientists need to communicate without jargon so customers understand, believe and care about their recommendations. Here is a Data Science jargon buster to help.

Data Science is a technical field that applies scientific rigour to the understanding of business data and the associated processes and products. Like traditional science, it can be full of jargon that leads to unclear business messages and a failure of findings to be understood and acted upon. Having reviewed countless Data Science reports and presentations that are not ready for business readers, I thought it would be useful to provide a list of business terms to replace Data Science jargon.

Data Science Jargon Buster

[table id=1 /]

 

Eradicating Data Science jargon from your team

Do you have some jargon you would like to eradicate from your team? Get in touch! Let’s build this list together.

This post is inspired by a recent talk at our office by the always entertaining and informative David Reed, currently at DataIQ. David cited a speaker (the name escapes me) who had put together a similar lookup from Data Science jargon to business terms that business users understand.

 

Reproducible Data Science: faster iterations, reviews and production

Data Science involves applying the scientific method to the discovery of opportunities and efficiencies in business data. An essential part of the scientific method is reproducibility. Reproducible Data Science is essential for scientific credibility but also improves your Data Science efficiency in 3 key ways. If you start to apply the 7 Principles of Guerrilla Analytics your teams will quickly achieve reproducibility and benefit from these efficiencies.


  1. Faster iterations: You can iterate quicker if your outputs are tracked, and the code version and source data for those outputs can be retrieved. This is the only way to keep track of what is an inherently complex process of changing data, changing understanding, changing requirements and combinations of inputs. With reproducibility under control, not only are you working at speed with your customer but you also gain their credibility because you are always on top of your evolving numbers and KPIs.
  2. Faster reviews: Your team can review and hand over work more easily when everything works out of the box. There is no struggle rebuilding environments. There are no repeated conversations over what code was throwaway and what code is essential to the science being reviewed. There is no need to do team forensics just to understand what exactly your team sent out the door.
  3. Faster to production: a reproducible data processing pipeline, version controlled algorithm code and environments as code all reduce the friction in moving algorithms from a Data Science team into a production development team. Day 1: point data science code at production environment. Day 2: begin refining and refactoring without breaking functionality.
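
To make the idea of tracked outputs concrete, here is a minimal sketch of a provenance record that ties an output to the exact source data and code version that produced it. The file name, data bytes and version string are hypothetical; in practice the code version would typically be a source control commit id.

```python
import hashlib
import json
from datetime import datetime, timezone

def fingerprint(data: bytes) -> str:
    """Content hash that identifies exactly which data produced an output."""
    return hashlib.sha256(data).hexdigest()

def provenance_record(output_name: str, source_data: bytes, code_version: str) -> dict:
    # code_version is passed in here; it would usually be a commit id.
    return {
        "output": output_name,
        "source_sha256": fingerprint(source_data),
        "code_version": code_version,
        "created_utc": datetime.now(timezone.utc).isoformat(),
    }

# Hypothetical inputs for illustration only.
record = provenance_record("weekly_kpis.csv", b"customer,spend\na,10\n", "abc1234")
print(json.dumps(record, indent=2))
```

Storing a record like this alongside every output is one lightweight way to answer "which data and which code produced this number?" without team forensics.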


If you are curious about how reproducible Data Science was achieved and maintained in teams of up to 15 analysts on large and fast paced projects then please have a look at the book “Guerrilla Analytics: a practical approach to working with data” (USA) (UK).

To Become A Data Scientist, Focus On Competencies before Skills

Programming language version 3.2. SQL, NoSQL, NewSQL. It seems that too often, the path to become a Data Scientist involves skills in vogue rather than more permanent competencies. In a fast paced field like Data Science, skills are more tangible. They can be directly tested. They can be dated to the latest technology or the latest language version.

Competencies are different. Competencies are a more general combination of skills, behaviours and knowledge.

You can have great PowerPoint skills creating beautiful slides but still be a terrible communicator. You can be skilled at Python syntax but still be a poor programmer. Communication and programming are competencies. It is competencies that are most important when you build a Data Science career that is robust to changing trends in skills like languages and technology platforms.

Here are some key competencies and example skills for successful Data Science.

  • Communication: data and data science are complex. You need to be able to really listen and understand the problem a customer wants solved. You need to be able to communicate your solution at the right level so your customer can take action.
    Typical skills include report writing, presentation, speaking and storytelling.
  • Data Modelling: you will encounter data from a variety of systems. You will need to organise your project data for flexible efficient use over the course of a project. This means being able to model data.
    Typical skills include database design, SQL, normal forms, table design, indexing.
  • Data wrangling: you will need to reshape data so it can be visualized, profiled and made ready for algorithms.
    Typical skills include data manipulation libraries like pandas, languages like Python or R and visualization libraries.
  • Programming and Tuning Algorithms: ultimately, you will produce an algorithm that captures your data science insights. Algorithms need to be tuned to data and their performance and robustness quantified. There will be scenarios where the algorithm works well and scenarios where it does not. There will be efficient structures and structures that give the same results but with dramatically reduced efficiency.
    Typical skills include a programming language, data structures, code testing, version control, complexity.
  • Data pipelining: once data is fairly well understood and candidate algorithms have been identified, you will begin iterating through data and algorithms to get to your final insights. This iteration is most effective when data can be torn down and rebuilt repeatably and quickly. This is a data build.
    Typical skills include SQL, pipeline management libraries, ETL design.
  • Design of Experiments: Data Science is about applying the scientific method to understand data. Being able to design and execute an experiment is essential to being able to test an algorithm or model in the wild and demonstrate cause rather than correlation.
    Typical skills include experiment layout, randomisation and blocking, statistical inference, and hypothesis testing.
  • Consulting: consulting is about influencing without power. Significant numbers of data science projects fail because data scientists could not convince their customers to change their business and use their data science. It may seem bizarre but like all aspects of change, data science has an impact on people, existing processes and existing technology.
    Typical skills include meeting and workshop facilitation, stakeholder management and mapping and influencing.
  • Project Management: last but not least, if you cannot run a project (even as the sole data scientist on a project) then all the above is irrelevant. Project management is a hugely diverse and complex field. However there are some key skills that will help your data science succeed.
    Typical skills include estimation, planning, budgeting, resourcing, and RAID management.
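
As a small illustration of the point made under Programming and Tuning Algorithms about structures that give the same results with very different efficiency, the sketch below finds customer ids common to two systems in two ways. The ids and system names are made up.

```python
# Hypothetical data: customer ids seen in two systems.
crm_ids = list(range(0, 6_000, 2))
web_ids = list(range(0, 6_000, 3))

def overlap_with_list(a, b):
    # Each membership test scans the list: roughly len(a) * len(b) comparisons.
    return [x for x in a if x in b]

def overlap_with_set(a, b):
    # A set gives near constant-time membership tests: roughly len(a) + len(b) work.
    b_set = set(b)
    return [x for x in a if x in b_set]

# Both approaches return the same ids; only the amount of work differs.
print(overlap_with_list(crm_ids, web_ids) == overlap_with_set(crm_ids, web_ids))
```

Knowing the syntax for both is a skill. Knowing which one will still run in reasonable time on millions of rows is the competency.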

You can read more about the competencies and skills to become a Data Scientist in my book Guerrilla Analytics, available here (USA) and here (UK).

The Rigour of Science is Essential for Successful Data Science in Business

A/B tests! Machine learning! Deep learning! It’s easy to be distracted by new libraries and beautiful visualizations. It’s easy to waste time with scattergun approaches to data and algorithms. It’s easy to forget that as a data scientist you should be taking a scientific approach to understanding data.

Is scientific rigour appropriate in industry or should it be confined to academic research? Is it not better to design an experiment instead of diving into the data? And should you care about reproducibility or move quickly to the next project?

Actually, the rigour of Science is essential for successful Data Science in business. The scientific method maps nicely to a sensible approach to data science projects in business. This post will show you how.


A definition of Science

Here is a definition of ‘science’ followed by what that means for data science in business.

“the systematic study of the structure and behaviour of the… world through observation and experiment” (Oxford English Dictionary, https://en.oxforddictionaries.com/definition/science)


Without getting into a whole field of philosophy, this ‘systematic’ scientific method generally involves the following steps.

  1. Formulate a question
  2. Formulate a hypothesis
  3. Make a prediction
  4. Experiment
  5. Analyse results


So how should data science apply this scientific method in business?

The Rigour of Science is Essential for Successful Data Science in Business

Here are the steps in the scientific method and a data science example.

The Business Objective

  • Formulation of a question. This is the most challenging and most important step for science in general and data science in business. This can be closed like ‘why are our sales decreasing?’ or can be open ended like ‘can we solve problem X?’
    • Example: our buying and marketing teams might engage data science to ask ‘can we predict what customers want to buy?’
    • Success in business: If you are not clear on your objective, your project is heading for trouble. Without a well formulated question, data projects become mired in complexity, go off course and fail.


The Business Case

  • Formulation of a hypothesis (conjecture) about a population. This is a testable conjecture that rejects the status quo (in science jargon, the status quo is the null hypothesis).
    • Example: a hypothesis might be that a logistic regression applied to established customers (the population) will make better predictions than chance (or the current algorithm in use).
    • Success in business: this is your business case. This is the outcome that would cause the business to change the way they work.


  • Prediction. This is the logical consequence of the hypothesis. The less likely it is that the predicted outcome could occur by coincidence, the stronger the case for rejecting the status quo.
    • Example: we could predict an improvement in the number of correctly predicted purchases due to the new algorithm.
    • Success in business: this is an extension of the business case. If the business case were successful, then this is what we predict will happen.


Evaluating the Business Case

  • Testing with experiment. Here is where the prediction is tested in the real world. Experiment design is a field in its own right that I will blog about separately. For our customer algorithm, we might put it live on an online website or use some other means to get new predictions in front of real world customers.
    • Success in business: this is where we rigorously evaluate the business case. Experiment design is critical to counteract biases, random chance and external influences that we cannot control.


  • Analysing experiment results. This is where the data gathered from the experiment is analysed to determine if the status quo should be rejected. In the example, we would look for a significant difference in customer predictions using the new algorithm instead of the incumbent approach.
    • Success in business: this is where we rigorously decide whether the business case will be a success. Because we are leveraging all the previous steps, our conclusion to reject the status quo and change the business can be done with confidence.
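
The analysis step for the customer prediction example could be sketched as a one-sided two-proportion z-test, as below. This is one common choice of test rather than the only valid one, and the counts and the 0.05 significance threshold are assumptions for illustration.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """One-sided test: is the success rate of B higher than that of A?"""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    p_value = 1 - NormalDist().cdf(z)
    return z, p_value

# Hypothetical experiment: correctly predicted purchases out of the customers
# shown the incumbent algorithm (A) and the new algorithm (B).
z, p_value = two_proportion_z_test(successes_a=200, n_a=2000,
                                   successes_b=260, n_b=2000)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
if p_value < 0.05:  # assumed significance threshold
    print("Reject the status quo: the new algorithm predicts better.")
else:
    print("No evidence to reject the status quo.")
```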


Rigour of Science drives good data science practices

In addition, the following scientific principles should be adhered to:

  • Repeatability. It should be possible to run an experiment again and get the same results and conclusions.
    • Success in business: this is how you have confidence that your results are generally applicable to the business. You can take that algorithm out of the lab and run it in production.


  • Reproducibility. It should be possible for somebody else to independently follow your experiment steps and get the same results and conclusions.
    • Success in business: many businesses are seasonal. Teams change, projects pause and are restarted. Reproducibility allows other teams to successfully inherit your work and use it with confidence.


  • Avoidance of bias. This can be human bias due to our inherent subjectivity but can also inadvertently be introduced through poor experiment design.
    • Success in business: This is how you stop those great results being extrapolated and interpolated in ways you never intended.
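
For experiments that involve randomness, one small habit that supports repeatability is fixing and recording the random seed. The sketch below uses a hypothetical random assignment of customers to control and treatment groups; the customer names and seed value are made up.

```python
import random

def run_experiment(seed):
    """Hypothetical experiment: randomly assign 10 customers to control or treatment."""
    rng = random.Random(seed)  # local generator so results depend only on the seed
    customers = [f"customer_{i}" for i in range(10)]
    rng.shuffle(customers)
    return {"control": customers[:5], "treatment": customers[5:]}

first = run_experiment(seed=2018)
second = run_experiment(seed=2018)
print(first == second)  # the same seed gives the same assignment
```

Random assignment also helps with avoiding bias, because nobody chooses which customers see the new algorithm.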


You can read more about how to bring scientific rigour to your data science in my book Guerrilla Analytics (UK) (USA). It contains simple principles for maintaining reproducibility and repeatability while delivering at pace.

If you would like to discuss further please get in touch!