Building a Data Science Capability: Inspirational Keynote at Data Leaders Summit Europe

I’ve just delivered the inspirational keynote at Data Leaders Summit Europe 2018. There was lots of great engagement and feedback. In particular, it seems people liked a clear definition of what data science actually is and the practical steps (and missteps) I took in building a capability at Sainsbury’s.

13 Steps to Better Data Science: A Joel Test of Data Science Maturity

Data Science teams have different levels of maturity in terms of their ways of working. In the worst case, every team member works as an individual. Results are poorly explained and impossible to reproduce. In the best case, teams reach full scientific reproducibility with simple conventions and little overhead. This leads to efficiency and confidence in results and minimal friction in productionising models. It is important to be able to measure a team’s maturity so that you can improve your ways of working and so you can attract and retain great talent. This series of questions is a Joel Test of Data Science Maturity. As with Joel’s original test for software development, all questions are a simple Yes/No and a score below 10 is cause for concern. Depressingly, many teams seem to struggle around a 3.

Data Science jargon buster – for Data Scientists

Bamboozled. That’s your customers’ reaction to the Data Scientists in your organisation. Data Scientists need to communicate without jargon so customers understand, believe and care about their recommendations. Here is a Data Science jargon buster to help with communicating data science project results.

Reproducible Data Science: faster iterations, reviews and production

Data Science involves applying the scientific method to the discovery of opportunities and efficiencies in business data. An essential part of the scientific method is reproducibility. Reproducible Data Science is essential for scientific credibility but also improves your Data Science efficiency in 3 key ways: faster iterations, reviews and pushes to production.

If you start to apply the 7 Principles of Guerrilla Analytics your teams will quickly achieve reproducibility and benefit from these efficiencies.

To Become A Data Scientist, Focus On Competencies before Skills

Too often, the path to becoming a Data Scientist focuses on technology skills in vogue rather than more permanent competencies. Competencies are a more general combination of skills, behaviours and knowledge. You can have great PowerPoint skills and create beautiful slides but still be a terrible communicator. It is competencies that matter most when you build a Data Science career that is robust to changing trends in skills like languages and technology platforms. This post describes the most important competencies for being successful in data science.

The Rigour of Science is Essential for Successful Data Science in Business

The rigour of Science is essential for successful Data Science in business. The scientific method helps drive successful data science projects in business. This post will show you how.

Data Science – A Definition And How To Get Started

Confusion, hype, failure to start. Data Science has huge potential to change an organisation. But many organisations become mired in the associated cultural, technological and people change. Data Science is delivered as an interesting report rather than a driver of change. Data Science identifies algorithms that are run in the safety of a lab but never make it into production.

This is my keynote talk from the Polish Data Science with Business Conference.

Irish Language Data Science lecture at Engineers Ireland

I gave the following lecture to Engineers Ireland, the Irish professional body for engineers. It’s about “Data Science and the benefits for engineering” and is entirely in Irish.

It was an interesting exercise to brush up on my Gaelic and also to see the wealth of resources that now exist for using Irish with modern technical vocabulary. If you are curious or are trying to get your Gaelic up to scratch then please get in touch!

I may be the first person to coin Data Science in Gaeilge!

In terms of content, it covers what data and data science look like and how traditional engineering problems might benefit from the application of data science.

The full video is linked below.

And here are the slides 2016-04 Engineering Ireland_04_Gaeilge.

Data Science Patterns: Preparing Data for Agile Data Science

Are you a data scientist working on a project with constantly changing requirements, flawed changing data and other disruptions? Guerrilla Analytics can help.

The key to a high performing Guerrilla Analytics team is its ability to recognise common data preparation patterns and quickly implement them in flexible, defensive data sets.

After this webinar, you’ll be able to get your team off the ground fast and begin demonstrating value to your stakeholders.

You will learn about:
* Guerrilla Analytics: a brief introduction to what it is and why you need it for your agile data science ambitions
* Data Science Patterns: what they are and how they enable agile data science
* Case study: a walkthrough of some common patterns in use in real projects

I recently gave a webinar on Data Science Patterns. The slides are here.

Data Science Patterns, as with Software Engineering Patterns, are ‘common solutions to recurring problems’. I was inspired to put this webinar together based on a few things.

  • I build Data Science teams. Repeatedly, I find teams working inconsistently in terms of the data preparation approaches, structures and conventions they use. Patterns help resolve this problem. Without patterns, you end up with code maintenance challenges, difficulty in supporting junior team members and all-round team inefficiency due to having a completely ad-hoc approach to data preparation.
  • I read a recent paper, ‘Tidy Data’ by Hadley Wickham, in the Journal of Statistical Software (http://vita.had.co.nz/papers/tidy-data.pdf). This paper gives an excellent, clear description of what ‘tidy data’ is: the data format used by most Data Science algorithms and visualizations. While there isn’t anything new here if you have a computer science background, Wickham’s paper is an easy read and has some really clear worked examples.
  • My book, Guerrilla Analytics (here for USA or here for UK), has an entire appendix on data manipulation patterns and I wanted to share some of that thinking with the Data Science community.
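
To make the ‘tidy data’ idea concrete, here is a minimal sketch of one common preparation pattern: reshaping a ‘wide’ table (one column per quarter) into a tidy one where each row is a single observation. The store names, the figures and the `melt` helper are made up for illustration; they are not taken from the webinar or from Wickham’s paper.

```python
# Hypothetical wide table: one row per store, one column per quarter.
wide_rows = [
    {"store": "A", "Q1": 100, "Q2": 120},
    {"store": "B", "Q1": 80, "Q2": 95},
]


def melt(rows, id_col, var_name, value_name):
    """Reshape wide rows into tidy rows: one observation per row."""
    tidy = []
    for row in rows:
        for column, value in row.items():
            if column == id_col:
                continue  # the identifier is repeated on every tidy row instead
            tidy.append({id_col: row[id_col], var_name: column, value_name: value})
    return tidy


tidy_rows = melt(wide_rows, id_col="store", var_name="quarter", value_name="sales")
for tidy_row in tidy_rows:
    print(tidy_row)
```

Once the data is in this shape, grouping, filtering and plotting by quarter no longer depend on how many quarter columns the source file happened to have.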

I hope you enjoy the webinar and find it useful. You can hear the recording here. Do get in touch with your thoughts and comments as I think Data Science patterns is a huge area of potential improvement in the maturity of Data Science as a field.

How Do I Avoid Bias In My Data Science Work?

The danger of bias hasn’t been given enough consideration in Data Science. Bias is anything that would cause us to skew our conclusions and not treat results and evidence objectively. Bias is sometimes unavoidable, sometimes accidental and unfortunately sometimes deliberate. While bias is well recognised as a danger in mainstream science, I think Data Science could benefit from improving in this area. In this post I categorise the types of bias encountered in typical Data Science work. I have gathered these from recent blog posts [1], [2], [3] and a discussion in my PhD thesis [4]. I also show how to reduce bias using some of the principles you can learn about in Guerrilla Analytics: A Practical Approach to Working with Data.

Comparing apples and bananas on a scale

8 Types of Bias in Data Science

The first step is to be aware of the types of bias you may encounter.

  1. Confirmation bias. People are less critical of Data Science that supports their prior beliefs than of Data Science that challenges their convictions.
    • This happens when results that go against the grain are rejected in favour of results that promote ‘business as usual’. Was the latest quarterly marketing campaign really successful across the board or just for one part of the division?
  2. Rescue bias. This bias involves selectively finding faults in an experiment that contradicts expectations. It is generally a deliberate attempt to evade and undermine evidence.
    • You may fall for this bias when your project results are disappointing. Perhaps your algorithm can’t classify well enough. Perhaps the data is too sparse. The Data Scientist tries to imply that results would have been different had the experiment been different. Doing this is effectively drawing conclusions without data and without experiments.
  3. ‘Time will tell’ bias. Taking time to gather more evidence should increase our confidence in a result. This bias affects the amount of such evidence that is deemed necessary to accept the results.
    • You may encounter this bias when a project is under pressure to plough ahead rather than waiting for more data and more confident Data Science. Should you draw conclusions based on one store or wait until you have more data from a wide variety of stores and several seasons?
  4. Orientation bias. This is the tendency for experimental and recording errors to fall in the direction that supports the hypothesis.
    • You may encounter this bias when your work is needed to support a business decision that has already been made. This arises in the pharmaceuticals industry, for example, where trials favour the new pharmaceutical drugs.
  5. Cognitive bias. This is the tendency to make skewed decisions based on pre-existing factors rather than on the data and other hard evidence.
    • This might be encountered where the Data Scientist has to argue against a ‘hunch’ from ‘experience’ that is not supported by hard data.
  6. Selection bias. This is the tendency to skew your choice of data sources to those that may be most available, convenient and cost-effective for your purposes.
    • You will encounter this bias when you have to ‘demonstrate value’ on a project that has not been properly planned. The temptation is to do ‘best endeavours’ with the data available.
  7. Sampling bias. This is the tendency to skew the sampling of data sets toward subgroups of the population.
    • An oft-quoted example here is the use of Twitter data to make broad inferences about the population. It turns out that the sample of Twitter users is biased towards certain locations, incomes and education levels.
  8. Modelling bias. This is the tendency to skew Data Science models by starting with a biased set of assumptions about the problem. This leads to selection of the wrong variables, the wrong data, the wrong algorithms and the wrong metrics.
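
Sampling bias (item 7) is easy to demonstrate with a small simulation. The sketch below is illustrative only: the population, the ‘online’ and ‘offline’ subgroups and the spend figures are all invented. It compares an estimate from a random sample with one drawn only from a convenient subgroup.

```python
import random
from statistics import mean

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 20% of people are in the "online" subgroup,
# and that subgroup has a higher average spend than everyone else.
population = (
    [{"group": "online", "spend": random.gauss(60, 10)} for _ in range(2000)]
    + [{"group": "offline", "spend": random.gauss(40, 10)} for _ in range(8000)]
)
true_mean = mean(person["spend"] for person in population)

# Representative sample: every person is equally likely to be chosen.
random_sample = random.sample(population, 500)

# Biased sample: drawn only from the convenient "online" subgroup.
online_only = [person for person in population if person["group"] == "online"]
biased_sample = random.sample(online_only, 500)

random_estimate = mean(person["spend"] for person in random_sample)
biased_estimate = mean(person["spend"] for person in biased_sample)

print(f"population mean:        {true_mean:.1f}")
print(f"random-sample estimate: {random_estimate:.1f}")
print(f"biased-sample estimate: {biased_estimate:.1f}")
```

Both samples are the same size, so collecting more rows from the convenient subgroup would not fix the biased estimate. Only changing how the sample is drawn does.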

Reducing Bias

So what can you do to counter these biases in your work?

The first step is awareness and hopefully the above list will help you and your colleagues. If you know about bias, you can remain alert to it in your own work and that of others. Be critical and always challenge assumptions and designs.

The next best thing is to do what scientists do and make your work as reproducible and transparent as possible.

  • Track your data sources and profile your raw data as much as possible. Look at direct metrics from your data such as distributions and ranges. But also look at the qualitative information about the data. Where did it come from? How representative is it?
  • Make sure your data transformations and their influence on your populations can be clearly summarised. Are you filtering data? Why and so what? How are you calculating your variables and have you evaluated alternatives? Where is the evidence for your decision?
  • Track all your work products and data understanding as they evolve with the project. This allows you to look back at the exploration routes you discarded or didn’t have time to pursue.
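
The first two points above can be sketched in a few lines of code. This is a minimal illustration under stated assumptions, not a prescribed implementation: the records are invented, and `profile` and `filter_with_audit` are hypothetical helper names. The idea is to profile a raw field before touching it, and to record the reason for and effect of every filter so its influence on the population can be summarised later.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical raw records; in a real project these come from a tracked source.
raw = [
    {"store": "A", "region": "north", "sales": 120.0},
    {"store": "B", "region": "north", "sales": 95.5},
    {"store": "C", "region": "south", "sales": None},
    {"store": "D", "region": "south", "sales": 310.0},
    {"store": "E", "region": "east", "sales": 42.0},
]


def profile(rows, field):
    """Summarise the range and centre of one numeric field, counting gaps."""
    values = [row[field] for row in rows if row[field] is not None]
    return {
        "count": len(rows),
        "missing": len(rows) - len(values),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "median": median(values),
    }


def filter_with_audit(rows, keep, reason):
    """Apply a filter and return the kept rows plus a record of its effect."""
    kept = [row for row in rows if keep(row)]
    audit = {
        "reason": reason,
        "rows_before": len(rows),
        "rows_after": len(kept),
        "rows_dropped": len(rows) - len(kept),
    }
    return kept, audit


before = profile(raw, "sales")
clean, audit = filter_with_audit(
    raw, keep=lambda row: row["sales"] is not None, reason="drop rows with missing sales"
)

# Check whether the filter shifted the mix of regions in the population.
regions_before = Counter(row["region"] for row in raw)
regions_after = Counter(row["region"] for row in clean)

print(before)
print(audit)
print(dict(regions_before), dict(regions_after))
```

Here the filter removes one of the two southern stores, so any conclusion about the south now rests on a single store. That is exactly the kind of "why and so what?" the audit record is meant to surface.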

Conclusion

Bias is sometimes unavoidable because of funding, politics or resource constraints. However, that does not mean you can ignore bias. Recognising the types of bias and understanding their impact on your conclusions will make you a better Data Scientist and improve the quality of your conclusions.

You can read more about how to do reproducible, testable Data Science that helps defend against bias in my book Guerrilla Analytics: A Practical Approach to Working with Data. Can you think of any other biases? Please get in touch!

References

  1. Data Scientist: Bias, Backlash and Brutal Self-Criticism, James Kobielus, 16 May 2013, http://www.ibmbigdatahub.com/blog/data-scientist-bias-backlash-and-brutal-self-criticism
  2. The Hidden Biases in Big Data, Kate Crawford, 1 April 2013, https://hbr.org/2013/04/the-hidden-biases-in-big-data
  3. 7 Common Biases That Skew Big Data Results, Lisa Morgan, 9 July 2015, http://www.informationweek.com/big-data/big-data-analytics/7-common-biases-that-skew-big-data-results/d/d-id/1321211
  4. Design of Experiments for the Tuning of Optimisation Algorithms, PhD thesis, University of York, 2004, http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.332.9333&rep=rep1&type=pdf