How Do I Avoid Bias In My Data Science Work?

The danger of bias hasn’t been given enough consideration in Data Science. Bias is anything that would cause us to skew our conclusions and not treat results and evidence objectively. Bias is sometimes unavoidable, sometimes accidental and unfortunately sometimes deliberate. While bias is well recognised as a danger in mainstream science, I think Data Science could benefit from improving in this area. In this post I categorise the types of bias encountered in typical Data Science work. I have gathered these from recent blog posts [1], [2], [3] and a discussion in my PhD thesis [4]. I also show how to reduce bias using some of the principles you can learn about in Guerrilla Analytics: A Practical Approach to Working with Data.

Comparing apples and bananas on a scale

The danger of bias hasn’t been given enough consideration in Data Science. Bias is anything that would cause us to skew our conclusions and not treat results and evidence objectively. It is sometimes unavoidable, sometimes accidental and unfortunately sometimes deliberate. While bias is well recognised as a danger in mainstream science, I think Data Science could benefit from improving in this area.

In this post I categorise the types of bias encountered in typical Data Science work. I have gathered these from recent blog posts [1], [2], [3] and a discussion in my PhD thesis [4]. I also show how to reduce bias using some of the principles you can learn about in Guerrilla Analytics: A Practical Approach to Working with Data.

8 Types of Bias in Data Science

The first step is to be aware of the types of bias you may encounter.

  1. Confirmation bias. People are less critical of Data Science that supports their prior beliefs rather than challenges their convictions.
    • This happens when results that go against the grain are rejected in favour of results that promote ‘business as usual’. Was the latest quarterly marketing campaign really successful across the board or just for one part of the division?
  2. Rescue bias. This bias involves selectively finding faults in an experiment that contradicts expectations. It is generally a deliberate attempt to evade and undermine evidence.
    • You may fall for this bias when your project results are disappointing. Perhaps your algorithm can’t classify well enough. Perhaps the data is too sparse. The Data Scientist tries to imply that results would have been different had the experiment been different. Doing this is effectively drawing conclusions without data and without experiments.
  3. ‘Time will tell’ bias. Taking time to gather more evidence should increase our confidence in a result. This bias affects the amount of such evidence that is deemed necessary to accept the results.
    • You may encounter this bias when a project is under pressure to plough ahead rather than waiting for more data and more confident Data Science. Should you draw conclusions based on one store or wait until you have more data from a wide variety of stores and several seasons?
  4. Orientation bias. This reflects a phenomenon of experimental and recording error being in the direction that supports the hypothesis.
    • You may encounter this bias when your work is needed to support a business decision that has already been made. This arises in the pharmaceuticals industry, for example, where trials favour the new pharmaceutical drugs.
  5. Cognitive bias: This is the tendency to make skewed decisions based on pre-existing factors rather than on the data and other hard evidence.
    • This might be encountered where the Data Scientist has to argue against a ‘hunch’ from ‘experience’ that is not supported by hard data.
  6. Selection bias: This is the tendency to skew your choice of data sources to those that may be most available, convenient and cost-effective for your purposes.
    • You will encounter this bias when you have to ‘demonstrate value’ on a project that has not been properly planned. The temptation is to do ‘best endeavors’ with the data available.
  7. Sampling bias: This is the tendency to skew the sampling of data sets toward subgroups of the population.
    • An oft-quoted example here is the use of Twitter data to make broad inferences about the population. It turns out that the Twitter users sample is biased towards certain locations, certain incomes and education levels etc.
  8. Modelling bias: This is the tendency to skew Data Science models by starting with a biased set of assumptions about the problem. This leads to selection of the wrong variables, the wrong data, the wrong algorithms and the wrong metrics.

Reducing Bias

So what can you do to counter these biases in your work?

The first step is awareness and hopefully the above list will help you and your colleagues. If you know about bias, you can remain alert to it in your own work and that of others. Be critical and always challenge assumptions and designs.

The next best thing is to do what scientists do and make your work as reproducible and transparent as possible.

  • Track your data sources and profile your raw data as much as possible. Look at direct metrics from your data such as distributions and ranges. But also look at the qualitative information about the data. Where did it come from? How representative is this?
  • Make sure your data transformations and their influence on your populations can be clearly summarised. Are you filtering data? Why and so what? How are you calculating your variables and have you evaluated alternatives? Where is the evidence for your decision?
  • Track all your work products and data understanding as they evolve with the project. This allows you to look back at the exploration routes you discarded or didn’t have time to pursue.

Conclusion

Bias is sometimes unavoidable because of funding, politics or resources constraints. However that does not mean you can ignore bias. Recognising the types of bias, and understanding their impact on your conclusions will make you a better Data Scientist and improve the quality of your conclusions.

You can read more about how to do reproducible, testable Data Science that helps defend against bias in my book Guerrilla Analytics: A Practical Approach to Working with Data. Can you think of any other biases? Please get in touch!

References

  1. Data Scientist: Bias, Backlash and Brutal Self-Criticism, James Kobielus, MAY 16, 2013, http://www.ibmbigdatahub.com/blog/data-scientist-bias-backlash-and-brutal-self-criticism
  2. The Hidden Biases in Big Data, Kate Crawford APRIL 01, 2013, https://hbr.org/2013/04/the-hidden-biases-in-big-data
  3. 7 Common Biases That Skew Big Data Results, 9th July 2015 Lisa Morgan, http://www.informationweek.com/big-data/big-data-analytics/7-common-biases-that-skew-big-data-results/d/d-id/1321211
  4. Design of Experiments for the Tuning of Optimisation Algorithms, 2004, University of York, http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.332.9333&rep=rep1&type=pdf

Big Data Debate: The Controversial Questions at Google campus

I was recently invited to take part on the panel at the Big Data Debate (@bigdatadebate) at Google’s campus near Old Street in London1.

Big Data Debate 2

It was a great opportunity to meet like minded folks such as Christian Prokopp @prokopp Rangespan, Paul Bradshaw @paulbradshaw, Duncan Ross @duncan3ross Teradata, Daniel Hulme Satalia, Michael Cutler @cotdp TUMRA, Andy Piper @andypiper Pivotal and Will Scott Moncrieff  from DueDil. Overall it was an interesting debate with some interesting contributions from the panel and the packed house.

Big Data Debate 3

We spent perhaps half of the panel hour and most of the audience questions on data privacy. I guess this is revealing in itself if such concerns are at the forefronts of the public’s mind as opposed to the opportunities presented by data analytics.

Christian did start one controversial question with me. Paraphrasing, it was around the dangers that arise when we have the potential to mine vast quantities of data looking for patterns. My answer, as it has been since my PhD days is that this is simply poor methodology *whatever* the volumes of data you are analysing. A data science methodology should allow us to answer questions (test hypotheses) about a problem (as described by data) while reducing bias as far as possible. Think about that. If you go trawling for an effect that you expect to exist in data you will eventually find it. Instead, your approach should be:

  • understand the problem (talk to the business, formulate a scientific theory)
  • turn the problem into hypotheses (our campaign increased sales, a fraudulent user has a log pattern that is different from his peers etc)
  • decide what effect is practically significant
  • then you go and apply an appropriate statistical test with the correct sample size, and power. You check the test’s assumptions.
  • when you don’t find what you were looking for, you can’t keep changing your effect sizes and revisiting the data! That’s cherry picking or a confirmation bias.

So I suppose the answer to Christian’s question is a ‘yes’ but it has nothing to do with ‘Big Data’. Big Data is dangerous because new tools and hype can lead to folks forgetting that garbage in results in garbage out. You have to understand the data and the rigorous analysis you are applying – just like any scientist.

Here are some recent good reads:

[1] I am employed by KPMG, one of the event sponsors