How Do I Avoid Bias In My Data Science Work?

The danger of bias hasn’t been given enough consideration in Data Science. Bias is anything that would cause us to skew our conclusions and not treat results and evidence objectively. Bias is sometimes unavoidable, sometimes accidental and unfortunately sometimes deliberate. While bias is well recognised as a danger in mainstream science, I think Data Science could benefit from improving in this area. In this post I categorise the types of bias encountered in typical Data Science work. I have gathered these from recent blog posts [1], [2], [3] and a discussion in my PhD thesis [4]. I also show how to reduce bias using some of the principles you can learn about in Guerrilla Analytics: A Practical Approach to Working with Data.

8 Types of Bias in Data Science

The first step is to be aware of the types of bias you may encounter.

  1. Confirmation bias. People are less critical of Data Science that supports their prior beliefs than of Data Science that challenges their convictions.
    • This happens when results that go against the grain are rejected in favour of results that promote ‘business as usual’. Was the latest quarterly marketing campaign really successful across the board or just for one part of the division?
  2. Rescue bias. This bias involves selectively finding faults in an experiment that contradicts expectations. It is generally a deliberate attempt to evade and undermine evidence.
    • You may fall for this bias when your project results are disappointing. Perhaps your algorithm can’t classify well enough. Perhaps the data is too sparse. The Data Scientist tries to imply that results would have been different had the experiment been different. Doing this is effectively drawing conclusions without data and without experiments.
  3. ‘Time will tell’ bias. Taking time to gather more evidence should increase our confidence in a result. This bias affects the amount of such evidence that is deemed necessary to accept the results.
    • You may encounter this bias when a project is under pressure to plough ahead rather than waiting for more data and more confident Data Science. Should you draw conclusions based on one store or wait until you have more data from a wide variety of stores and several seasons?
  4. Orientation bias. This is the tendency for experimental and recording errors to fall in the direction that supports the hypothesis.
    • You may encounter this bias when your work is needed to support a business decision that has already been made. It arises in the pharmaceutical industry, for example, where trials can favour the new drug.
  5. Cognitive bias. This is the tendency to make skewed decisions based on pre-existing factors rather than on the data and other hard evidence.
    • This might be encountered where the Data Scientist has to argue against a ‘hunch’ from ‘experience’ that is not supported by hard data.
  6. Selection bias. This is the tendency to skew your choice of data sources towards those that are most available, convenient and cost-effective for your purposes.
    • You will encounter this bias when you have to ‘demonstrate value’ on a project that has not been properly planned. The temptation is to do ‘best endeavours’ with the data available.
  7. Sampling bias. This is the tendency to skew the sampling of data sets towards subgroups of the population.
    • An oft-quoted example here is the use of Twitter data to make broad inferences about the population. It turns out that the sample of Twitter users is biased towards certain locations, incomes and education levels.
  8. Modelling bias. This is the tendency to skew Data Science models by starting with a biased set of assumptions about the problem. This leads to selection of the wrong variables, the wrong data, the wrong algorithms and the wrong metrics.
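To make sampling bias concrete, here is a minimal Python sketch of one way you might check a sample against known population shares. The function names, the `region` field and all of the figures are hypothetical and invented for illustration; they are not taken from any of the referenced posts.

```python
from collections import Counter


def share_by_group(records, key):
    """Return each group's share of the records as a fraction of the total."""
    counts = Counter(record[key] for record in records)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}


def representation_gaps(sample_shares, population_shares):
    """Sample share minus population share for every group in the population.

    A positive gap means the group is over-represented in the sample,
    a negative gap means it is under-represented.
    """
    return {
        group: sample_shares.get(group, 0.0) - population_share
        for group, population_share in population_shares.items()
    }


# Hypothetical data: 8 urban and 2 rural records in the sample,
# against an assumed 50/50 split in the real population.
sample = [{"region": "urban"}] * 8 + [{"region": "rural"}] * 2
population_shares = {"urban": 0.5, "rural": 0.5}

gaps = representation_gaps(share_by_group(sample, "region"), population_shares)
for group, gap in sorted(gaps.items()):
    print(f"{group}: {gap:+.2f}")
```

In this made-up case the urban group is over-represented and the rural group under-represented by the same margin. A check like this only helps if you have a trustworthy picture of the population to compare against, which is itself an assumption worth recording.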

Reducing Bias

So what can you do to counter these biases in your work?

The first step is awareness and hopefully the above list will help you and your colleagues. If you know about bias, you can remain alert to it in your own work and that of others. Be critical and always challenge assumptions and designs.

The next best thing is to do what scientists do and make your work as reproducible and transparent as possible.

  • Track your data sources and profile your raw data as much as possible. Look at direct metrics from your data such as distributions and ranges. But also look at the qualitative information about the data. Where did it come from? How representative is it?
  • Make sure your data transformations and their influence on your populations can be clearly summarised. Are you filtering data? Why and so what? How are you calculating your variables and have you evaluated alternatives? Where is the evidence for your decision?
  • Track all your work products and data understanding as they evolve with the project. This allows you to look back at the exploration routes you discarded or didn’t have time to pursue.
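The points above can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the `profile` and `apply_filter` helpers, the audit log structure and the store sales records are all hypothetical, and a real project would likely use a proper data profiling tool and version control instead.

```python
import statistics


def profile(values):
    """Summarise a numeric column: count, range, mean and median."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
    }


def apply_filter(rows, predicate, reason, log):
    """Filter rows and record why, and how many rows were affected."""
    kept = [row for row in rows if predicate(row)]
    log.append(
        {"reason": reason, "rows_before": len(rows), "rows_after": len(kept)}
    )
    return kept


# Hypothetical raw data, invented for illustration only.
raw = [
    {"store": "A", "sales": 120},
    {"store": "A", "sales": 95},
    {"store": "B", "sales": 0},
    {"store": "B", "sales": 310},
    {"store": "C", "sales": 150},
]

audit_log = []
clean = apply_filter(
    raw, lambda row: row["sales"] > 0, "drop zero-sales rows", audit_log
)

# Profile before and after so the effect of the filter is visible.
print(profile([row["sales"] for row in raw]))
print(profile([row["sales"] for row in clean]))
print(audit_log)
```

Keeping the reason next to the row counts means that when somebody later asks ‘Are you filtering data? Why and so what?’, the answer is already written down.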

Conclusion

Bias is sometimes unavoidable because of funding, politics or resource constraints. However, that does not mean you can ignore it. Recognising the types of bias and understanding their impact on your conclusions will make you a better Data Scientist and improve the quality of your conclusions.

You can read more about how to do reproducible, testable Data Science that helps defend against bias in my book Guerrilla Analytics: A Practical Approach to Working with Data. Can you think of any other biases? Please get in touch!

References

  1. Data Scientist: Bias, Backlash and Brutal Self-Criticism, James Kobielus, 16 May 2013, http://www.ibmbigdatahub.com/blog/data-scientist-bias-backlash-and-brutal-self-criticism
  2. The Hidden Biases in Big Data, Kate Crawford, 1 April 2013, https://hbr.org/2013/04/the-hidden-biases-in-big-data
  3. 7 Common Biases That Skew Big Data Results, Lisa Morgan, 9 July 2015, http://www.informationweek.com/big-data/big-data-analytics/7-common-biases-that-skew-big-data-results/d/d-id/1321211
  4. Design of Experiments for the Tuning of Optimisation Algorithms, 2004, University of York, http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.332.9333&rep=rep1&type=pdf

Guerrilla Analytics: Tactics for Coping with Data Science Reality

Here are the slides from a talk I gave today to the Information Technology Department at the National University of Ireland, Galway. Thanks to Michael Madden for the opportunity to speak.

The talk was about how Guerrilla Analytics principles and practice tips help you do Data Science in circumstances that are very dynamic and constrained, yet still require traceability of what you do.

There were plenty of questions afterwards which is always encouraging. I’ll try to address these questions in subsequent blog posts so please do follow me @enda_ridge for all the latest posts.

Here are some of the questions from today.

  • what are the key skills to focus on if you want to work in data analytics / data science?
  • is programming ability a prerequisite for doing data science? This question came up before at Newcastle University.
  • do the guerrilla analytics principles map to research projects?
  • do the guerrilla analytics principles map to ‘big data’ projects?

Since NUI Galway is a bilingual university, you can find my broken Gaelic version below!

As Gaeilge

Seo h-iad na sleamhnáin ó léacht a bhí agam inniú sa Roinn Teicneolaíocht Fáisnéise in Ollscoil na h-Éireann, Gaillimh. Buíochas le Michael Madden as an deis labhairt.

Bhain an léacht le cén chaoi is féidir leis na  prionsabail agus noda Guerrilla Analytics cabhair leat agus tú ag déanamh Data Science i ndálaí atá dinimic, srianta ach fós tá sé riachtanach go bhfuil inrianaitheacht ann.

Bhí mórán ceisteanna tar éis an léacht agus is maith an rud é. Freagróidh mé iad i mblag eile agus bígí cinnte mé a leanacht ag @enda_ridge don scéal is déanaí.

Seo h-iad roinnt de na ceisteanna.

  • céard iad na scilleanna is tábhachtaí agus tú ag iarraidh obair mar data scientist?
  • an gá duit bheith in ann ríomhchlárú le h-aghaidh obair mar data scientist?
  • an bhfuil baint ann idir na prionsabail agus tionscadail taighde?
  • an bhfuil baint ann idir na prionsabail agus ‘Big Data’?

3 Lessons I Learned From Writing a Data Science Book – ‘Guerrilla Analytics – a practical approach to working with data’

One of the biggest challenges with writing a significant piece like a book chapter or entire book is to estimate how long it will take and plan accordingly. My best reference was my PhD which was still significantly shorter than the book’s target 90,000 words. This blog post is about the book writing process as I experienced it. I hope it helps other authors setting out on such an endeavour.

Since ‘Guerrilla Analytics: A Practical Approach to Working with Data’ is about operational aspects of agile data science, I recorded some data on the book writing process itself. Specifically, every time I finished a writing session, I recorded the number of words I’d written on that date.

My 3 Lessons

  • Progress tapers off. You’ll get more work done in the first half of your project. Don’t expect this rate of progress to be sustained all the way to your deadline.
  • Be realistic about how much you can write in a session. I found it difficult to write more than 1,500 words. Anything more was the exception for me. Track your progress and re-plan accordingly.
  • Weekends are better than weekdays. Obvious maybe! Expect to set aside your free time on weekends to get your project over the line. It is difficult to get significant amounts of work done on weekdays.

Progress tapers off

Here is my progress towards my goal of 90,000 words over an 8-month period. The plot shows the words written per session and the total word count.

[Figure: writing log progress]

I began writing in late September and finished in June the following year. The line shows my total words written and the bars show the number of words written in individual writing sessions. Two things stand out:

  • Progress is faster in the first half of the project. This is because it is easier to get all your ideas ‘onto paper’ early in the writing. Once you have about three quarters of your manuscript complete, you need to be more careful about consistency of language and flow of content. This slows you down.
  • Time off work is really productive. There are two clear bursts of productivity, shown by the dense groups of grey bars where a large number of words was written in many successive sessions. The two periods are Halloween (when I took a week off work) and Christmas (when I worked for a week from my family home).

How much did I write in a typical session?

Here’s how much I wrote in each writing session.

[Figure: words per session]

I typically wrote about 1,000 words with the odd session where I wrote over 3,000 words. This is important when you plan your project. If you’re anything like me, writing more than 1,000 words will be an exception. If you only write on weekends then you’re looking at only 2,000 words per week. That’s well under 100,000 words in a year allowing for holidays and other disruptions.
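If you want to rerun this back-of-envelope estimate with your own numbers, a small Python sketch might look like the following. The 1,000 words per session, two weekend sessions and 90,000 word target come from this post. The 46 writing weeks per year is purely my assumption for ‘holidays and other disruptions’, and the function names are hypothetical.

```python
def weekly_words(sessions_per_week, words_per_session):
    """Words written in a week at a steady rate per session."""
    return sessions_per_week * words_per_session


def weeks_to_target(target_words, sessions_per_week, words_per_session):
    """Whole weeks needed to reach the target, rounded up."""
    per_week = weekly_words(sessions_per_week, words_per_session)
    return -(-target_words // per_week)  # ceiling division on integers


WORDS_PER_SESSION = 1_000     # typical session, from the post
SESSIONS_PER_WEEK = 2         # weekend-only writing, from the post
TARGET_WORDS = 90_000         # the book's target, from the post
WRITING_WEEKS_PER_YEAR = 46   # assumption: 52 weeks minus 6 for disruptions

per_week = weekly_words(SESSIONS_PER_WEEK, WORDS_PER_SESSION)
per_year = per_week * WRITING_WEEKS_PER_YEAR
weeks_needed = weeks_to_target(TARGET_WORDS, SESSIONS_PER_WEEK, WORDS_PER_SESSION)

print(per_week)      # 2000
print(per_year)      # 92000
print(weeks_needed)  # 45
```

This assumes a constant rate, which the progress plot shows is optimistic because progress tapers off in the second half of a project. Treat the result as a lower bound on the time you need.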

Are you thinking about writing something and have questions? Feel free to get in touch and best of luck!

Guerrilla Analytics at the Business Intelligence Congress, Orlando December 2013

@edcurry and I recently presented on Guerrilla Analytics at the Business Intelligence Congress in Orlando, Florida. The slides are here. These are some early thoughts on Guerrilla Analytics, what it is and the principles involved.