How Do I Avoid Bias In My Data Science Work?

Comparing apples and bananas on a scale

The danger of bias hasn’t been given enough consideration in Data Science. Bias is anything that would cause us to skew our conclusions and not treat results and evidence objectively. It is sometimes unavoidable, sometimes accidental and unfortunately sometimes deliberate. While bias is well recognised as a danger in mainstream science, I think Data Science could benefit from improving in this area.

In this post I categorise the types of bias encountered in typical Data Science work. I have gathered these from recent blog posts [1], [2], [3] and a discussion in my PhD thesis [4]. I also show how to reduce bias using some of the principles you can learn about in Guerrilla Analytics: A Practical Approach to Working with Data.

8 Types of Bias in Data Science

The first step is to be aware of the types of bias you may encounter.

  1. Confirmation bias. People are less critical of Data Science that supports their prior beliefs rather than challenges their convictions.
    • This happens when results that go against the grain are rejected in favour of results that promote ‘business as usual’. Was the latest quarterly marketing campaign really successful across the board or just for one part of the division?
  2. Rescue bias. This bias involves selectively finding faults in an experiment that contradicts expectations. It is generally a deliberate attempt to evade and undermine evidence.
    • You may fall for this bias when your project results are disappointing. Perhaps your algorithm can’t classify well enough. Perhaps the data is too sparse. The Data Scientist tries to imply that results would have been different had the experiment been different. Doing this is effectively drawing conclusions without data and without experiments.
  3. ‘Time will tell’ bias. Taking time to gather more evidence should increase our confidence in a result. This bias affects the amount of such evidence that is deemed necessary to accept the results.
    • You may encounter this bias when a project is under pressure to plough ahead rather than waiting for more data and more confident Data Science. Should you draw conclusions based on one store or wait until you have more data from a wide variety of stores and several seasons?
  4. Orientation bias. This reflects a phenomenon of experimental and recording error being in the direction that supports the hypothesis.
    • You may encounter this bias when your work is needed to support a business decision that has already been made. This arises in the pharmaceuticals industry, for example, where trials favour the new pharmaceutical drugs.
  5. Cognitive bias: This is the tendency to make skewed decisions based on pre-existing factors rather than on the data and other hard evidence.
    • This might be encountered where the Data Scientist has to argue against a ‘hunch’ from ‘experience’ that is not supported by hard data.
  6. Selection bias: This is the tendency to skew your choice of data sources to those that may be most available, convenient and cost-effective for your purposes.
    • You will encounter this bias when you have to ‘demonstrate value’ on a project that has not been properly planned. The temptation is to do ‘best endeavors’ with the data available.
  7. Sampling bias: This is the tendency to skew the sampling of data sets toward subgroups of the population.
    • An oft-quoted example here is the use of Twitter data to make broad inferences about the population. It turns out that the Twitter users sample is biased towards certain locations, certain incomes and education levels etc.
  8. Modelling bias: This is the tendency to skew Data Science models by starting with a biased set of assumptions about the problem. This leads to selection of the wrong variables, the wrong data, the wrong algorithms and the wrong metrics.

Reducing Bias

So what can you do to counter these biases in your work?

The first step is awareness and hopefully the above list will help you and your colleagues. If you know about bias, you can remain alert to it in your own work and that of others. Be critical and always challenge assumptions and designs.

The next best thing is to do what scientists do and make your work as reproducible and transparent as possible.

  • Track your data sources and profile your raw data as much as possible. Look at direct metrics from your data such as distributions and ranges. But also look at the qualitative information about the data. Where did it come from? How representative is this?
  • Make sure your data transformations and their influence on your populations can be clearly summarised. Are you filtering data? Why and so what? How are you calculating your variables and have you evaluated alternatives? Where is the evidence for your decision?
  • Track all your work products and data understanding as they evolve with the project. This allows you to look back at the exploration routes you discarded or didn’t have time to pursue.

Conclusion

Bias is sometimes unavoidable because of funding, politics or resources constraints. However that does not mean you can ignore bias. Recognising the types of bias, and understanding their impact on your conclusions will make you a better Data Scientist and improve the quality of your conclusions.

You can read more about how to do reproducible, testable Data Science that helps defend against bias in my book Guerrilla Analytics: A Practical Approach to Working with Data. Can you think of any other biases? Please get in touch!

References

  1. Data Scientist: Bias, Backlash and Brutal Self-Criticism, James Kobielus, MAY 16, 2013, http://www.ibmbigdatahub.com/blog/data-scientist-bias-backlash-and-brutal-self-criticism
  2. The Hidden Biases in Big Data, Kate Crawford APRIL 01, 2013, https://hbr.org/2013/04/the-hidden-biases-in-big-data
  3. 7 Common Biases That Skew Big Data Results, 9th July 2015 Lisa Morgan, http://www.informationweek.com/big-data/big-data-analytics/7-common-biases-that-skew-big-data-results/d/d-id/1321211
  4. Design of Experiments for the Tuning of Optimisation Algorithms, 2004, University of York, http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.332.9333&rep=rep1&type=pdf

“You need an Algorithm, not a Data Scientist”. Um…not quite

formula-594149_1920

I recently read a Harvard Business Review (HBR) article [1] “You need an algorithm, not a Data Scientist”. Other articles present similar arguments [2] [3]. I disagree. Data Scientists and automation (data products, algorithms, production code, whatever) are complementary functions. What you actually need is a Data Scientist and then an algorithm.

Data Science supports automation

Good Data Science supports automation. It tells you:

  • what you didn’t already know about the data (profiles, errors, nuances, structure)
  • what an appropriate algorithm should be, given what you now know about the data
  • how your data should be prepared for that algorithm (removing correlations, scaling variables, deriving new variables)
  • what the measurable expectations of that algorithm should be when it is automated in production

Data Science and Automation are Complementary

The author (from an analytics vendor) makes the following points which I address below:

  • Companies are increasingly trying to do more analysis of their data to find value and are hiring people (data scientists) to do this work. This people-centric approach does not scale.
    • The point of Data Science is to be a service.  This service can quickly do agile experiments to quantify and investigate business hypotheses about data and help inform the roll out of products. Doing Data Science therefore informs the investment decision in software development, software purchase, software tuning, etc. It is never meant to scale up to replace automation.

  • Some patterns are too imperceptible to be captured by humans. The author gives the example of monitoring a slowly changing customer profile which would go unnoticed with a manual examination of the data. However algorithms can continuously monitor this data at scale and so are better.
    • This is partially true. Algorithms can certainly work day and night, quickly processing refreshed and streaming data better than any human could ever hope to. However, if the system being analysed is not well understood then appropriate analyses cannot be chosen and tuned before ‘switching on the fire hose’. It is this understanding, modelling, analysing and tuning that is the job of the Data Scientist in collaboration with the domain expert. The Data Scientist does this in part using statistical and machine learning algorithms.

  • Modern tools “require very little or no human intervention, zero integration time, and almost no need for service to re-tune the predictive model as dynamics change”.
    • The vast majority of time on a data project is spent understanding and cleaning the data. Be very sceptical of claims that automation software can simply be ‘turned on’ without the necessary understanding of the data and the problem domain. Data is just too varied.

The HBR article poses an interesting challenge. Are completely automated algorithms the future? Get in touch and let me know your thoughts.

Read more

You can read more about how to do agile Data Science that transfers from the ‘lab’ to the ‘production factory’ [4] in my book Guerrilla Analytics: A Practical Approach to Working with Data and get the latest news at http://guerrilla-analytics.net.

References

[1] You Need an Algorithm, not a Data Scientist, Harvard Business Review

[2] Data Science is Still White Hot, But Nothing Lasts Forever, Fortune

[3] Why You Don’t Need a Data Scientist, Ubiq

[4] To work with data, you need a lab and a factory. Redman, T.C., Sweeney, B., 2013. Harvard Business Review.

An executive’s guide to machine learning

Iron Man

Iron Man

McKinsey recently published at excellent guide to Machine Learning for Executives. In this post I categorise the key points that stood out from the perspective of establishing machine learning in an organisation. The key take away for me was that without leadership from the C Suite, machine learning will be limited to being a small part of existing operational processes.

What does it take to get started?

Strategy

  • C-level executives will make best use of machine learning if it is part of a strategic vision.
  • Not taking a strategic view of machine learning risks its being buried inside routine operations. While it may be a useful service, its long-term value will be limited to “cookie cutter” applications like retaining customers.
  • C Suite should make a commitment to:
    • investigate all feasible alternatives
    • pursue the strategy wholeheartedly at the C-suite level
    • acquire expertise and knowledge in the C-suite to guide the strategy.

People

  • Companies need two types of people to leverage machine learning.
    • “Quants” are technical experts in machine learning
    • “Translators” bridge the disciplines of data, machine learning, and decision making.

Data

  • Avoid departments hoarding information and politicising access to it.
  • A frequent concern for the C-suite when it embarks on the prediction stage is the quality of the data. That concern often paralyzes executives. Adding new data sources may be of marginal benefit compared with what can be done with existing warehouses and databases.

Quick Wins

  • Start small—look for low-hanging fruit to demonstrate successes. This will boost grassroots support and ultimately determine whether an organization can apply machine learning effectively.
  • Be tough on yourself. Evaluate machine learning results in the light of clearly identified criteria for success.

What does the future hold?

  • People will have to direct and guide the machine learning algorithms as they attempt to achieve the objectives they are given.
  • No matter what fresh insights machine learning unearths, only human managers can decide the essential questions regarding the company’s business problems.
  • Just as with people, algorithms will need to be regularly evaluated and refined by experienced experts with domain expertise.

You can read more in the original article here. You can also read a more general guide to building data science capability here.

Best Practices When Starting And Working On A Data Science Project

Best practice guidelines

Several interesting questions were asked recently on a Data Science social network. This post addresses the question

“What best practices do you recommend, when starting and working on enterprise analytics projects?”

I have worked as a Data Scientist for 8 years now. This was after completing a PhD on “Design of Experiments for Tuning Optimisation Algorithms”. So I have a formal background in rigorous experiment design for Data Science and have also managed some pretty complex and fast paced projects in sectors including Financial Services, IT, Insurance, Government and Audit.

This post summarises my thoughts on best practice that are heavily based on practical experience as described in my book “Guerrilla Analytics: A Practical Approach to Working with Data”. The book contains almost 100 best practice tips for doing Data Science in dynamic projects where reproducibillty, explainability and team efficiency are critical.

Here is a summary of the best practices for working on enterprise analytics projects.

  • Soft Skills Best Practice
    • Consult
      • Understand the business problem
      • Understand the stakeholders
      • Understand your STARS situation
    • Communicate
      • Explain what you are doing and why
      • Explain the caveats in interpreting what you are doing
      • Always focus on the business problem
      • Continuously validate the above
    • Budget and Plan
      • Clearly set out your approach, milestones and deliverables
      • Measure progress and adjust when going off track or moving in a new direction
  • Technical Skills Best Practice Using the 7 Guerrilla Analytics Principles
    • Operations
      1. Keep Everything (Principle 1: Space is cheap, confusion is expensive)
      2. Keep It Simple (Principle 2: Prefer simple, visual project structures and conventions)
      3. Automate (Principle 3: Prefer automation with program code )
      4. Maintain Data Provenance (Principle 4 Maintain a link between data on the file system, data in the analytics environment, and data in work products)
      5. Version Control Data and Code (Principle 5: Version control changes to data and analytics code)
      6. Consolidate (Principle 6: Consolidate team knowledge in version-controlled builds)
      7. Think like a developer (Principle 7: Prefer analytics code that runs from start to finish)
    • Testing
      • Test data with the 5 Cs of Guerrilla Analytics Data Quality
      • Test code. Take a risk based approach to small, medium and large tests to improve confidence in the correctness of data manipulations, data cleaning and the application of business rules.
      • Test models. Always reviews the standard tests that accompany a model or algorithm. Run models against new data to make sure they perform.

The rest of this post will look at these best practice guidelines in more detail.

Soft Skills Best Practice

Consult

startup-593341_1920

Whether your job title is ‘consultant’ or not, the fact is you are probably acting as as consultant to some degree. See Peter Block’s excellent book “Flawless Consulting: A Guide to Getting Your Expertise Used” on the Data Science Reading List for more information. Recognise that you are in a position where you need to influence stakeholders to use your insights and take action. That means you need to:

  • Understand the business problem. Ask questions, take notes, play back your understanding until you have the best understanding possible of what the real problem is, its drivers, its blockers and what a successful outcome looks like for all stakeholders. Too much data science fails because it produces the right answer to the wrong problem.
  • Understand the stakeholders. Who is asking you to solve this problem? Why? Who is sponsoring the project and what are their concerns, drivers, targets?
  • Understand your STARS situation. Are you in a situation that is a Start-up, Turnaround, Accelerated growth, Realignment, Sustaining success? Each of these requires a different approach from fast-action heroics at one end of the spectrum to carefully planned maintenance and improvement at the other. You can read more about these ideas in the excellent book “The First 90 Days: Critical Success Strategies for New Leaders at All Levels”.

Communicate

cooperate-437511_1920

For your work to be successful, you need to be able to communicate what it is you have done and why your audience should care. This applies regardless of the level you are operating at.

This may be disappointing for a budding Data Scientist but your most sophisticated and clever work will only be appreciated in the context of your ability to consult as above. A ‘decrease in Type 1 error’, a ‘better Gini’, a responsive beautiful visualization are only of value when cast in terms that address the business problem in ways your stakeholders care about and can understand. Your manager who trusts your abilities may not need to know the minutiae of your workings but rather that you have taken a sensible approach and that you are clear on its limitations and any inherent risks/caveats. When communicating,

  • be able to summarise your key insights on a page
  • keep the technical details to an appendix. Your objective is rarely to impress with technicalities. Instead it is to deliver insight that leads to action.
  • be able to visualize your insights with a story that engages your audience

Budget and Plan

notes-514998_1920

Data Science projects are notoriously difficult to budget and plan because they are typically of an exploratory nature. Many can run indefinitely when poorly managed teams exhaustively mine data sets. There are almost unlimited ways to cut data and present it. This does not let you off the hook however. Your stakeholders and time with business SMEs is limited (they have day jobs). Your own time is limited. When running your project,

  • set goals with timelines
  • measure and track progress and adjust when necessary
  • avoid the temptation to do something because it’s cool or fun before validating it with your stakeholders

Win-Vector gives a really excellent post on how to set expectations in Data Science projects.

Technical Best Practice with Guerrilla Analytics

code-820275_1920

Of course, Data Science is a technical discipline. There are 7 best practice guidelines called the Guerrilla Analytics Principles that will help keep everything running smoothly despite the very dynamic situations you are faced with. These principles apply across the entire analytics lifecycle from data extraction through to reporting.

Operations

Keep Everything (Principle 1: Space is cheap, confusion is expensive)

apples-346772_1920

Having a record of everything makes it easier when Data Science work needs to explain what was done, how understanding evolved, and why there may be errors or caveats around interpretation of results. Best practice tips include:

  • Keep all the data your receive, even older broken versions.
  • Keep all modifications to the data.
  • Keep all communications about the data (meetings, notes, dictionaries).
  • Keep all work products you create, even when they are superseded or replaced because they were wrong. This avoids confusion and conflicting results in what is a highly iterative environment.

Keep It Simple (Principle 2: Prefer simple, visual project structures and conventions)

Simple project and team conventions are easier to remember and therefore easier to follow. The more your team can look at a project structure and understand the purpose of the structure then the less time they waste and the lower the risk of inconsistencies. Some examples of best practice include:

  • Have one place for all data the team receives.
  • Take a consistent approach to loading data into the analytics environment.
  • Have one place for all work products the team produces.
  • Keep supporting materials (emails, documentation etc) near the data they support.

Automate (Principle 3: Prefer automation with program code)

workflow

It’s inevitable that your project will involve multiple code files, perhaps from multiple tools, that must be repeatedly run in the correct order by multiple team members. Some of my projects have had several hundred SQL files for preparing data. When you develop a machine learning or statistical model, it is typically an iterative process that involves deriving new variables, re-running algorithms and creating outputs and visualizations to test the model performance.

  • Write code that is modular so parts of an analysis can be re-run and tested
  • Write code that can be run from the command line to facilitate simple automation.
  • Use a script for automating your code or use something a little more sophisticated like a build tool.

Maintain Data Provenance (Principle 4 Maintain a link between data on the file system, data in the analytics environment, and data in work products)

factory-35108

Data Provenance simply means knowing where your data came from, how it was changed, who changed it, what was delivered and to whom it was delivered. If you can maintain this link from team inputs through to team outputs then your data science project will be much easier to manage, risks of inconsistency and incorrectness will be mitigated and team efficiency will be promoted. Some tips that help maintain data provenance include:

  • Give all received data its own unique identifier
  • Give all work products issued by the team their own unique identifier
  • Keep simple logs to record data receipt and work product creation

Version Control Data and Code (Principle 5: Version control changes to data and analytics code)

growing-109229_1920

With so much going on in the project and with the typically iterative nature of Data Science, it becomes imperative to use version control. You will need to go back to previous versions of code, branch a copy of current work to test a theory and undo mistakes. At a minimum you should:

  • version control data received
  • version control program code used to produce analytics
  • version control work products that are issued by the team

With this in place, any work product can be identified as say ‘version 3 of the work product, using version 3 of my code which draws on version 2 of the data received’.

Consolidate (Principle 6: Consolidate team knowledge in version-controlled builds)

It is inevitable that data cleaning rules, business rules and lessons learned will emerge over the course of a project. If each team member is applying these individually then the team is not performing efficiently and there is a risk of inconsistency. Take even a simple ranking operation. Do we mean a dense rank (1223) or something else (1224, 1334 etc)?

  • Identify the latest true version of data received and publish that centrally to the team
  • Identify common data manipulations, data cleaning and business rules. Implement them centrally and publish them to the team

Think like a developer (Principle 7: Prefer analytics code that runs from start to finish)

Data Science is not software development although both use program code. Data Science typically involves a lot of profiling of data to understand its properties. Models and rules are trialled to see if they perform well on the data. Cleaning rules are continuously discovered as understanding of the data grows.

This inevitably leads to many ‘code snippets’ that are necessary to developing an understanding of the data but not required for the final work product. These code snippets usually break over the course of the project, clog up work product code and confuse team members and reviewers when it comes to reproducing a result.

Writing analytics code that executes end-to-end eliminates these types of bugs and ‘code noise’. Team efficiency is improved, reproducibility of results is guaranteed and risk of data loss is eliminated since deleted data sets can easily be restored with a quick code execution.

Testing

software-762486_1920

So much of testing is overlooked in Data Science. Its importance lead me to devote 4 chapters to Data Science Testing in “Guerrilla Analytics: A Practical Approach to Working with Data”.

At a minimum, high performing teams are doing some amount of testing. The testing falls into three areas. But before discussing those, pay attention to the overarching testing best practices.

  • Take a risk-based approach. There is not time to test everything exhaustively. Make sure that the most critical aspects of the data science work are being tested first.
  • Test early and often. Do not be tempted to put off testing until later in your project. Many of the models you build and data transformations you code may have to change if flaws are discovered in the data or in code.
  • Automate testing. To facilitate testing often, it helps to have some form of automation of test scripts so that tests can be easily repeated and you immediately know when something has gone wrong.

Test Data with the 5 Cs of Data Quality

test-214185_1280

The excellent Bad Data Handbook lists 4 data test categories which I extend to 5 for “Guerrilla Analytics: A Practical Approach to Working with Data”. The 5 Cs of Guerrilla Analytics data quality are:

  1. Completeness: Do you actually have all the data you expect to have?
  2. Correctness: Does the data actually reflect the business rules and domain knowledge you expect it to reflect?
  3. Consistency: Are refreshes of the data consistent and is the data consistent when it is viewed over some time period?
  4. Coherence: Does the data “fit together” in terms of its expected relationships?
  5. aCcountability: Can you trace the data to tell where the data came from, who delivered it, where it is stored in the DME, and other information useful for its traceability?

Test Code

Testing of analytics code typically covers the data preparation and manipulation code that leads up to a visualization or application of an algorithm. Testing is a well established field in software engineering but not so in Data Science. “How Google Tests Software” is a great introduction and identifies 3 types of tests. In the context of data science, these are

  • Small: tests that wrap around small units of code, usually units that contain some particularly complex data manipulation or cleaning rules. Some example include application of regular expressions, calculations of running totals, identification of duplicates.
  • Medium: tests that wrap around a ‘component’ of multiple units. This might include the joining up and end-to-end cleaning of a particular data segment such as ‘customer’ or ‘products’.
  • Large: tests that wrap around the entire end-to-end project. For example, does the output of the machine learning model still contain the expected number of customers or where customers accidentally dropped somewhere between raw data and algorithm output?

In addition, some best practice tips for facilitating testing include:

  • Write modular code, typically having one code file per data step. This makes it easier to perform small and medium tests and makes it easier for several team members to work simultaneously in the code base without blocking one another.
  • Use a common structure for test code. All test code should have a setup, test and tear-down structure. This makes it easier to automate and debug tests.

Test Models

distribution-159626

When testing statistical models and machine learning outputs, things understandably a little more domain specific. At a minimum, best practice involves:

  • Running model tests. For example, tests of regression include heteroscedasticity, normality, correlation, and leverage.
  • Using cross-validation. Build and test models on a variety of partitions of the data to help minimise overfitting
  • Testing model predictions. See how the model performs on data it has not encountered before.

Summary

There is a lot to be mindful of in terms of best practices when doing Data Science. The key insight from “Guerrilla Analytics: A Practical Approach to Working with Data” is that 7 Guerrilla Analytics Principles can mitigate the operational risks inherent in Data Science projects. You can read more about these risks, the Guerrilla Analytics Principles and see almost 100 examples of their practical application in the book “Guerrilla Analytics: A Practical Approach to Working with Data”.