A Definition of Data Science for Business

Vague mixes of skill sets. A focus on activities and technology. Bizarre Venn diagrams. There is huge confusion over what Data Science is. Is it Big Data? Isn't it just statistics? Is it something else entirely? This confusion leads to vendor and recruiter hype. It leads to inflated career expectations. It leads to the rebranding of solid, established and much-needed fields like Analytics, Business Intelligence and Statistics. The secret to defining Data Science is to focus on the science.

Data Science – A Definition And How To Get Started

Confusion, hype, failure to start. Data Science has huge potential to change an organisation. But many organisations become mired in the associated cultural, technological and people changes. Data Science is delivered as an interesting report rather than a driver of change. Data Science identifies algorithms that run in the safety of a lab but never make it into production.

This is my keynote talk from the Polish Data Science with Business Conference.

Irish Language Data Science lecture at Engineers Ireland

I gave the following lecture to Engineers Ireland, the Irish professional body for engineers. The lecture is about "Data Science and the benefits for engineering" and is entirely in Irish.

It was an interesting exercise to brush up on my Gaelic and also to see the wealth of resources that now exist for using Irish with modern technical vocabulary. If you are curious or are trying to get your Gaelic up to scratch then please get in touch!

I may be the first person to coin the term "Data Science" in Gaeilge!

In terms of content, it covers what data and data science look like and how traditional engineering problems might benefit from the application of data science.

The full video is linked below.

And here are the slides: 2016-04 Engineering Ireland_04_Gaeilge.

Data Science Patterns: Preparing Data for Agile Data Science

Are you a data scientist working on a project with constantly changing requirements, flawed and changing data, and other disruptions? Guerrilla Analytics can help.

The key to a high-performing Guerrilla Analytics team is its ability to recognise common data preparation patterns and quickly implement them in flexible, defensive data sets.

After this webinar, you’ll be able to get your team off the ground fast and begin demonstrating value to your stakeholders.

You will learn about:
* Guerrilla Analytics: a brief introduction to what it is and why you need it for your agile data science ambitions
* Data Science Patterns: what they are and how they enable agile data science
* Case study: a walk-through of some common patterns in use in real projects

I recently gave a webinar on Data Science Patterns. The slides are here.

Data Science Patterns, like Software Engineering Patterns, are 'common solutions to recurring problems'. I was inspired to put this webinar together by a few things:

  • I build Data Science teams. Repeatedly, you find teams working inconsistently in the data preparation approaches, structures and conventions they use. Patterns help resolve this problem. Without them, a completely ad-hoc approach to data preparation leads to code maintenance challenges, difficulty in supporting junior team members and all-round team inefficiency.
  • I read a recent paper, 'Tidy Data' by Hadley Wickham, in the Journal of Statistical Software (http://vita.had.co.nz/papers/tidy-data.pdf). This paper gives an excellent, clear description of 'tidy data' – the data format used by most Data Science algorithms and visualizations. While there isn't anything new here if you have a computer science background, Wickham's paper is an easy read and has some really clear worked examples (see the sketch after this list).
  • My book, Guerrilla Analytics (here for USA or here for UK), has an entire appendix on data manipulation patterns and I wanted to share some of that thinking with the Data Science community.
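
To make 'tidy data' concrete, here is a minimal sketch in Python using pandas. The product and sales columns are invented for illustration; Wickham's own examples use R.

```python
import pandas as pd

# A 'messy' table: one column per quarter. The quarter is really a
# variable value, not a variable name.
messy = pd.DataFrame({
    "product": ["widget", "gadget"],
    "q1_sales": [100, 150],
    "q2_sales": [110, 140],
})

# Tidy form: each variable is a column, each observation is a row.
tidy = messy.melt(id_vars="product", var_name="quarter", value_name="sales")
tidy["quarter"] = tidy["quarter"].str.replace("_sales", "", regex=False)

print(tidy)
#   product quarter  sales
# 0  widget      q1    100
# 1  gadget      q1    150
# 2  widget      q2    110
# 3  gadget      q2    140
```

Most modelling and plotting libraries expect exactly this one-row-per-observation layout, which is why the melt step recurs so often in data preparation.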

I hope you enjoy the webinar and find it useful. You can hear the recording here. Do get in touch with your thoughts and comments, as I think Data Science patterns are a huge area of potential improvement in the maturity of Data Science as a field.

How Do I Avoid Bias In My Data Science Work?

The danger of bias hasn’t been given enough consideration in Data Science. Bias is anything that would cause us to skew our conclusions and not treat results and evidence objectively. It is sometimes unavoidable, sometimes accidental and unfortunately sometimes deliberate. While bias is well recognised as a danger in mainstream science, I think Data Science could benefit from improving in this area.

In this post I categorise the types of bias encountered in typical Data Science work. I have gathered these from recent blog posts [1], [2], [3] and a discussion in my PhD thesis [4]. I also show how to reduce bias using some of the principles you can learn about in Guerrilla Analytics: A Practical Approach to Working with Data.

8 Types of Bias in Data Science

The first step is to be aware of the types of bias you may encounter.

  1. Confirmation bias. People are less critical of Data Science that supports their prior beliefs than of Data Science that challenges them.
    • This happens when results that go against the grain are rejected in favour of results that promote ‘business as usual’. Was the latest quarterly marketing campaign really successful across the board or just for one part of the division?
  2. Rescue bias. This bias involves selectively finding faults in an experiment that contradicts expectations. It is generally a deliberate attempt to evade and undermine evidence.
    • You may fall for this bias when your project results are disappointing. Perhaps your algorithm can’t classify well enough. Perhaps the data is too sparse. The Data Scientist tries to imply that results would have been different had the experiment been different. Doing this is effectively drawing conclusions without data and without experiments.
  3. ‘Time will tell’ bias. Taking time to gather more evidence should increase our confidence in a result. This bias affects the amount of such evidence that is deemed necessary to accept the results.
    • You may encounter this bias when a project is under pressure to plough ahead rather than waiting for more data and more confident Data Science. Should you draw conclusions based on one store or wait until you have more data from a wide variety of stores and several seasons?
  4. Orientation bias. This is the phenomenon of experimental and recording errors tending to fall in the direction that supports the hypothesis.
    • You may encounter this bias when your work is needed to support a business decision that has already been made. This arises in the pharmaceutical industry, for example, where trial results can favour the new drugs under test.
  5. Cognitive bias: This is the tendency to make skewed decisions based on pre-existing factors rather than on the data and other hard evidence.
    • This might be encountered where the Data Scientist has to argue against a ‘hunch’ from ‘experience’ that is not supported by hard data.
  6. Selection bias: This is the tendency to skew your choice of data sources to those that may be most available, convenient and cost-effective for your purposes.
    • You will encounter this bias when you have to 'demonstrate value' on a project that has not been properly planned. The temptation is to do 'best endeavours' with whatever data is available.
  7. Sampling bias: This is the tendency to skew the sampling of data sets toward subgroups of the population.
    • An oft-quoted example here is the use of Twitter data to make broad inferences about the general population. It turns out that the Twitter user sample is biased towards certain locations, income brackets and education levels (a small simulation of this effect follows the list).
  8. Modelling bias: This is the tendency to skew Data Science models by starting with a biased set of assumptions about the problem. This leads to selection of the wrong variables, the wrong data, the wrong algorithms and the wrong metrics.
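
To see how sampling bias distorts an estimate, here is a minimal simulation in Python. The population and the biased sampling rule are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical population of incomes (log-normal, i.e. right-skewed).
population = rng.lognormal(mean=10.5, sigma=0.6, size=100_000)

# Unbiased sample: every member equally likely to be observed.
unbiased = rng.choice(population, size=2_000)

# Biased sample: richer members are more likely to be observed,
# e.g. a platform whose users skew towards higher incomes.
weights = population / population.sum()
biased = rng.choice(population, size=2_000, p=weights)

print(f"population mean:      {population.mean():,.0f}")
print(f"unbiased sample mean: {unbiased.mean():,.0f}")
print(f"biased sample mean:   {biased.mean():,.0f}")  # noticeably higher
```

Note that collecting more data does not help here: however large the biased sample grows, its estimate stays skewed.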

Reducing Bias

So what can you do to counter these biases in your work?

The first step is awareness and hopefully the above list will help you and your colleagues. If you know about bias, you can remain alert to it in your own work and that of others. Be critical and always challenge assumptions and designs.

The next best thing is to do what scientists do and make your work as reproducible and transparent as possible.

  • Track your data sources and profile your raw data as much as possible. Look at direct metrics such as distributions and ranges, but also look at the qualitative information about the data. Where did it come from? How representative is it? (A minimal profiling sketch follows this list.)
  • Make sure your data transformations and their influence on your populations can be clearly summarised. Are you filtering data? Why and so what? How are you calculating your variables and have you evaluated alternatives? Where is the evidence for your decision?
  • Track all your work products and data understanding as they evolve with the project. This allows you to look back at the exploration routes you discarded or didn’t have time to pursue.
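
As a starting point for the profiling step above, here is a minimal sketch in Python using pandas. The file name and columns are hypothetical:

```python
import pandas as pd

# Hypothetical raw extract; the file name and columns are invented.
raw = pd.read_csv("customer_extract.csv")

# Direct metrics: shape, types, missing values, distributions and ranges.
print(raw.shape)
print(raw.dtypes)
print(raw.isna().sum())   # missing values per column
print(raw.describe())     # count, mean, std, min/max, quartiles

# Category frequencies: does the sample look representative of the
# population you want to draw conclusions about?
for col in raw.select_dtypes(include="object").columns:
    print(raw[col].value_counts(normalize=True).head())
```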

Conclusion

Bias is sometimes unavoidable because of funding, politics or resource constraints. However, that does not mean you can ignore it. Recognising the types of bias and understanding their impact will make you a better Data Scientist and improve the quality of your conclusions.

You can read more about how to do reproducible, testable Data Science that helps defend against bias in my book Guerrilla Analytics: A Practical Approach to Working with Data. Can you think of any other biases? Please get in touch!

References

  1. James Kobielus, "Data Scientist: Bias, Backlash and Brutal Self-Criticism", IBM Big Data Hub, May 16, 2013, http://www.ibmbigdatahub.com/blog/data-scientist-bias-backlash-and-brutal-self-criticism
  2. Kate Crawford, "The Hidden Biases in Big Data", Harvard Business Review, April 1, 2013, https://hbr.org/2013/04/the-hidden-biases-in-big-data
  3. Lisa Morgan, "7 Common Biases That Skew Big Data Results", InformationWeek, July 9, 2015, http://www.informationweek.com/big-data/big-data-analytics/7-common-biases-that-skew-big-data-results/d/d-id/1321211
  4. "Design of Experiments for the Tuning of Optimisation Algorithms", PhD thesis, University of York, 2004, http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.332.9333&rep=rep1&type=pdf

“You need an Algorithm, not a Data Scientist”. Um…not quite

I recently read a Harvard Business Review (HBR) article [1] “You need an algorithm, not a Data Scientist”. Other articles present similar arguments [2] [3]. I disagree. Data Scientists and automation (data products, algorithms, production code, whatever) are complementary functions. What you actually need is a Data Scientist and then an algorithm.

Data Science supports automation

Good Data Science supports automation. It tells you:

  • what you didn’t already know about the data (profiles, errors, nuances, structure)
  • what an appropriate algorithm should be, given what you now know about the data
  • how your data should be prepared for that algorithm (removing correlations, scaling variables, deriving new variables; see the sketch after this list)
  • what the measurable expectations of that algorithm should be when it is automated in production
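
As an illustration of that preparation step, here is a minimal sketch in Python using scikit-learn. The feature matrix, labels and pipeline choices are invented for illustration, not a prescription:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# Hypothetical feature matrix with one deliberately correlated column.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
X[:, 1] = X[:, 0] + rng.normal(scale=0.1, size=500)  # correlated with column 0
y = (X[:, 0] + X[:, 2] > 0).astype(int)              # hypothetical labels

# Scale variables, remove correlations, then fit the chosen algorithm.
model = Pipeline([
    ("scale", StandardScaler()),           # put variables on a common scale
    ("decorrelate", PCA(n_components=5)),  # rotate away correlated directions
    ("classify", LogisticRegression()),
])
model.fit(X, y)
print(f"training accuracy: {model.score(X, y):.2f}")
```

Wrapping the preparation steps in a pipeline like this is one way to carry exactly the same transformations from the lab into production.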

Data Science and Automation are Complementary

The author (from an analytics vendor) makes the following points which I address below:

  • Companies are increasingly trying to do more analysis of their data to find value and are hiring people (data scientists) to do this work. This people-centric approach does not scale.
    • The point of Data Science is to be a service. This service can quickly run agile experiments to quantify and investigate business hypotheses about data and help inform the roll-out of products. Doing Data Science therefore informs the investment decision in software development, purchase and tuning. It was never meant to scale up to replace automation.

  • Some patterns are too imperceptible to be captured by humans. The author gives the example of monitoring a slowly changing customer profile which would go unnoticed with a manual examination of the data. However algorithms can continuously monitor this data at scale and so are better.
    • This is partially true. Algorithms can certainly work day and night, quickly processing refreshed and streaming data better than any human could ever hope to. However, if the system being analysed is not well understood then appropriate analyses cannot be chosen and tuned before ‘switching on the fire hose’. It is this understanding, modelling, analysing and tuning that is the job of the Data Scientist in collaboration with the domain expert. The Data Scientist does this in part using statistical and machine learning algorithms.

  • Modern tools “require very little or no human intervention, zero integration time, and almost no need for service to re-tune the predictive model as dynamics change”.
    • The vast majority of time on a data project is spent understanding and cleaning the data. Be very sceptical of claims that automation software can simply be ‘turned on’ without the necessary understanding of the data and the problem domain. Data is just too varied.

The HBR article poses an interesting challenge. Are completely automated algorithms the future? Get in touch and let me know your thoughts.

Read more

You can read more about how to do agile Data Science that transfers from the ‘lab’ to the ‘production factory’ [4] in my book Guerrilla Analytics: A Practical Approach to Working with Data and get the latest news at http://guerrilla-analytics.net.

References

[1] "You Need an Algorithm, not a Data Scientist", Harvard Business Review.

[2] "Data Science is Still White Hot, But Nothing Lasts Forever", Fortune.

[3] "Why You Don't Need a Data Scientist", Ubiq.

[4] Redman, T.C., Sweeney, B., "To work with data, you need a lab and a factory", Harvard Business Review, 2013.

An executive’s guide to machine learning

McKinsey recently published an excellent guide to machine learning for executives. In this post I categorise the key points that stood out from the perspective of establishing machine learning in an organisation. The key takeaway for me was that without leadership from the C-suite, machine learning will be limited to being a small part of existing operational processes.

What does it take to get started?

Strategy

  • C-level executives will make best use of machine learning if it is part of a strategic vision.
  • Not taking a strategic view of machine learning risks its being buried inside routine operations. While it may be a useful service, its long-term value will be limited to “cookie cutter” applications like retaining customers.
  • The C-suite should make a commitment to:
    • investigate all feasible alternatives
    • pursue the strategy wholeheartedly at the C-suite level
    • acquire expertise and knowledge in the C-suite to guide the strategy.

People

  • Companies need two types of people to leverage machine learning.
    • “Quants” are technical experts in machine learning
    • “Translators” bridge the disciplines of data, machine learning, and decision making.

Data

  • Avoid departments hoarding information and politicising access to it.
  • A frequent concern for the C-suite when it embarks on the prediction stage is the quality of the data. That concern often paralyzes executives. Adding new data sources may be of marginal benefit compared with what can be done with existing warehouses and databases.

Quick Wins

  • Start small: look for low-hanging fruit to demonstrate successes. This will boost grassroots support and ultimately determine whether an organisation can apply machine learning effectively.
  • Be tough on yourself. Evaluate machine learning results in the light of clearly identified criteria for success.

What does the future hold?

  • People will have to direct and guide machine learning algorithms as those algorithms attempt to achieve the objectives they are given.
  • No matter what fresh insights machine learning unearths, only human managers can decide the essential questions regarding the company’s business problems.
  • Just as with people, algorithms will need to be regularly evaluated and refined by experienced experts with domain expertise.

You can read more in the original article here. You can also read a more general guide to building data science capability here.