Big Data Debate: The Controversial Questions at Google campus

I was recently invited to take part in the panel at the Big Data Debate (@bigdatadebate) at Google’s campus near Old Street in London [1].


It was a great opportunity to meet like-minded folks such as Christian Prokopp (@prokopp, Rangespan), Paul Bradshaw (@paulbradshaw), Duncan Ross (@duncan3ross, Teradata), Daniel Hulme (Satalia), Michael Cutler (@cotdp, TUMRA), Andy Piper (@andypiper, Pivotal) and Will Scott Moncrieff from DueDil. Overall it was an interesting debate, with good contributions from both the panel and the packed house.


We spent perhaps half of the panel hour and most of the audience questions on data privacy. I guess this is revealing in itself if such concerns, rather than the opportunities presented by data analytics, are at the forefront of the public’s mind.

Christian did put one controversial question to me. Paraphrasing, it was about the dangers that arise when we have the potential to mine vast quantities of data looking for patterns. My answer, as it has been since my PhD days, is that this is simply poor methodology *whatever* the volume of data you are analysing. A data science methodology should allow us to answer questions (test hypotheses) about a problem (as described by data) while reducing bias as far as possible. Think about that. If you go trawling for an effect that you expect to exist in the data, you will eventually find it. Instead, your approach should be:

  • understand the problem (talk to the business, formulate a scientific theory)
  • turn the problem into hypotheses (our campaign increased sales, a fraudulent user has a log pattern that is different from their peers, etc.)
  • decide what effect is practically significant
  • then apply an appropriate statistical test with the correct sample size and power, and check the test’s assumptions
  • if you don’t find what you were looking for, don’t keep changing your effect sizes and revisiting the data! That’s cherry picking, or confirmation bias.
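
To make the "correct sample size, and power" step concrete, here is a minimal sketch using only the Python standard library. It assumes a two-sided test comparing two proportions under the usual normal approximation. The function name and the numbers (a 10% baseline conversion rate, with 12% as the smallest lift worth caring about) are hypothetical and chosen purely for illustration.

```python
import math
from statistics import NormalDist


def sample_size_two_proportions(p_baseline, p_target, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided two-proportion test.

    Uses the standard normal-approximation formula; all inputs here are
    illustrative assumptions, not values from any real campaign.
    """
    if p_baseline == p_target:
        raise ValueError("the practically significant effect must be non-zero")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    variance = p_baseline * (1 - p_baseline) + p_target * (1 - p_target)
    effect = p_target - p_baseline
    return math.ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)


# Decide the effect size *before* looking at the data, then size the study.
n_per_group = sample_size_two_proportions(0.10, 0.12)
print(f"Users needed in each group: {n_per_group}")
```

The point of fixing the effect size up front is that the required sample size follows from it. Halving the effect you care about roughly quadruples the data you need, so the number cannot honestly be adjusted after the fact.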

So I suppose the answer to Christian’s question is ‘yes’, but it has nothing to do with ‘Big Data’ as such. If Big Data is dangerous, it is because new tools and hype can lead folks to forget that garbage in results in garbage out. You have to understand both the data and the analysis you are applying, and apply it rigorously, just like any scientist.
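
The trawling problem is easy to simulate. The sketch below is an illustration under stated assumptions rather than a real analysis. It generates pure noise for a hypothetical 200 candidate "patterns", compares two groups drawn from the same distribution for each, and counts how many comparisons come out "significant" at the 5% level. The function name, group sizes and seed are made up for the example.

```python
import random
from statistics import NormalDist, fmean


def count_spurious_findings(n_patterns=200, n_per_group=50, alpha=0.05, seed=0):
    """Count 'significant' differences found when no real effect exists.

    Both groups are drawn from the same N(0, 1) distribution, so every
    hit is a false positive. A z-test with known unit variance is used.
    """
    rng = random.Random(seed)
    std_normal = NormalDist()
    # Standard error of the difference of two means with sigma = 1.
    standard_error = (2 / n_per_group) ** 0.5
    hits = 0
    for _ in range(n_patterns):
        group_a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        group_b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        z = (fmean(group_a) - fmean(group_b)) / standard_error
        p_value = 2 * (1 - std_normal.cdf(abs(z)))  # two-sided p-value
        if p_value < alpha:
            hits += 1
    return hits


hits = count_spurious_findings()
print(f"'Significant' patterns found in pure noise: {hits} of 200")
```

With a 5% threshold you should expect around 5% of the comparisons to look like discoveries even though there is nothing there. The more patterns you trawl through, the more of these you collect, whatever the size of each data set.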


[1] I am employed by KPMG, one of the event sponsors.