Big Data Debate: The Controversial Questions at Google campus

I was recently invited to take part in the panel at the Big Data Debate (@bigdatadebate) at Google’s campus near Old Street in London [1].


It was a great opportunity to meet like-minded folks such as Christian Prokopp (@prokopp, Rangespan), Paul Bradshaw (@paulbradshaw), Duncan Ross (@duncan3ross, Teradata), Daniel Hulme (Satalia), Michael Cutler (@cotdp, TUMRA), Andy Piper (@andypiper, Pivotal) and Will Scott Moncrieff (DueDil). Overall it was a good debate with some interesting contributions from the panel and the packed house.


We spent perhaps half of the panel hour and most of the audience questions on data privacy. I guess this is revealing in itself if such concerns, rather than the opportunities presented by data analytics, are at the forefront of the public’s mind.

Christian did put one controversial question to me. Paraphrasing, it was about the dangers that arise when we have the potential to mine vast quantities of data looking for patterns. My answer, as it has been since my PhD days, is that this is simply poor methodology *whatever* the volume of data you are analysing. A data science methodology should allow us to answer questions (test hypotheses) about a problem (as described by data) while reducing bias as far as possible. Think about that. If you go trawling for an effect that you expect to exist in data, you will eventually find it. Instead, your approach should be:

  • understand the problem (talk to the business, formulate a scientific theory)
  • turn the problem into hypotheses (our campaign increased sales, a fraudulent user has a log pattern that is different from that of his peers, etc.)
  • decide what effect is practically significant
  • apply an appropriate statistical test with the correct sample size and power, and check the test’s assumptions
  • if you don’t find what you were looking for, don’t keep changing your effect sizes and revisiting the data! That’s cherry picking, or confirmation bias.
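
To make the trawling point and the sample size step a little more concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions, not a prescribed method: the function names are made up for this example, the tests are assumed to be independent, and the sample size uses the standard normal approximation for a two-sided comparison of two group means with a standardised effect size (Cohen's d). A proper power analysis for your chosen test may give slightly different numbers.

```python
import math
from statistics import NormalDist


def familywise_error_rate(num_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive when running `num_tests`
    independent tests on data that contains no real effect.
    (Hypothetical helper; independence of the tests is an assumption.)"""
    if num_tests < 0:
        raise ValueError("num_tests must be non-negative")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be between 0 and 1")
    return 1.0 - (1.0 - alpha) ** num_tests


def sample_size_per_group(effect_size: float,
                          alpha: float = 0.05,
                          power: float = 0.80) -> int:
    """Approximate observations needed in each of two groups to detect a
    standardised effect (Cohen's d) with a two-sided test.
    (Hypothetical helper; uses the normal approximation, not the exact
    t-distribution calculation.)"""
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    if not 0.0 < alpha < 1.0 or not 0.0 < power < 1.0:
        raise ValueError("alpha and power must be between 0 and 1")
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1.0 - alpha / 2.0)  # significance threshold
    z_power = standard_normal.inv_cdf(power)              # desired power
    return math.ceil(2.0 * ((z_alpha + z_power) / effect_size) ** 2)


if __name__ == "__main__":
    # Trawling: 20 looks at pure noise at the 5% level.
    print(f"{familywise_error_rate(20):.2f}")  # prints 0.64

    # Decide the practically significant effect first, then size the study.
    print(sample_size_per_group(0.5))  # prints 63
    print(sample_size_per_group(0.2))  # prints 393
```

The first number is the trawling problem in miniature: twenty looks at noise give roughly a two in three chance of "finding" something. The second shows why the practically significant effect has to be fixed up front, since shrinking it after the fact changes the sample size you would have needed.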

So I suppose the answer to Christian’s question is ‘yes’, but it has nothing to do with ‘Big Data’ as such. Big Data is dangerous because new tools and hype can lead folks to forget that garbage in results in garbage out. You have to understand the data and the rigorous analysis you are applying – just like any scientist.

Here are some recent good reads:

[1] I am employed by KPMG, one of the event sponsors.

Guerrilla Analytics – the book! Book contract signed for Autumn 2014

Great news! I will be publishing a book on Guerrilla Analytics with Morgan Kaufmann in Autumn 2014. After lots of proposal crafting and contract negotiation, the contracts have finally been signed and I can begin work. It will be about 90,000 words on Guerrilla Analytics, covering topics such as:

  • what is data analytics and where does guerrilla analytics fit within that?
  • the principles of guerrilla analytics
  • worked examples at each stage of the data analytics workflow, from data extraction and receipt through to delivery of work products. All of these examples will be supported by practice tips, case studies and war stories. This will be a real practitioner’s book that will help you survive real analytics projects in fast-paced, dynamic environments

You’ll find this book useful if you are:

  • a Senior Manager and you want to know that you have the right team and technology in place to deliver reproducible, tested analytics that stand up to audit and scrutiny and can be handed over easily when resources roll off your project
  • an analytics Manager who has several reports. You want your team to be independent and agile without having to micromanage their work. You want to keep it simple so that everybody on the team can maintain data provenance and understand one another’s work without repeated, inefficient hand-overs and explanations
  • a data analyst who wants to do high quality work and interact in a team without being burdened with unnecessary process and team rules.

I’m looking forward to getting started! Stay tuned for more updates and some snippets of the book as it evolves.

Guerrilla Analytics talk at Enterprise Data World, San Diego 2013

@edwardacurry and I gave a talk at Enterprise Data World 2013 in sunny San Diego. The slides are below. In this longer talk we were able to take the audience through some worked examples to illustrate how Guerrilla Analytics is applied in practice. Feedback was positive. There was plenty of empathy from audience members whose teams are struggling with the challenges that Guerrilla Analytics addresses.

Speaker Spotlight at Enterprise Data World 2013

My interview in preparation for Enterprise Data World 2013 has just been published. The interview is pretty succinct – some opinions on recent trends and the influence of Big Data on the work I do. Looking forward to San Diego and Enterprise Data World!

Guerrilla Analytics at the Business Intelligence Congress, Orlando December 2013

@edcurry and I recently presented on Guerrilla Analytics at the Business Intelligence Congress in Orlando, Florida. The slides are here. These are some early thoughts on Guerrilla Analytics, what it is and the principles involved.