I may be the first person to coin the term 'Data Science' in Gaeilge!
I gave the following lecture to Engineers Ireland, the Irish professional body for engineers. The lecture, "Data Science and the benefits for engineering", is entirely in Irish.
It was an interesting exercise to brush up on my Gaelic and also to see the wealth of resources that now exist for using Irish with modern technical vocabulary. If you are curious or are trying to get your Gaelic up to scratch then please get in touch!
In terms of content, it covers what data and data science look like and how traditional engineering problems might benefit from the application of data science.
My talk covered how the 7 Guerrilla Analytics Principles are the foundation for doing Agile Data Science. With a Data Science Operating Model that follows these principles, your team always knows where its data came from, who changed it and why, and can explain any of the highly iterative explorations and analyses its customers require.
You can find the slides below and at Slideshare. As always, feedback and questions are welcome. Enjoy!
Data Science Patterns, as with Software Engineering Patterns, are ‘common solutions to recurring problems’. I was inspired to put this webinar together based on a few things.
I build Data Science teams. Repeatedly, you find teams working inconsistently in terms of the data preparation approaches, structures and conventions they use. Patterns help resolve this problem. Without patterns, you end up with code maintenance challenges, difficulty in supporting junior team members, and all-round team inefficiency due to a completely ad-hoc approach to data preparation.
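As a hedged illustration of what such a pattern looks like (the pattern name and the sample data here are my own, not taken from the book): "latest record per key" is a recurring data preparation problem that, once written down as a pattern, every team member can solve the same way.

```python
import pandas as pd

# Example source data: multiple versions of each customer record.
records = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "updated_at": pd.to_datetime(
        ["2024-01-01", "2024-03-01", "2024-01-15", "2024-02-01", "2024-04-01"]),
    "status": ["new", "active", "new", "active", "closed"],
})

# Pattern: sort by the version column, then keep the last row per key.
# Every deduplication in the team's code base follows this same two-step shape.
latest = (records.sort_values("updated_at")
                 .drop_duplicates("customer_id", keep="last"))
print(latest)
```

The value is not in the two lines of pandas but in the convention: when the whole team deduplicates the same way, reviewing and supporting each other's code becomes far easier.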
I read a recent paper, 'Tidy Data' by Hadley Wickham, in the Journal of Statistical Software (http://vita.had.co.nz/papers/tidy-data.pdf). The paper gives an excellent, clear description of 'tidy data' – the data format used by most Data Science algorithms and visualizations. While there isn't anything new here if you have a computer science background, Wickham's paper is an easy read with some really clear worked examples.
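As a brief sketch of the idea (the column names here are my own, in the spirit of the paper's examples rather than copied from it): a 'messy' table with one column per treatment can be melted into tidy form, where each row is a single observation.

```python
import pandas as pd

# Messy layout: one column per treatment, values spread across columns.
messy = pd.DataFrame({
    "person": ["John Smith", "Jane Doe"],
    "treatment_a": [None, 16],
    "treatment_b": [2, 11],
})

# Tidy layout: one row per observation (person, treatment, result).
tidy = messy.melt(id_vars="person", var_name="treatment", value_name="result")
print(tidy)
```

This tidy shape is the one most modelling and plotting libraries expect, which is why the paper's distinction matters in day-to-day data preparation.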
My book, Guerrilla Analytics (here for USA or here for UK), has an entire appendix on data manipulation patterns and I wanted to share some of that thinking with the Data Science community.
I hope you enjoy the webinar and find it useful. You can hear the recording here. Do get in touch with your thoughts and comments as I think Data Science patterns is a huge area of potential improvement in the maturity of Data Science as a field.
The lecture covered how the Guerrilla Analytics principles and tips can help you when doing Data Science in conditions that are dynamic and constrained, but where traceability is still essential.
There were plenty of questions after the lecture, which is a good thing. I will answer them in another blog post, so be sure to follow me at @enda_ridge for the latest.
Here are some of the questions.
What are the most important skills if you want to work as a data scientist?
Do you need to be able to program to work as a data scientist?
Are the principles relevant to research projects?
Are the principles relevant to 'Big Data'?
I recently had the opportunity to present a webinar on ‘Building Guerrilla Analytics Teams’ as part of the BrightTalk ‘Business Intelligence and Analytics’ series. You can access the full recorded webinar and slides here and the slides are embedded below.
Some really interesting questions came up at the end of the session. I’ve listed them here and will pick them up in subsequent blog posts.
How do you build a business case to resource and set up a data science team?
What is the number one tip for someone putting together a completely new data science team?
What role is most important when setting up a data science team?
What are the typical challenges faced when setting up a Guerrilla Analytics team?
I was recently invited to give a talk introducing Guerrilla Analytics and the principles described in the book. The talk covers some examples of how these principles are applied. It concludes by identifying some key research and development areas for doing this type of analytics in real-world projects.
This was a great opportunity to engage with a cross-disciplinary audience including computer scientists, computational biologists and engineers and to have a sounding board for some of the key research and development areas I think need to be addressed to enable practical data science work.
A key take-away for me was the gap between the advanced data science being studied in academia and the simple, practical methodologies needed to implement that research in real-world projects – methodologies that are still lacking.
We spent perhaps half of the panel hour, and most of the audience questions, on data privacy. That is revealing in itself: such concerns are at the forefront of the public's mind, rather than the opportunities presented by data analytics.
Christian did pose one controversial question to me. Paraphrasing, it was about the dangers that arise when we have the potential to mine vast quantities of data looking for patterns. My answer, as it has been since my PhD days, is that this is simply poor methodology *whatever* the volume of data you are analysing. A data science methodology should allow us to answer questions (test hypotheses) about a problem (as described by data) while reducing bias as far as possible. Think about that. If you go trawling for an effect that you expect to exist in data, you will eventually find it. Instead, your approach should be:
understand the problem (talk to the business, formulate a scientific theory)
turn the problem into hypotheses (our campaign increased sales, a fraudulent user has a log pattern different from that of their peers, etc.)
decide what effect is practically significant
then, and only then, apply an appropriate statistical test with the correct sample size and power, and check the test's assumptions
when you don’t find what you were looking for, you can’t keep changing your effect sizes and revisiting the data! That’s cherry-picking, or confirmation bias.
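The steps above can be sketched in a few lines of Python. The effect size, significance level and power values here are illustrative assumptions of mine, not from the talk: the point is that they are fixed *before* the data is examined, the required sample size follows from them, and the test is then run once.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.power import TTestIndPower

# 1. Decide up front what effect is practically significant (Cohen's d),
#    the false-positive rate we will tolerate, and the power we want.
effect_size, alpha, power = 0.5, 0.05, 0.8

# 2. Compute the sample size needed per group BEFORE looking at the data.
analysis = TTestIndPower()
n_per_group = int(np.ceil(analysis.solve_power(
    effect_size=effect_size, alpha=alpha, power=power)))

# 3. Collect that many samples and apply the pre-registered test once.
#    (Simulated data here stands in for, e.g., sales with and without a campaign.)
rng = np.random.default_rng(42)
control = rng.normal(0.0, 1.0, n_per_group)
treated = rng.normal(0.5, 1.0, n_per_group)
t_stat, p_value = stats.ttest_ind(treated, control)
print(n_per_group, p_value)
```

If the test comes back non-significant, the disciplined response is to stop, not to shrink the effect size or re-run variations of the test until something appears.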
So I suppose the answer to Christian’s question is a ‘yes’, but it has nothing to do with ‘Big Data’. Big Data is dangerous because new tools and hype can lead people to forget that garbage in results in garbage out. You have to understand the data and the rigorous analysis you are applying – just like any scientist.