I recently gave a webinar on Data Science Patterns. The slides are here.
Data Science Patterns, like Software Engineering Patterns, are ‘common solutions to recurring problems’. A few things inspired me to put this webinar together.
- I build Data Science teams. Time and again, you find teams working inconsistently in their data preparation approaches, structures and conventions. Patterns help resolve this problem. Without patterns, you end up with code maintenance challenges, difficulty supporting junior team members and all-round team inefficiency, because every data preparation job is approached ad hoc.
- I read a recent paper, ‘Tidy Data’ by Hadley Wickham, in the Journal of Statistical Software (http://vita.had.co.nz/papers/tidy-data.pdf). This paper gives an excellent, clear description of ‘tidy data’ – the data format expected by most Data Science algorithms and visualizations. While there isn’t anything new here if you have a computer science background, Wickham’s paper is an easy read with some really clear worked examples.
- My book, Guerrilla Analytics (here for USA or here for UK), has an entire appendix on data manipulation patterns and I wanted to share some of that thinking with the Data Science community.
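To make the ‘tidy data’ idea concrete, here is a minimal sketch in Python using pandas. It reworks the kind of ‘messy’ treatment table Wickham uses in the paper – one row per person, one column per treatment – into tidy form, where each column is a variable and each row is an observation. The dataset here is a hypothetical illustration, not data from the webinar.

```python
import pandas as pd

# Hypothetical "messy" table: one row per person, one column per treatment.
messy = pd.DataFrame({
    "person": ["John Smith", "Jane Doe", "Mary Johnson"],
    "treatment_a": [None, 16, 3],
    "treatment_b": [2, 11, 1],
})

# Tidy form: melt the treatment columns so each row is a single
# (person, treatment, result) observation.
tidy = messy.melt(id_vars="person", var_name="treatment", value_name="result")
print(tidy)
```

Once the data is in this shape, most grouping, modelling and plotting operations become one-liners, which is exactly why so many Data Science tools assume it.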
I hope you enjoy the webinar and find it useful. You can hear the recording here. Do get in touch with your thoughts and comments, as I think Data Science patterns are a huge area of potential improvement in the maturity of Data Science as a field.
2 thoughts on “Data Science Patterns: Preparing Data for Agile Data Science”
Interesting article. It appears to me that the above applies very nicely to Data Warehousing also, dealing with structured data.
It would be very useful to add unstructured data, which amounts to about 80% of all data. I would therefore suggest two crucial facets of tidying up data: 1) unlike structured data, unstructured data is never 100% accurate; hence a “level of confidence” has been a key factor in the unstructured world for decades. 2) a significant proportion of a data scientist’s time is spent collating data (both structured and unstructured) to ensure that the collated data is sufficient to provide insights.
A classic example is ANPR – Automatic Number Plate Recognition. One can never be 100% correct just by analysing the image/photograph of a car number plate. It is through subsequent cross-referencing with structured data (possibly from the DVLA) that a “level of confidence” is assigned to the extracted number.
I therefore use two distinct words: “explicit” is the data that describes the collated data – it exists somewhere; “implicit” is something we deduce/conclude from the cross-fertilization of explicit data – it does not exist as data, but it justifies the conclusion, the insight.
Thanks Rana. Interesting perspective.