Programming language version 3.2. SQL, NoSQL, NewSQL. It seems that too often, the path to becoming a Data Scientist involves skills in vogue rather than more permanent competencies. In a fast-paced field like Data Science, skills are more tangible. They can be directly tested. They can be tied to the latest technology or the latest language version.
Competencies are different. Competencies are a more general combination of skills, behaviours and knowledge.
You can have great PowerPoint skills, creating beautiful slides, but still be a terrible communicator. You can be skilled in Python syntax but still be a poor programmer. Communication and programming are competencies, and it is competencies that matter most when you build a Data Science career that is robust to changing trends in skills like languages and technology platforms.
Here are some key competencies and example skills for successful Data Science.
Communication: data and data science are complex. You need to be able to really listen and understand the problem a customer wants solved. You need to be able to communicate your solution at the right level so your customer can take action.
Typical skills include report writing, presentation, public speaking and storytelling.
Data Modelling: you will encounter data from a variety of systems. You will need to organise your project data for flexible, efficient use over the course of a project. This means being able to model data.
Typical skills include database design, SQL, normal forms, table design, indexing.
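To make this concrete, here is a minimal sketch of normalised table design and indexing using Python's built-in sqlite3 module. The schema and data are illustrative, not from any real project.

```python
import sqlite3

# Illustrative schema: orders split into normalised tables (no repeated
# customer details per order) rather than one wide denormalised table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
        order_date  TEXT NOT NULL
    );
    -- Index the foreign key so per-customer lookups stay fast as data grows
    CREATE INDEX idx_orders_customer ON orders(customer_id);
""")
conn.execute("INSERT INTO customer VALUES (1, 'Acme Ltd')")
conn.execute("INSERT INTO orders VALUES (100, 1, '2015-06-01')")
rows = conn.execute(
    "SELECT c.name, o.order_date "
    "FROM orders o JOIN customer c USING (customer_id)"
).fetchall()
print(rows)  # → [('Acme Ltd', '2015-06-01')]
```

The same design thinking carries over whatever the database engine: separate entities into their own tables and index the columns you join and filter on.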
Data Wrangling: you will need to reshape data so it can be visualized, profiled and made ready for algorithms.
Typical skills include data manipulation libraries like pandas, languages like Python or R and visualization libraries.
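A typical wrangling task is pivoting long-format records (one row per measurement) into wide format (one row per entity). Here is a plain-Python sketch of the kind of reshape that pandas' pivot functions do for you; the sales figures are made up for illustration.

```python
# Long format: one row per (store, month) measurement
long_rows = [
    {"store": "A", "month": "Jan", "sales": 100},
    {"store": "A", "month": "Feb", "sales": 120},
    {"store": "B", "month": "Jan", "sales": 90},
]

# Wide format: one entry per store, months as keys
wide = {}
for row in long_rows:
    wide.setdefault(row["store"], {})[row["month"]] = row["sales"]

print(wide)  # → {'A': {'Jan': 100, 'Feb': 120}, 'B': {'Jan': 90}}
```

Libraries like pandas wrap this pattern (and its inverse, melting wide back to long) in a few tested, efficient calls, which is why they are worth learning.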
Programming and Tuning Algorithms: ultimately, you will produce an algorithm that captures your data science insights. Algorithms need to be tuned to data and their performance and robustness quantified. There will be scenarios where the algorithm works well and scenarios where it does not. There will be efficient structures and structures that give the same results but with dramatically reduced efficiency.
Typical skills include a programming language, data structures, code testing, version control, complexity.
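The point about structures that give the same results at dramatically different efficiency can be shown in a few lines: membership tests give identical answers on a list and a set, but the list scans every element while the set hashes straight to it.

```python
import timeit

# Same result, very different efficiency: membership tests are O(n) on a
# list but O(1) on average for a set.
items_list = list(range(100_000))
items_set = set(items_list)

list_time = timeit.timeit(lambda: 99_999 in items_list, number=200)
set_time = timeit.timeit(lambda: 99_999 in items_set, number=200)

assert (99_999 in items_list) == (99_999 in items_set)  # identical answers
print(f"list: {list_time:.4f}s  set: {set_time:.4f}s")  # set is far faster
```

Knowing which structure fits which access pattern is exactly the "complexity" skill listed above.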
Data pipelining: once data is fairly well understood and candidate algorithms have been identified, you will begin iterating through data and algorithms to get to your final insights. This iteration is most effective when data can be torn down and rebuilt repeatably and quickly. This is a data build.
Typical skills include SQL, pipeline management libraries, ETL design.
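A data build can be sketched as a chain of small, ordered, repeatable steps. The example below is a toy extract-transform-load pipeline (the raw rows are invented): because every step is deterministic and the load tears down before rebuilding, running it twice gives identical results.

```python
# A data build as small, repeatable steps: extract, transform, load.
def extract():
    return [" Alice ,34", "BOB,29", " alice ,34"]  # raw CSV-ish rows

def transform(rows):
    cleaned = []
    for row in rows:
        name, age = row.split(",")
        cleaned.append((name.strip().title(), int(age)))
    return sorted(set(cleaned))  # dedupe and order deterministically

def load(records, target):
    target.clear()          # tear down first, so reruns start clean
    target.extend(records)

warehouse = []
for _ in range(2):          # rebuilding twice yields identical results
    load(transform(extract()), warehouse)

print(warehouse)  # → [('Alice', 34), ('Bob', 29)]
```

Pipeline libraries and ETL tools formalise this pattern, adding dependency tracking and scheduling on top.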
Design of Experiments: Data Science is about applying the scientific method to understand data. Being able to design and execute an experiment is essential to being able to test an algorithm or model in the wild and demonstrate cause rather than correlation.
Typical skills include experiment layout, randomisation and blocking, statistical inference, and hypothesis testing.
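Randomisation and blocking can be illustrated in a few lines of standard-library Python. The customers and regions below are invented; the idea is that within each block (here, a region) customers are shuffled and split evenly between treatment and control, so block effects cannot confound the comparison.

```python
import random

# 12 hypothetical customers in two regional blocks
customers = [("c%02d" % i, "north" if i < 6 else "south") for i in range(12)]

rng = random.Random(42)       # fixed seed so the design is reproducible
assignment = {}
for block in ("north", "south"):
    members = [cid for cid, region in customers if region == block]
    rng.shuffle(members)      # randomise within the block
    half = len(members) // 2
    for cid in members[:half]:
        assignment[cid] = "treatment"
    for cid in members[half:]:
        assignment[cid] = "control"

n_treated = sum(1 for g in assignment.values() if g == "treatment")
print(n_treated)  # → 6: exactly half of each block is treated
```

Balanced assignment within blocks is what lets the later statistical inference attribute differences to the treatment rather than to the region.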
Consulting: consulting is about influencing without power. Significant numbers of data science projects fail because data scientists cannot convince their customers to change their business and use the data science. It may seem bizarre, but like all aspects of change, data science has an impact on people, existing processes and existing technology.
Typical skills include meeting and workshop facilitation, stakeholder mapping and management, and influencing.
Project Management: last but not least, if you cannot run a project (even as the sole data scientist on a project) then all the above is irrelevant. Project management is a hugely diverse and complex field. However, there are some key skills that will help your data science succeed.
Typical skills include estimation, planning, budgeting, resourcing, and RAID management.
You can read more about the competencies and skills to become a Data Scientist in my book Guerrilla Analytics, which you can read more about here (USA) and here (UK).
A/B tests! Machine learning! Deep learning! It’s easy to be distracted by new libraries and beautiful visualizations. It’s easy to waste time with scattergun approaches to data and algorithms. It’s easy to forget that as a data scientist you should be taking a scientific approach to understanding data.
Is scientific rigour appropriate in industry, or should it be confined to academic research? Is it not better to design an experiment instead of diving straight into the data? And should you care about reproducibility, or just move quickly to the next project?
Actually, the rigour of Science is essential for successful Data Science in business. The scientific method maps nicely to a sensible approach to data science projects in business. This post will show you how.
A definition of Science
Here is a definition of ‘science’ followed by what that means for data science in business. Science is the pursuit of knowledge through the scientific method: “systematic observation, measurement, and experiment, and the formulation, testing, and modification of hypotheses”.
Without getting into a whole field of philosophy, this ‘systematic’ scientific method generally involves the following steps.
Formulate a question
Formulate a hypothesis
Make a prediction
Test the prediction with an experiment
Analyse the results
So how should data science apply this scientific method in business?
The Rigour of Science is Essential for Successful Data Science in Business
Here are the steps in the scientific method and a data science example.
The Business Objective
Formulation of a question. This is the most challenging and most important step for science in general and data science in business. The question can be closed, like ‘why are our sales decreasing?’, or open-ended, like ‘can we solve problem X?’
Example: our buying and marketing teams might engage data science to ask ‘can we predict what customers want to buy?’
Success in business: If you are not clear on your objective, your project is heading for trouble. Without a well formulated question, data projects become mired in complexity, go off course and fail.
The Business Case
Formulation of a hypothesis (conjecture) about a population. This is a testable conjecture that rejects the status quo (in science jargon, the status quo is the null hypothesis and your conjecture is the alternative hypothesis).
Example: a hypothesis might be that a logistic regression applied to established customers (the population) will make better predictions than chance (or the current algorithm in use).
Success in business: this is your business case. This is the outcome that would cause the business to change the way they work.
Prediction. This is the logical consequence of the hypothesis. The less likely a prediction is to come true by coincidence, the stronger the case for rejecting the status quo when it does.
Example: we could predict an improvement in the number of correctly predicted purchases due to the new algorithm.
Success in business: this is an extension of the business case. If the business case were successful, then this is what we predict will happen.
Evaluating the Business Case
Testing with experiment. Here is where the prediction is tested in the real world. Experiment design is a field in its own right that I will blog about separately. For our customer algorithm, we might put it live on an online website or use some other means to get new predictions in front of real world customers.
Success in business: this is where we rigorously evaluate the business case. Experiment design is critical to counteract biases, random chance and external influences that we cannot control.
Analysing experiment results. This is where the data gathered from the experiment is analysed to determine if the status quo should be rejected. In the example, we would look for a significant difference in customer predictions using the new algorithm instead of the incumbent approach.
Success in business: this is where we rigorously decide whether the business case is a success. Because we are leveraging all the previous steps, we can conclude with confidence whether to reject the status quo and change the business.
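For the customer prediction example, one standard analysis is a two-proportion z-test on hit rates. This is a hedged sketch with invented numbers, not results from a real experiment; it uses only the Python standard library.

```python
import math

# Illustrative experiment results: correct purchase predictions out of
# 1,000 customers each for the new algorithm and the incumbent.
hits_new, n_new = 230, 1000        # new algorithm: 23.0% correct
hits_old, n_old = 180, 1000        # incumbent:     18.0% correct

p_new, p_old = hits_new / n_new, hits_old / n_old
p_pool = (hits_new + hits_old) / (n_new + n_old)   # pooled proportion
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_new + 1 / n_old))
z = (p_new - p_old) / se

# One-sided p-value via the normal CDF: P(Z >= z)
p_value = 0.5 * (1 - math.erf(z / math.sqrt(2)))

reject_status_quo = p_value < 0.05
print(round(z, 2), round(p_value, 4), reject_status_quo)
```

A small p-value here is what licenses the conclusion that the improvement is unlikely to be coincidence, which is exactly the prediction step of the method paying off.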
Rigour of Science drives good data science practices
In addition, the following scientific principles should be adhered to:
Repeatability. It should be possible to run an experiment again and get the same results and conclusions.
Success in business: this is how you have confidence that your results are generally applicable to the business. You can take that algorithm out of the lab and run it in production.
Reproducibility. It should be possible for somebody else to independently follow your experiment steps and get the same results and conclusions.
Success in business: many businesses are seasonal. Teams change, projects pause and are restarted. Reproducibility allows other teams to successfully inherit your work and use it with confidence.
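In code, repeatability and reproducibility often come down to controlling randomness. A minimal sketch: fixing the random seed makes a stochastic step (here, a hypothetical train/test split) give identical results on every run, so another team can recreate the experiment exactly.

```python
import random

def train_test_split(ids, seed):
    """Deterministic 80/20 split: the same seed always gives the same split."""
    rng = random.Random(seed)   # local RNG, so no global state is disturbed
    shuffled = ids[:]
    rng.shuffle(shuffled)
    cut = int(0.8 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

ids = list(range(10))
run1 = train_test_split(ids, seed=2015)
run2 = train_test_split(ids, seed=2015)

print(run1 == run2)  # → True: same seed, same split, same conclusions
```

Recording seeds (and library versions) alongside results is a cheap habit that makes both repeatability and hand-over to another team straightforward.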
Avoidance of bias. This can be human bias due to our inherent subjectivity but can also inadvertently be introduced through poor experiment design.
Success in business: This is how you stop those great results being extrapolated and interpolated in ways you never intended.
You can read more about how to bring scientific rigour to your data science in my book Guerrilla Analytics (UK) (USA). It contains simple principles for maintaining reproducibility and repeatability while delivering at pace.
If you would like to discuss further please get in touch!
Vague mixes of skill sets. A focus on activities and technology. Bizarre Venn diagrams. It seems there is huge confusion over what Data Science is. Is it Big Data? Isn’t it statistics? Is it something else entirely? This confusion causes untold problems. It leads to vendor and recruiter hype. It leads to inflated career expectations from those who work with data. It leads to rebranding of solid, established and much-needed fields like Analytics, Business Intelligence and Statistics.
Wouldn’t it be better if you could clearly state what you do as a Data Scientist? You probably agree your work life would be easier if your colleagues and customers could understand what you do.
A biologist wouldn’t say they are a biologist because they work with petri dishes; they are a biologist because they run experiments to understand life. However, some Data Science definitions focus on the use of tools like Hadoop.
A physicist wouldn’t say they are a physicist because they run simulations of their models; they are a physicist because they seek to understand matter. However, some Data Science definitions focus on activities like modelling, data cleaning and visualization.
All these sciences use statistics to design their experiments and test their hypotheses. Yet some Data Science definitions focus on overlaps of statistics with computer science and unicorns.
A Definition of Data Science
The secret to defining data science is to focus on the science. Here is a simple definition of Data Science:
Data Science is the application of the scientific method to find opportunities and efficiencies in business data
There are a few things to note about this definition:
it’s technology agnostic. It’s not about Big Data, Hadoop or whatever the next technology breakthrough might be.
it’s applied to finding opportunities and efficiencies in data. It’s not the study of data – that’s statistics.
it’s not about activities that may be part of the lifecycle of working with data.
most importantly, it uses the scientific method: “systematic observation, measurement, and experiment, and the formulation, testing, and modification of hypotheses”.
The application of the scientific method is central to data science and something I want to come back to in a more detailed post.