A data science workflow defines the phases (or steps) in a data science project. A well-defined workflow is useful because it gives everyone on a team a simple reminder of the work that a data science project requires.
One way to think about the benefit of having a well-defined data science workflow is that your project’s workflow is like a set of guardrails to help you plan, organize, and implement your data science project.
The concept of a data science workflow is not new, and there are many workflow frameworks that a team can use. To show the diversity of possible workflow frameworks, this post describes:
- One workflow used within a data science course at Harvard
- Three workflows that practitioners describe in their own blog posts
- Two better-known frameworks that have been used by numerous teams
The rest of this article explores these existing workflow frameworks, and at the end, provides an integrated view of the different workflows.
Data Science Workflow Taught at Harvard
Let’s start with Blitzstein & Pfister’s workflow, which is used in Harvard’s introductory data science course. It is a five-phase framework that acknowledges iteration between the phases. The five phases are:
- Ask an interesting question
- Get the data
- Explore the data
- Model the data
- Communicate and visualize the results
Blogs Describing a Data Science Workflow
Perhaps not surprisingly, there are numerous blog posts in which people explain their own workflows.
Aakash Tandel’s Workflow
For example, Aakash Tandel describes a high-level data science workflow intended to serve as an example for new data scientists. It includes the following five logical steps:
- Understand the objective
- Import the data
- Explore and clean the data
- Model the data
- Communicate the results.
Aakanksha Joshi’s Workflow
In a different blog post, Aakanksha Joshi discusses a data science workflow that leverages IBM’s Watson Studio Cloud, although the workflow could be useful independent of the technology stack used. Joshi describes five linear phases:
- Connect & access data
- Search and find relevant data
- Prepare for data analysis
- Build/train/deploy models
- Monitor/analyze/manage models
Philip Guo’s Workflow
A more advanced framework was described by Philip Guo. It has four main phases:
- Preparation of the data, then alternating between
- Analysis and
- Reflection to interpret the outputs, and finally
- Dissemination of results in the form of written reports and/or executable code.
Note that Guo’s workflow has more detailed steps, and equally important, explicitly acknowledges that there are “loops” within a project.
Exploring Well-known Data Science Workflow Frameworks
In addition to blogs, there are several workflow frameworks that have become fairly well known, so two are reviewed below (CRISP-DM and OSEMN).
CRISP-DM: Defined to standardize a data mining process across industries, the CRoss-Industry Standard Process for Data Mining (CRISP-DM) is the most well-known framework used to define a data science workflow. As shown in the standard CRISP-DM visual workflow, it describes six iterative phases: business understanding, data understanding, data preparation, modeling, evaluation, and deployment. Each phase has its own defined tasks and set of deliverables, such as documentation and reports. Projects can “loop back” to a previous phase as needed.
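As a rough illustration of the six phases and the “loop back” idea, here is a minimal Python sketch. The `Phase` enum mirrors the phase names listed above; the `ProjectTracker` class and its methods are hypothetical names invented for this example and are not part of CRISP-DM itself.

```python
from enum import IntEnum


class Phase(IntEnum):
    """The six CRISP-DM phases, in their nominal order."""
    BUSINESS_UNDERSTANDING = 1
    DATA_UNDERSTANDING = 2
    DATA_PREPARATION = 3
    MODELING = 4
    EVALUATION = 5
    DEPLOYMENT = 6


class ProjectTracker:
    """Hypothetical helper that records which phase a project is in."""

    def __init__(self):
        self.current = Phase.BUSINESS_UNDERSTANDING
        self.history = [self.current]

    def advance(self):
        """Move to the next phase; stay put if already at deployment."""
        if self.current < Phase.DEPLOYMENT:
            self.current = Phase(self.current + 1)
            self.history.append(self.current)
        return self.current

    def loop_back(self, target):
        """Return to an earlier phase, e.g. after a disappointing evaluation."""
        if target >= self.current:
            raise ValueError("loop_back target must be an earlier phase")
        self.current = target
        self.history.append(self.current)
        return self.current


tracker = ProjectTracker()
for _ in range(4):  # advance from business understanding to evaluation
    tracker.advance()
tracker.loop_back(Phase.DATA_PREPARATION)  # evaluation sends us back
print([phase.name for phase in tracker.history])
```

Keeping a history of visited phases is one simple way to make loops visible to the whole team; how a team decides when to loop back is left open here.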
OSEMN (rhymes with possum) was first described in a 2010 blog explaining a Taxonomy of Data Science. OSEMN describes five phases for a data science project: Obtain, Scrub, Explore, Model, and iNterpret. As noted by one of the authors of that blog, OSEMN:
Documents the process of data science … this seems completely obvious to those of us in this room today, right. You get some data, you clean it up, play around with it, you build a robust model and, you know, you make a graph, or write about it. This was not obvious in 2010. – Hilary Mason
This is a good point – data science workflows typically seem “obvious” to data scientists, but there is still value in defining a workflow for your project, in that it helps to ensure everyone on the team understands the work to be done and also helps to make sure that a step is not skipped or forgotten.
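To make the five OSEMN phases concrete, the sketch below maps each phase to one small Python function. Everything in it is an assumption made for illustration: the function names, the made-up records, and the deliberately trivial “model” (a mean) are not taken from the original OSEMN post.

```python
from statistics import mean


def obtain():
    """Obtain: stand-in for loading raw records (values are made up)."""
    return [("a", "1.0"), ("b", "2.0"), ("c", None), ("d", "3.0"), ("e", "oops")]


def scrub(rows):
    """Scrub: drop records whose value is missing or not numeric."""
    clean = []
    for key, raw in rows:
        try:
            clean.append((key, float(raw)))
        except (TypeError, ValueError):
            continue
    return clean


def explore(rows):
    """Explore: compute a few summary statistics."""
    values = [value for _, value in rows]
    return {"count": len(values), "min": min(values), "max": max(values)}


def model(rows):
    """Model: a deliberately trivial 'model' that predicts the mean."""
    return mean(value for _, value in rows)


def interpret(summary, prediction):
    """iNterpret: turn the numbers into a sentence for stakeholders."""
    return f"{summary['count']} usable records; mean value {prediction:.2f}"


rows = scrub(obtain())
report = interpret(explore(rows), model(rows))
print(report)
```

Writing each phase as its own function makes it harder to skip a step silently, which is the practical benefit of naming the phases in the first place.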
Comparing the Different Workflows
So, what are the similarities and differences between these workflow frameworks?
These frameworks all typically focus on the steps in a data science project (or skills needed by a data scientist). Perhaps the most significant difference is that some explicitly discuss the need to loop back to a previous phase. Another difference is that some focus more on understanding the business context, and yet another difference is that some focus on the deployment of models.
To explore the different workflows, the table below shows the frameworks discussed, and compares and contrasts the different phases defined within each workflow framework. This table might help you decide what phases are best for your team, and if you want to use one of these previously defined workflows or define your own, based on selecting the phases that make the most sense for your team.
| Phase | Harvard (Blitzstein & Pfister) | CRISP-DM | OSEMN | Guo | Tandel | Joshi |
|---|---|---|---|---|---|---|
| Understand | Ask an interesting question | Business understanding | | | Understand the objective | |
| Acquire | Get the data | Data understanding | Obtain | Prepare (acquire) | Import the data | Find, connect and access data |
| Clean | | Data preparation | Scrub | Prepare (clean) | Clean the data | Prepare the data |
| Explore | Explore the data | | Explore | | Explore the data | |
| Model | Model the data | Modeling | Model | Analysis | Model the data | Build models |
| Communicate | Communicate / Visualize | | | Disseminate | Communicate results | |
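The comparison in the table can also be expressed as a small lookup structure, which makes it easy to ask which frameworks name a given phase. The sketch below is an assumption-laden illustration: the assignment of each cell to a framework is inferred from the phase names listed earlier in this article, and the short framework labels and helper function names are hypothetical.

```python
# Framework labels are shorthand chosen for this example.
FRAMEWORKS = ["Harvard", "CRISP-DM", "OSEMN", "Guo", "Tandel", "Joshi"]

# Generic phase -> {framework: that framework's name for the phase}.
# A framework is omitted from a row when the table has no entry for it.
COMPARISON = {
    "Understand": {
        "Harvard": "Ask an interesting question",
        "CRISP-DM": "Business understanding",
        "Tandel": "Understand the objective",
    },
    "Acquire": {
        "Harvard": "Get the data",
        "CRISP-DM": "Data understanding",
        "OSEMN": "Obtain",
        "Guo": "Prepare (acquire)",
        "Tandel": "Import the data",
        "Joshi": "Find, connect and access data",
    },
    "Clean": {
        "CRISP-DM": "Data preparation",
        "OSEMN": "Scrub",
        "Guo": "Prepare (clean)",
        "Tandel": "Clean the data",
        "Joshi": "Prepare the data",
    },
    "Explore": {
        "Harvard": "Explore the data",
        "OSEMN": "Explore",
        "Tandel": "Explore the data",
    },
    "Model": {
        "Harvard": "Model the data",
        "CRISP-DM": "Modeling",
        "OSEMN": "Model",
        "Guo": "Analysis",
        "Tandel": "Model the data",
        "Joshi": "Build models",
    },
    "Communicate": {
        "Harvard": "Communicate / Visualize",
        "Guo": "Disseminate",
        "Tandel": "Communicate results",
    },
}


def frameworks_covering(phase):
    """List the frameworks that have an entry for the given generic phase."""
    return [name for name in FRAMEWORKS if name in COMPARISON[phase]]


def phases_covered(framework):
    """List the generic phases for which the framework has an entry."""
    return [phase for phase, cells in COMPARISON.items() if framework in cells]


for phase in COMPARISON:
    print(f"{phase}: {', '.join(frameworks_covering(phase))}")
```

A team defining its own workflow could start from a structure like this and keep only the phases it needs.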
While many of these workflows note the need to loop back to a previous phase, none of these workflows explicitly describe how a team should determine when to loop back (or when to progress to the next phase).
That is why a workflow is just part of a team’s overall process framework.