What is a Data Science Workflow?

Last updated Dec 7, 2024

A data science workflow defines the phases (or steps) in a data science project. A well-defined workflow is useful because it gives every team member a simple reminder of the work needed to complete the project.

Put another way, a well-defined data science workflow acts like a set of guardrails that help you plan, organize, and implement your data science project.

Existing Workflows

The concept of a data science workflow is not new, and there are many frameworks that a team can use. To show the diversity of possible workflow frameworks, this post describes:

  • One workflow used within a data science course at Harvard
  • Three workflows described in blog posts, where the authors discuss the specific workflow they use
  • Two more well-known frameworks used by numerous teams

The rest of this article explores these existing workflow frameworks, and at the end, provides an integrated view of them.

Data Science Workflow Taught at Harvard

Let’s start with Blitzstein & Pfister’s workflow, which is used in Harvard’s introductory data science course. It is a five-phase framework, and the process acknowledges that a project iterates between the phases.

The five phases are:

  1. Ask an interesting question
  2. Get the data
  3. Explore the data
  4. Model the data
  5. Communicate and visualize the results

 

Blogs Describing a Data Science Workflow

Perhaps not surprisingly, numerous blog posts explain their authors’ own workflows.

Aakash Tandel’s Workflow

For example, Aakash Tandel describes a high-level data science workflow intended to serve as an example for new data scientists.

It includes the following five logical steps:

  1. Understand the objective
  2. Import the data
  3. Explore and clean the data
  4. Model the data
  5. Communicate the results

Aakanksha Joshi’s Workflow

In a different blog, Aakanksha Joshi discussed using a data science workflow leveraging IBM’s Watson Studio Cloud, but the workflow could be useful independent of the technology stack used.

In her blog, Joshi describes five linear phases:

  1. Connect & access data
  2. Search and find relevant data
  3. Prepare for data analysis
  4. Build/train/deploy models
  5. Monitor/analyze/manage models

Philip Guo’s Workflow

A more advanced framework was described by Philip Guo. As shown below, it has four main phases.

  1. Preparation of the data, then alternating between
  2. Analysis and
  3. Reflection to interpret the outputs, and finally
  4. Dissemination of results in the form of written reports and/or executable code.

Note that Guo’s workflow has more detailed steps, and equally important, explicitly acknowledges that there are “loops” within a project.
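The loop between analysis and reflection can be sketched in a few lines of Python. This is only an illustrative sketch, not Guo’s own code: the function names, the toy data, and the outlier-threshold rule used as the "reflection" check are all assumptions made for the example.

```python
from statistics import mean

def prepare(raw):
    """Preparation: drop missing values and convert the rest to floats."""
    return [float(x) for x in raw if x is not None]

def analyze(data, threshold):
    """Analysis: average the values at or below a (hypothetical) outlier threshold.

    Assumes at least one value passes the threshold; mean() raises otherwise.
    """
    kept = [x for x in data if x <= threshold]
    return {"threshold": threshold, "n": len(kept), "mean": mean(kept)}

def reflect(result, max_mean):
    """Reflection: judge whether the output looks plausible (toy rule)."""
    return result["mean"] <= max_mean

def disseminate(result):
    """Dissemination: turn the final result into a short written summary."""
    return (f"mean={result['mean']:.2f} over {result['n']} values "
            f"(threshold={result['threshold']})")

def run_workflow(raw, thresholds, max_mean):
    data = prepare(raw)
    history = []
    for threshold in thresholds:       # the analysis/reflection loop
        result = analyze(data, threshold)
        history.append(result)
        if reflect(result, max_mean):  # satisfied, so stop iterating
            break
    return disseminate(history[-1]), history

report, history = run_workflow([1, 2, None, 3, 100],
                               thresholds=[1000, 10], max_mean=5)
print(report)
```

Keeping a `history` of each pass is a deliberate choice here: it records what was tried before the team settled on a result, which is the kind of trace a looping workflow tends to need.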

Well-known Data Science Workflows

Beyond blogs, several workflow frameworks have become fairly well known. Two are reviewed below: CRISP-DM and OSEMN.

CRISP-DM

Defined to standardize the data mining process across industries, the CRoss-Industry Standard Process for Data Mining (CRISP-DM) is the most well-known framework used to define a data science workflow. The standard CRISP-DM diagram depicts six iterative phases.

Each phase (business understanding, data understanding, data preparation, modeling, evaluation, and deployment) has its own defined tasks and set of deliverables, such as documentation and reports. Projects can “loop back” to a previous phase as needed.

OSEMN

OSEMN (rhymes with possum) was first described in 2010. It has five phases for a data science project: Obtain, Scrub, Explore, Model, and iNterpret. As noted by one of the authors of that blog,

OSEMN:

Documents the process of data science … this seems completely obvious to those of us in this room today, right. You get some data, you clean it up, play around with it, you build a robust model and, you know, you make a graph, or write about it. This was not obvious in 2010.
– Hilary Mason

This is a good point – data science workflows typically seem “obvious” to data scientists, but there is still value in defining a workflow for your project, in that it helps to ensure everyone on the team understands the work to be done and also helps to make sure that a step is not skipped or forgotten.
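One lightweight way to make sure a step is not skipped is to treat the workflow as an ordered checklist. The sketch below uses the five OSEMN phases; the helper names and the example project status are hypothetical and exist only for illustration.

```python
OSEMN_PHASES = ["Obtain", "Scrub", "Explore", "Model", "iNterpret"]

def missing_phases(completed, phases=OSEMN_PHASES):
    """Return the phases not yet marked complete, in workflow order."""
    done = set(completed)
    return [p for p in phases if p not in done]

def next_phase(completed, phases=OSEMN_PHASES):
    """Return the earliest unfinished phase, or None when all are done."""
    remaining = missing_phases(completed, phases)
    return remaining[0] if remaining else None

# Hypothetical status: the team jumped to modeling without scrubbing the data.
completed = ["Obtain", "Explore", "Model"]
print(missing_phases(completed))  # ['Scrub', 'iNterpret']
print(next_phase(completed))      # Scrub
```

Because the check compares against the full phase list rather than the last completed step, a phase that was skipped earlier still shows up as outstanding.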

Comparing the Different Workflows

So, what are the similarities and differences between these workflow frameworks?
These frameworks all typically focus on the steps in a data science project (or skills needed by a data scientist). Perhaps the most significant difference is that some explicitly discuss the need to loop back to a previous phase. Another difference is that some focus more on understanding the business context, and yet another difference is that some focus on the deployment of models.

To explore the different workflows, the table below shows the frameworks discussed and compares the phases defined within each. This table might help you decide which phases are best for your team, and whether to use one of these previously defined workflows or define your own by selecting the phases that make the most sense for your team.

 

| Phase | Harvard | CRISP-DM | OSEMN | Guo’s | Tandel’s | Joshi’s |
|---|---|---|---|---|---|---|
| Understand | Ask an interesting question | Business understanding | | | Understand the objective | |
| Acquire | Get the data | Data understanding | Obtain | Prepare (acquire) | Import the data | Find, connect and access data |
| Clean | | Data preparation | Scrub | Prepare (clean) | Clean the data | Prepare the data |
| Explore | Explore the data | | Explore | | Explore the data | |
| Model | Model the data | Modeling | Model | Analysis | Model the data | Build models |
| Evaluate | | Evaluation | iNterpret | Reflect | | |
| Communicate | Communicate / Visualize | | Disseminate | | Communicate results | |
| Deploy | | Deployment | | | | Deploy models |
| Monitor | | | | | | Monitor models |
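If you want to query the comparison rather than read it, the table can be encoded as a small Python mapping. This is a sketch under stated assumptions: the cell assignments reflect one reading of the table above, and the names `COMPARISON`, `phases_covered`, and `frameworks_covering` are hypothetical helpers, not part of any of the frameworks.

```python
FRAMEWORKS = ["Harvard", "CRISP-DM", "OSEMN", "Guo", "Tandel", "Joshi"]

# Generic phase -> {framework: that framework's own name for the phase}.
# A framework is absent from a row when it has no matching phase.
COMPARISON = {
    "Understand": {"Harvard": "Ask an interesting question",
                   "CRISP-DM": "Business understanding",
                   "Tandel": "Understand the objective"},
    "Acquire": {"Harvard": "Get the data", "CRISP-DM": "Data understanding",
                "OSEMN": "Obtain", "Guo": "Prepare (acquire)",
                "Tandel": "Import the data",
                "Joshi": "Find, connect and access data"},
    "Clean": {"CRISP-DM": "Data preparation", "OSEMN": "Scrub",
              "Guo": "Prepare (clean)", "Tandel": "Clean the data",
              "Joshi": "Prepare the data"},
    "Explore": {"Harvard": "Explore the data", "OSEMN": "Explore",
                "Tandel": "Explore the data"},
    "Model": {"Harvard": "Model the data", "CRISP-DM": "Modeling",
              "OSEMN": "Model", "Guo": "Analysis",
              "Tandel": "Model the data", "Joshi": "Build models"},
    "Evaluate": {"CRISP-DM": "Evaluation", "OSEMN": "iNterpret",
                 "Guo": "Reflect"},
    "Communicate": {"Harvard": "Communicate / Visualize",
                    "Guo": "Disseminate", "Tandel": "Communicate results"},
    "Deploy": {"CRISP-DM": "Deployment", "Joshi": "Deploy models"},
    "Monitor": {"Joshi": "Monitor models"},
}

def phases_covered(framework):
    """List the generic phases a framework addresses, in table order."""
    return [phase for phase, row in COMPARISON.items() if framework in row]

def frameworks_covering(phase):
    """List the frameworks that define a step for the given generic phase."""
    return [f for f in FRAMEWORKS if f in COMPARISON[phase]]

print(phases_covered("CRISP-DM"))
print(frameworks_covering("Deploy"))
```

A team defining its own workflow could start from a structure like this, pick the generic phases it needs, and see which existing frameworks already describe them.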

 

Learn More

While many of these workflows note the need to loop back to a previous phase, none of these workflows explicitly describe how a team should determine when to loop back (or when to progress to the next phase).

That is why a workflow is just part of a team’s overall process framework.
