
What is a Data Science Workflow?

A data science workflow defines the phases (or steps) in a data science project. A well-defined workflow is useful because it gives everyone on a team a simple reminder of the work a data science project requires.

One way to think about the benefit of having a well-defined data science workflow is that your project’s workflow is like a set of guardrails to help you plan, organize, and implement your data science project.

Existing Workflows

The concept of a data science workflow is not new, and there are many workflow frameworks that a team can use. To show the diversity of possible workflow frameworks, this post describes:

  • One workflow used within a data science course at Harvard
  • Three workflows that practitioners describe in their own blog posts
  • Two well-known frameworks that have been used by numerous teams

Data Science Workflow Training

The rest of this article explores these existing workflow frameworks, and at the end, provides an integrated view of the different workflows.

If you want more details on workflows and how to integrate them within your project, explore training from the Data Science Process Alliance.


Data Science Workflow Taught at Harvard

Let’s start with Blitzstein & Pfister’s workflow, which is used in Harvard’s introductory data science course. It is a five-phase framework, and it acknowledges that a project iterates between the phases. The five phases are:

  1. Ask an interesting question
  2. Get the data
  3. Explore the data
  4. Model the data
  5. Communicate and visualize the results

Figure: Harvard Data Science Workflow

Blogs Describing a Data Science Workflow

Perhaps not surprisingly, there are numerous blog posts in which people explain their own workflows.

Aakash Tandel’s Workflow

For example, Aakash Tandel describes a high-level data science workflow intended to serve as an example for new data scientists. It includes the following five logical steps:

  1. Understand the objective
  2. Import the data
  3. Explore and clean the data
  4. Model the data
  5. Communicate the results.

Aakanksha Joshi’s Workflow

In a different blog post, Aakanksha Joshi discusses a data science workflow built on IBM’s Watson Studio Cloud, although the workflow could be useful regardless of the technology stack. Joshi describes five linear phases:

  1. Connect & access data
  2. Search and find relevant data
  3. Prepare for data analysis
  4. Build/train/deploy models
  5. Monitor/analyze/manage models

Philip Guo’s Workflow

A more advanced framework was described by Philip Guo. As shown below, it has four main phases.

  1. Preparation of the data, then alternating between
  2. Analysis and
  3. Reflection to interpret the outputs, and finally
  4. Dissemination of results in the form of written reports and/or executable code.

Figure: Guo’s Data Science Workflow

Note that Guo’s workflow has more detailed steps, and equally important, explicitly acknowledges that there are “loops” within a project.
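
The loop between analysis and reflection can be sketched in a few lines of Python. This is only an illustration of the shape of the workflow: the toy readings, the trimmed-mean analysis, and the stability tolerance are all assumptions made up for this sketch and are not taken from Guo’s description.

```python
def prepare(raw):
    """Preparation: drop missing readings (a stand-in for real data cleaning)."""
    return [x for x in raw if x is not None]


def analyze(data, trim):
    """Analysis: mean after dropping `trim` values from each end of the sorted data."""
    ordered = sorted(data)
    kept = ordered[trim:len(ordered) - trim] if trim else ordered
    return sum(kept) / len(kept)


def reflect(previous, current, tolerance=0.5):
    """Reflection: is the new result close enough to the last one to stop iterating?"""
    return previous is not None and abs(current - previous) <= tolerance


def disseminate(result, rounds):
    """Dissemination: turn the final result into a short written report."""
    return f"Trimmed mean settled at {result:.2f} after {rounds} analysis rounds."


# Hypothetical raw readings with gaps and one obvious outlier.
raw = [10, 11, None, 12, 11, 95, 10, None, 12]
data = prepare(raw)

previous = None
rounds = 0
for trim in range(3):
    current = analyze(data, trim)
    rounds += 1
    if reflect(previous, current):
        break  # the result is stable, so leave the analysis/reflection loop
    previous = current

report = disseminate(current, rounds)
print(report)
```

The point of the sketch is the control flow: preparation and dissemination each run once, while analysis and reflection repeat until reflection decides the result is good enough.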

Exploring Well-known Data Science Workflow Frameworks

In addition to blogs, several workflow frameworks have become fairly well known. Two of them, CRISP-DM and OSEMN, are reviewed below.

CRISP-DM

CRISP-DM Lifecycle

Defined to standardize a data mining process across industries, the CRoss-Industry Standard Process for Data Mining (CRISP-DM) is the most well-known framework used to define a data science workflow. As shown in the standard CRISP-DM visual workflow, it describes six iterative phases: business understanding, data understanding, data preparation, modeling, evaluation, and deployment. Each phase has its own defined tasks and set of deliverables, such as documentation and reports. Projects can “loop back” to a previous phase as needed.
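
As a rough sketch, the six phases and the ability to loop back can be modeled as an ordered enumeration with a small transition function. The names `Phase` and `next_phase`, and the example sequence of decisions, are hypothetical and are not part of CRISP-DM itself. The rule that a project may loop back to any earlier phase is a simplifying assumption.

```python
from enum import IntEnum


class Phase(IntEnum):
    """The six CRISP-DM phases, in their nominal order."""
    BUSINESS_UNDERSTANDING = 1
    DATA_UNDERSTANDING = 2
    DATA_PREPARATION = 3
    MODELING = 4
    EVALUATION = 5
    DEPLOYMENT = 6


def next_phase(current, loop_back_to=None):
    """Advance one phase, or loop back to an earlier phase if one is given."""
    if loop_back_to is not None:
        if loop_back_to >= current:
            raise ValueError("can only loop back to an earlier phase")
        return loop_back_to
    if current is Phase.DEPLOYMENT:
        return current  # already at the last phase
    return Phase(current + 1)


# Hypothetical project: evaluation sends the team back to data preparation once.
decisions = [None, None, None, None, Phase.DATA_PREPARATION, None, None, None]
history = [Phase.BUSINESS_UNDERSTANDING]
for decision in decisions:
    history.append(next_phase(history[-1], decision))

print(" -> ".join(phase.name for phase in history))
```

Keeping the visited phases in `history` makes each loop-back visible, which can be useful when a team reviews how a project actually progressed.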

OSEMN

OSEMN (rhymes with possum) was first described in a 2010 blog post explaining a Taxonomy of Data Science. OSEMN describes five phases for a data science project: Obtain, Scrub, Explore, Model, and iNterpret. As noted by one of the authors of that blog post, OSEMN:

Documents the process of data science … this seems completely obvious to those of us in this room today, right. You get some data, you clean it up, play around with it, you build a robust model and, you know, you make a graph, or write about it. This was not obvious in 2010.

Hilary Mason

This is a good point – data science workflows typically seem “obvious” to data scientists, but there is still value in defining a workflow for your project, in that it helps to ensure everyone on the team understands the work to be done and also helps to make sure that a step is not skipped or forgotten.
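
To make the five OSEMN phases concrete, here is a minimal sketch that maps each phase to one small function. The data (study hours versus exam scores), the function names, and the choice of a simple least-squares line are assumptions for illustration only. A real project would swap in its own data sources and models.

```python
import csv
import io

# Hypothetical toy data: study hours vs. exam score, with one missing value.
RAW_CSV = """hours,score
1,52
2,58
,61
4,70
5,75
"""


def obtain(text):
    """Obtain: read the raw records (here, from an in-memory CSV string)."""
    return list(csv.DictReader(io.StringIO(text)))


def scrub(rows):
    """Scrub: drop rows with missing fields and convert strings to numbers."""
    return [(float(r["hours"]), float(r["score"]))
            for r in rows if r["hours"] and r["score"]]


def explore(pairs):
    """Explore: compute a few summary statistics."""
    n = len(pairs)
    return {"n": n,
            "mean_hours": sum(h for h, _ in pairs) / n,
            "mean_score": sum(s for _, s in pairs) / n}


def model(pairs):
    """Model: fit score = slope * hours + intercept by ordinary least squares."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def interpret(slope, intercept):
    """iNterpret: state the result in plain language."""
    return (f"Each extra hour of study is associated with about "
            f"{slope:.1f} more points (baseline {intercept:.1f}).")


rows = obtain(RAW_CSV)
pairs = scrub(rows)
summary = explore(pairs)
slope, intercept = model(pairs)
print(interpret(slope, intercept))
```

Writing each phase as its own function also gives a team a checklist: if one of the five functions is missing, a step has been skipped.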

Comparing the Different Workflows

So, what are the similarities and differences between these workflow frameworks?

These frameworks all typically focus on the steps in a data science project (or skills needed by a data scientist). Perhaps the most significant difference is that some explicitly discuss the need to loop back to a previous phase. Another difference is that some focus more on understanding the business context, and yet another difference is that some focus on the deployment of models.

The table below compares the phases defined within each of the frameworks discussed. It might help you decide which phases are best for your team, and whether to use one of these previously defined workflows or to define your own by selecting the phases that make the most sense for your team.

| Phase | Harvard | CRISP-DM | OSEMN | Guo’s | Tandel’s | Joshi’s |
| --- | --- | --- | --- | --- | --- | --- |
| Understand | Ask an interesting question | Business understanding | | | Understand the objective | |
| Acquire | Get the data | Data understanding | Obtain | Prepare (acquire) | Import the data | Find, connect and access data |
| Clean | | Data preparation | Scrub | Prepare (clean) | Clean the data | Prepare the data |
| Explore | Explore the data | | Explore | | Explore the data | |
| Model | Model the data | Modeling | Model | Analysis | Model the data | Build models |
| Evaluate | | Evaluation | iNterpret | Reflect | | |
| Communicate | Communicate / Visualize | | | Disseminate | Communicate results | |
| Deploy | | Deployment | | | | Deploy models |
| Monitor | | | | | | Monitor models |
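
The comparison above can also be captured as data, which makes it easy to ask which frameworks cover the phases your team cares about. The sketch below is one possible encoding. The helper names `frameworks_covering` and `missing_phases` are hypothetical, and the assignment of cells to frameworks is an interpretation of the table, so treat it as an assumption and adjust it where your reading differs.

```python
# Generic phase labels, in the order used by the comparison table.
ALL_PHASES = ["Understand", "Acquire", "Clean", "Explore", "Model",
              "Evaluate", "Communicate", "Deploy", "Monitor"]

# Each framework maps a generic phase to its own name for that phase.
# This assignment is one reading of the table, not an authoritative mapping.
FRAMEWORK_PHASES = {
    "Harvard": {"Understand": "Ask an interesting question",
                "Acquire": "Get the data",
                "Explore": "Explore the data",
                "Model": "Model the data",
                "Communicate": "Communicate / Visualize"},
    "CRISP-DM": {"Understand": "Business understanding",
                 "Acquire": "Data understanding",
                 "Clean": "Data preparation",
                 "Model": "Modeling",
                 "Evaluate": "Evaluation",
                 "Deploy": "Deployment"},
    "OSEMN": {"Acquire": "Obtain", "Clean": "Scrub", "Explore": "Explore",
              "Model": "Model", "Evaluate": "iNterpret"},
    "Guo": {"Acquire": "Prepare (acquire)", "Clean": "Prepare (clean)",
            "Model": "Analysis", "Evaluate": "Reflect",
            "Communicate": "Disseminate"},
    "Tandel": {"Understand": "Understand the objective",
               "Acquire": "Import the data",
               "Clean": "Clean the data",
               "Explore": "Explore the data",
               "Model": "Model the data",
               "Communicate": "Communicate results"},
    "Joshi": {"Acquire": "Find, connect and access data",
              "Clean": "Prepare the data",
              "Model": "Build models",
              "Deploy": "Deploy models",
              "Monitor": "Monitor models"},
}


def frameworks_covering(required):
    """Return the frameworks that define every phase in `required`."""
    required = set(required)
    return [name for name, phases in FRAMEWORK_PHASES.items()
            if required <= set(phases)]


def missing_phases(name):
    """Return the generic phases a framework does not define."""
    return [p for p in ALL_PHASES if p not in FRAMEWORK_PHASES[name]]


print(frameworks_covering({"Understand", "Deploy"}))
print(missing_phases("OSEMN"))
```

A team that needs both business context and deployment, for example, could use `frameworks_covering` to shortlist candidates and `missing_phases` to see what it would have to add to any other framework.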

Learn More

While many of these workflows note the need to loop back to a previous phase, none of them explicitly describes how a team should determine when to loop back (or when to progress to the next phase).

That is why a workflow is just part of a team’s overall process framework. To learn more, you can explore:
