The Data Science Process

Three steps to define an effective data science process

Conceptually, a data science process explains and defines how a team should execute a project. Having a robust, repeatable process helps to ensure that the project efficiently and effectively delivers actionable insight.

In this article, I’ll explore how to create a well-defined data science process in three steps:

  1. Select a data science life cycle
  2. Select a coordination framework
  3. Integrate the above two frameworks

Following these steps will help your team deliver better project outcomes.

A data science process is not just a life cycle

I often ask people who lead data science teams to describe their data science process. Most describe a data science life cycle. The life cycle (sometimes called workflow) is the set of steps to do a data science project. This typically includes obtaining data, cleaning the data, and then creating a machine learning model.

For example, Raj Bandyopadhyay’s post describes his data science process. His process is defined by the conceptual steps to execute a data science project. These steps include framing the problem; collecting, processing, exploring and then modeling the data; and finally communicating the results. Similarly, Chanin Nantasenamat’s data science process defines a similar life cycle. Chanin’s steps are to collect, clean, explore, model and then deploy. As Chanin explicitly notes, both of these are based on the CRISP-DM data science life cycle framework.

Defining a life cycle is certainly useful. However, defining a life cycle is not the same as defining a robust data science team process.

Figure: Creating an effective data science team process requires answering three key questions (image © unsplash.com)

Shortcomings of just using a data science life cycle

Indeed, having a well-defined data science life cycle is certainly an important aspect of a team’s process.

But, if you talk just about your team’s life cycle, you miss a key aspect of the process! Namely, how the team should coordinate their work. This is critical as data science is becoming more of a team sport with a diverse set of roles.

Most life cycle frameworks explicitly note that the team might need to “loop back” to a previous phase. However, these frameworks do not define when (or why or how) the team should loop back to a previous phase.

So, if a data science team just uses a life cycle framework, the team would still need to define how/when to loop back to a previous phase.

Problems if you don’t have a data science process

That is why you also need to define the process of how the team prioritizes work and communicates information across the project team (which I refer to as the “data science collaboration process”). Without effective communication across the team, your stakeholders might think that:

  • The model/insight generated is not useful (or they don’t trust the data and/or the model).
  • The data science team is not productive (stakeholders don’t understand what is required to do a full machine learning project).
  • The data science team is not focused on the highest priority tasks. In other words, there is no clear way for stakeholders to coordinate and collaborate with the data science team.

In many ways, the process data science teams use is similar to how software teams were led 30 years ago. That is, teams focus on what to do, but not how to do it.

Benefits of a data science process

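While it does take some time and energy, defining a robust data science process is a worthwhile effort. By addressing the three key questions listed below (which life cycle to use, which coordination framework to use, and how to integrate the two), one can more efficiently and effectively lead a data science team.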

This improvement comes from the data science team and stakeholders sharing a common vocabulary for the work that needs to get done (for example, to implement a machine learning model). That vocabulary also makes it easier to discuss with stakeholders how to prioritize potential efforts. Finally, it helps ensure that the insights generated from the machine learning models are actionable.

Steps to define a data science process

So, to help a team define an effective data science process, the rest of this post addresses these three key questions:

  1. What data science life cycle (data science workflow process) might a team use during a project?
  2. What framework could be used to help teams improve how they work together?
  3. How should a data science team integrate their life cycle framework with their coordination framework?

Step 1: Select a data science life cycle

There are many data science life cycles (workflows). They tend to be similar in defining the data science steps necessary to deliver a project. We’ll review CRISP-DM and TDSP — two of the most prominent. Read my other post for additional data science workflows.

CRISP-DM

CRISP-DM was designed in the 1990s. It is the most commonly used framework for describing the steps in a data science project. CRISP-DM defines six phases of a project:

  1. Business Understanding: determine business objectives; assess situation; determine data mining goals; produce project plan
  2. Data Understanding: collect initial data; describe data; explore data; verify data quality
  3. Data Preparation (generally, the most time-consuming phase): select data; clean data; construct data; integrate data; format data
  4. Modeling: select modeling technique; generate test design; build model; assess model
  5. Evaluation: evaluate results; review process; determine next steps
  6. Deployment: plan deployment; plan monitoring and maintenance; produce final report; review project
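
As a rough illustration, the six phases and their generic tasks listed above can be written down as a small data structure. This is only a sketch: the names `Phase`, `TASKS`, `next_phase` and `can_loop_back` are hypothetical and not part of CRISP-DM itself. Note that the helper only checks whether a loop back points to an earlier phase; the life cycle does not say when or why the team should loop back.

```python
from enum import Enum
from typing import Optional


class Phase(Enum):
    """The six CRISP-DM phases, numbered in their nominal order."""
    BUSINESS_UNDERSTANDING = 1
    DATA_UNDERSTANDING = 2
    DATA_PREPARATION = 3
    MODELING = 4
    EVALUATION = 5
    DEPLOYMENT = 6


# Generic tasks per phase, copied from the list above.
TASKS = {
    Phase.BUSINESS_UNDERSTANDING: [
        "determine business objectives", "assess situation",
        "determine data mining goals", "produce project plan",
    ],
    Phase.DATA_UNDERSTANDING: [
        "collect initial data", "describe data",
        "explore data", "verify data quality",
    ],
    Phase.DATA_PREPARATION: [
        "select data", "clean data", "construct data",
        "integrate data", "format data",
    ],
    Phase.MODELING: [
        "select modeling technique", "generate test design",
        "build model", "assess model",
    ],
    Phase.EVALUATION: [
        "evaluate results", "review process", "determine next steps",
    ],
    Phase.DEPLOYMENT: [
        "plan deployment", "plan monitoring and maintenance",
        "produce final report", "review project",
    ],
}


def next_phase(current: Phase) -> Optional[Phase]:
    """Return the phase after `current`, or None once Deployment is reached."""
    if current is Phase.DEPLOYMENT:
        return None
    return Phase(current.value + 1)


def can_loop_back(current: Phase, target: Phase) -> bool:
    """A loop back only makes sense toward a strictly earlier phase.

    Assumption for illustration: this checks validity only. Deciding
    when to loop back is left to the team's coordination framework.
    """
    return target.value < current.value
```

For instance, `next_phase(Phase.MODELING)` returns `Phase.EVALUATION`, and `can_loop_back(Phase.EVALUATION, Phase.DATA_PREPARATION)` is true.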

Team Data Science Process

In 2016, Microsoft introduced another framework that defines a data science life cycle called TDSP (Team Data Science Process). It defines five stages of the data science life cycle:

  1. Business Understanding
  2. Data Acquisition and Understanding
  3. Modeling
  4. Deployment
  5. Customer Acceptance

TDSP defines four project roles (Group Manager, Team Lead, Project Lead, and Individual Contributor). It also defines ten artifacts that are within a specified project stage. In short, TDSP tries to modernize the CRISP-DM phases and introduce some additional structure (e.g., roles).

Step 2: Select a data science coordination framework

Likewise, there are a lot of coordination frameworks. We’ll review three of the more common and more effective ones. These are all agile frameworks that leverage the foundational Agile Principles from the Agile Manifesto.

Kanban

One approach teams use to help coordinate and prioritize their work is Kanban. Kanban helps teams split the work into pieces (each piece is a task). It also enables teams to pull the work as capacity permits (rather than work being pushed into the process when requested).

Kanban provides a set of principles that helps teams be more agile. This agility comes by enabling teams to re-prioritize tasks as needed (based on the results of previous tasks). In short, Kanban’s two main principles are:

  • Visualize the flow – A Kanban board visually represents work via tasks that flow across named columns of increasing work completion.
  • Minimize work-in-progress – Focus on completing tasks in progress. Gain insight via completed tasks (to help prioritize future tasks).
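
To make the two principles concrete, here is a minimal sketch of a board with a work-in-progress limit. The class name `KanbanBoard`, the column names and the example tasks are assumptions for illustration; Kanban itself does not prescribe them. The key point is that `pull` only moves a task into progress when capacity permits.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class KanbanBoard:
    """Hypothetical three-column board with a work-in-progress limit."""
    wip_limit: int
    columns: Dict[str, List[str]] = field(
        default_factory=lambda: {"To Do": [], "In Progress": [], "Done": []}
    )

    def add(self, task: str) -> None:
        """Queue a new task in the first column."""
        self.columns["To Do"].append(task)

    def pull(self) -> Optional[str]:
        """Pull the next task into progress, but only if capacity permits."""
        at_capacity = len(self.columns["In Progress"]) >= self.wip_limit
        if at_capacity or not self.columns["To Do"]:
            return None
        task = self.columns["To Do"].pop(0)
        self.columns["In Progress"].append(task)
        return task

    def complete(self, task: str) -> None:
        """Move a task to Done (raises ValueError if it is not in progress)."""
        self.columns["In Progress"].remove(task)
        self.columns["Done"].append(task)


board = KanbanBoard(wip_limit=2)
for task in ["profile raw data", "train baseline model", "draft stakeholder summary"]:
    board.add(task)

first = board.pull()    # "profile raw data" moves to In Progress
second = board.pull()   # "train baseline model" moves to In Progress
blocked = board.pull()  # None: the WIP limit of 2 has been reached
board.complete(first)
third = board.pull()    # capacity was freed, so the last task can be pulled
```

The order of the "To Do" column stands in for priority here, which is exactly the part Kanban leaves for the team to define.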

However, while useful, Kanban does not define how a team should coordinate and prioritize what needs to be done. So, a team that uses Kanban needs to define additional structure to help them, for example, prioritize tasks.

Scrum

Like CRISP-DM, Scrum was defined in the 1990s. Unlike CRISP-DM, Scrum’s definition is updated regularly; the latest release was published in 2020 on scrumguides.org. Like Kanban, Scrum is a coordination framework. Unlike Kanban, however, Scrum defines how a team prioritizes tasks, and hence helps the team decide when to “loop back”.

In fact, Scrum is the most popular team coordination framework for software development projects. It defines meetings, roles, artifacts and a process to execute iterative, fixed-duration sprints. Therefore, many people naturally think of using Scrum for data science projects.

However, there are several challenges when using Scrum in a data science context. Most notably, it can be very difficult to estimate how long data science tasks will take. This can make sprint definition very challenging.

Data Driven Scrum

A newer framework, Data Driven Scrum (DDS), addresses many of the challenges encountered when using Scrum in a data science context.

DDS leverages some of the key aspects of the original Scrum (such as roles). However, DDS defines an iteration framework that is much more applicable to data science projects. For example, iterations are not time-boxed; rather, each iteration is defined by completing a small set of tasks. Furthermore, each iteration needs to explicitly define create, observe and analyze tasks.
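
The iteration rule described above can be sketched in a few lines. This is an illustration under stated assumptions, not the official DDS definition: the names `TaskKind` and `Iteration` and the example task descriptions are hypothetical. The sketch captures two ideas from the paragraph: an iteration must define create, observe and analyze tasks, and it ends when those tasks are complete rather than when a time box expires.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Set


class TaskKind(Enum):
    CREATE = "create"
    OBSERVE = "observe"
    ANALYZE = "analyze"


@dataclass
class Iteration:
    """Hypothetical iteration that is capability-based, not time-boxed."""
    goal: str
    tasks: Dict[TaskKind, str]
    done: Set[TaskKind] = field(default_factory=set)

    def __post_init__(self) -> None:
        # Every iteration must explicitly define all three kinds of task.
        missing = set(TaskKind) - set(self.tasks)
        if missing:
            names = ", ".join(sorted(kind.value for kind in missing))
            raise ValueError(f"iteration is missing tasks: {names}")

    def complete(self, kind: TaskKind) -> None:
        self.done.add(kind)

    @property
    def finished(self) -> bool:
        # The iteration ends when its tasks are done; no deadline is involved.
        return self.done == set(TaskKind)


iteration = Iteration(
    goal="check whether a baseline model is worth refining",
    tasks={
        TaskKind.CREATE: "build a baseline churn model",
        TaskKind.OBSERVE: "score last month's customers",
        TaskKind.ANALYZE: "compare predictions with actual churn",
    },
)
for kind in TaskKind:
    iteration.complete(kind)
```

After the loop, `iteration.finished` is true, and the team can use what it learned in the analyze task to choose the next iteration.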

Step 3: Integrate the two frameworks

Let’s say the data science team selects a life cycle framework and a coordination framework appropriate for data science. Then the next question is: “How do we integrate these two frameworks?”

Unfortunately, there is not a standard approach to answer this question.

One way to achieve this integration is to define an iteration as one “loop” through the life cycle. An alternative approach is to have each iteration consist of one phase in the project life cycle. Yet another approach is to use the life cycle as a shared vocabulary for describing tasks, so that the focus of the current iteration is clear.
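
The third approach, using the life cycle as a vocabulary, might look like the sketch below. Each task in an iteration is tagged with a CRISP-DM phase name, and the iteration's focus is taken to be the most frequent tag. The function `iteration_focus`, the "most frequent phase" rule and the example tasks are assumptions made for illustration; a team could just as well tag TDSP stages or summarize the focus differently.

```python
from collections import Counter
from typing import List, Tuple

CRISP_DM_PHASES = [
    "Business Understanding", "Data Understanding", "Data Preparation",
    "Modeling", "Evaluation", "Deployment",
]


def iteration_focus(tasks: List[Tuple[str, str]]) -> str:
    """Return the life cycle phase that most tasks in the iteration belong to.

    `tasks` is a list of (description, phase) pairs. If two phases tie,
    the one that appears first in the task list wins.
    """
    if not tasks:
        raise ValueError("an iteration needs at least one task")
    for description, phase in tasks:
        if phase not in CRISP_DM_PHASES:
            raise ValueError(f"unknown phase {phase!r} for task {description!r}")
    counts = Counter(phase for _, phase in tasks)
    return counts.most_common(1)[0][0]


iteration_tasks = [
    ("impute missing ages", "Data Preparation"),
    ("join CRM and billing tables", "Data Preparation"),
    ("fit baseline model", "Modeling"),
]
focus = iteration_focus(iteration_tasks)  # "Data Preparation"
```

Tagging tasks this way keeps the coordination framework in charge of what gets done next, while the life cycle supplies the words the team and stakeholders use to talk about it.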

The best path forward depends on the project and the project team. Note that these approaches could work independently of the team using CRISP-DM, TDSP or a different life cycle framework.

Learn more

This article just touches the surface of explaining what might be an appropriate data science process.

Defining and using an effective agile data science process certainly requires more effort than just reading this post. Interested in understanding this topic in more depth? If so, you could explore becoming a Certified Data Science Team Lead or our consulting services.
