How do you effectively define a data science process?
Conceptually, a data science process explains and defines how a team should execute a project. Having a robust, repeatable process helps to ensure that the project efficiently and effectively delivers actionable insight.
In this article, I’ll explore how to create a well-defined data science process in three steps:
- Select a data science life cycle
- Select a coordination framework
- Integrate the above two
Following these steps will help your team know the data science steps as well as how to effectively coordinate and communicate to deliver better project outcomes.
A process is not just a life cycle
Raj Bandyopadhyay describes his data science process as the conceptual steps he takes to execute a data science project (framing the problem, collecting, processing, exploring and then modeling the data, and finally communicating the results). Similarly, when Chanin Nantasenamat discusses the data science process, he describes a similar data science life cycle (collect, clean, explore, model and deploy). As Chanin explicitly notes, both of these are based on the CRISP-DM data science life cycle framework.
In fact, when I ask people who lead data science teams about their data science process, most will describe a data science life cycle (i.e., their data science workflow – such as first obtaining data, then cleaning the data, and then creating a machine learning model).
However, while defining a life cycle is certainly useful, it is not the same as defining a robust data science team process.
Shortcomings of just a data science life cycle
Indeed, having a well-defined data science life cycle is certainly an important aspect of a team’s process.
But if you talk just about your team’s life cycle (i.e., the team’s data science workflow), you miss a key aspect of the process: namely, how the team should coordinate their work. This is becoming more critical as data science becomes more of a team sport with a diverse set of roles.
While most life cycle frameworks explicitly note that the team might need to “loop back” to a previous phase, these frameworks do not define when (or why or how) the team should loop back to a previous phase.
So, if a data science team just uses a life cycle framework, the team would still need to define how/when to loop back to a previous phase.
Problems if you don’t have a data science process
That is why you also need to define the process of how the team prioritizes work and communicates information across the project team (which I refer to as the “data science collaboration process”). Without effective communication across the team, your stakeholder might think that:
- The model/insight generated is not useful (or they don’t trust the data and/or the model).
- The data science team is not productive (because the stakeholders do not understand what is required to do a full machine learning project).
- The data science team is not focused on the highest priority tasks (because there is not a clear way for the stakeholders to coordinate and collaborate with the data science team).
In many ways, the process data science teams use is similar to how software teams were led 30 years ago – teams focus on what to do, but not how to do it.
Benefits of a data science process
While it does take some time and energy, defining a robust data science process is a worthwhile effort. I have seen firsthand that by addressing three critical questions (discussed below), one can lead a data science team more efficiently and effectively.
This improvement is driven by the fact that the data science team will have a common vocabulary (within the team and with stakeholders) with respect to the work that needs to get done to, for example, implement a machine learning model. It will also provide a way to more easily discuss with stakeholders how to prioritize potential efforts as well as how to ensure the insights generated from the machine learning models are actionable by the client organization.
Steps to define a data science process
So, to help a team define an effective data science process, the rest of this blog addresses these three key questions:
- What data science life cycle (data science workflow process) might a team use during a project?
- What framework could be used to help teams improve how they work together?
- How should a data science team integrate their data science life cycle framework with their data science coordination framework?
Step 1: Select a data science life cycle
There are many data science life cycles (workflows). They tend to be similar in defining the data science steps necessary to deliver a project. We’ll review CRISP-DM and TDSP — two of the most prominent. Read my other post for additional data science workflows.
CRISP-DM
Defined in the 1990s, CRISP-DM (Cross-Industry Standard Process for Data Mining) describes six phases:
- Business Understanding: determine business objectives; assess situation; determine data mining goals; produce project plan
- Data Understanding: collect initial data; describe data; explore data; verify data quality
- Data Preparation (generally, the most time-consuming phase): select data; clean data; construct data; integrate data; format data
- Modeling: select modeling technique; generate test design; build model; assess model
- Evaluation: evaluate results; review process; determine next steps
- Deployment: plan deployment; plan monitoring and maintenance; produce final report; review project
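As a rough illustration, the six phases and their tasks can be represented as a simple ordered mapping. The phase and task names below come straight from the CRISP-DM list above; the data structure and helper function are just a sketch, not part of CRISP-DM itself:

```python
# A minimal sketch: CRISP-DM phases and their tasks as an ordered mapping.
# Dicts preserve insertion order in Python 3.7+, so iterating over the keys
# yields the phases in life cycle order.
CRISP_DM = {
    "Business Understanding": [
        "determine business objectives", "assess situation",
        "determine data mining goals", "produce project plan",
    ],
    "Data Understanding": [
        "collect initial data", "describe data",
        "explore data", "verify data quality",
    ],
    "Data Preparation": [
        "select data", "clean data", "construct data",
        "integrate data", "format data",
    ],
    "Modeling": [
        "select modeling technique", "generate test design",
        "build model", "assess model",
    ],
    "Evaluation": ["evaluate results", "review process", "determine next steps"],
    "Deployment": [
        "plan deployment", "plan monitoring and maintenance",
        "produce final report", "review project",
    ],
}

def next_phase(current):
    """Return the phase that follows `current`, or None at the end."""
    phases = list(CRISP_DM)
    i = phases.index(current)
    return phases[i + 1] if i + 1 < len(phases) else None
```

Note that a structure like this only captures the nominal forward flow; as discussed below, the life cycle alone does not tell the team when to loop back.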
Team Data Science Process
In 2016, Microsoft introduced another framework that defines a data science life cycle called TDSP (Team Data Science Process). It defines five stages of the data science life cycle:
- Business Understanding
- Data Acquisition and Understanding
- Modeling
- Deployment
- Customer Acceptance
TDSP also defines four project roles (Group Manager, Team Lead, Project Lead, and Individual Contributor) and ten artifacts to be completed within a specified project stage. In short, TDSP tries to modernize the CRISP-DM phases and introduce some additional structure (e.g., roles).
Step 2: Select a data science coordination framework
Likewise, there are a lot of coordination frameworks. We’ll review three of the more common and more effective ones. These are all agile frameworks that leverage the foundational Agile Principles from the Agile Manifesto.
Kanban
One approach teams use to help coordinate and prioritize their work is Kanban, which helps teams split the work into pieces (each piece is a task) and then pull work as capacity permits (rather than work being pushed into the process when requested).
Kanban provides a set of principles that helps teams be more agile by reducing their work-in-progress and enabling teams to re-prioritize tasks as needed (based on the results of previous tasks). In short, Kanban’s two main principles are:
- Visualize the flow – A Kanban board visually represents work via tasks that flow across named columns of increasing work completion
- Minimize work-in-progress – Focus on completing tasks in progress, so that insight can be gained via completed tasks (to inform what might be useful future tasks)
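These two principles can be sketched in code. The column names and WIP limit below are illustrative choices, not part of Kanban itself:

```python
# A minimal sketch of the two Kanban principles: named columns visualize the
# flow of tasks, and a work-in-progress (WIP) limit forces the team to finish
# tasks before pulling new ones. Column names and the limit are illustrative.
class KanbanBoard:
    def __init__(self, wip_limit=3):
        self.columns = {"To Do": [], "In Progress": [], "Done": []}
        self.wip_limit = wip_limit

    def add_task(self, task):
        self.columns["To Do"].append(task)

    def pull_task(self):
        """Pull the next task only if capacity permits (WIP limit not hit)."""
        if len(self.columns["In Progress"]) >= self.wip_limit:
            return None  # focus on finishing work in progress first
        if not self.columns["To Do"]:
            return None
        task = self.columns["To Do"].pop(0)
        self.columns["In Progress"].append(task)
        return task

    def complete_task(self, task):
        self.columns["In Progress"].remove(task)
        self.columns["Done"].append(task)
```

Notice that nothing in this sketch says *which* task the team should pull next; that prioritization gap is exactly the point made below.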
However, while useful, Kanban does not define how a team might coordinate and prioritize what is to be done. So, a team that uses Kanban needs to define additional structure to help them, for example, prioritize tasks.
Scrum
Like CRISP-DM, Scrum was defined in the 1990s. Unlike CRISP-DM, Scrum’s definition is updated frequently, with the most recent release in late 2020 on scrumguides.org. Like Kanban, Scrum defines a coordination framework (i.e., how a team prioritizes tasks, and hence, helps them decide when to “loop back”).
In fact, Scrum is the most popular team coordination framework for software development projects, so many people naturally think of using Scrum for data science projects. Specifically, Scrum defines meetings, roles, artifacts and a process for executing iterative, fixed-duration sprints.
However, there are several challenges when using Scrum in a data science context. Most notably, it can be very difficult to estimate how long data science tasks will take, which makes sprint definition very challenging.
Data Driven Scrum
Data Driven Scrum (DDS) is a newer framework that addresses many of the challenges encountered when using Scrum.
DDS leverages some of the key aspects of the original Scrum framework (such as roles), but defines an iteration framework that is much more applicable to data science projects. For example, iterations are not time-boxed; rather, each is defined by a small set of tasks (often an experiment or hypothesis), which includes create, observe and analyze tasks.
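As a rough sketch (the class and field names below are mine, not from the DDS definition), a DDS iteration can be thought of as a small set of create, observe and analyze tasks that is complete when every task is done, rather than when a time box expires:

```python
# A hedged sketch of a Data Driven Scrum iteration: defined by a small set of
# create / observe / analyze tasks (often one experiment or hypothesis), not
# by a time box. Class and field names are illustrative, not from the DDS spec.
from dataclasses import dataclass, field

@dataclass
class DDSIteration:
    hypothesis: str
    create: list = field(default_factory=list)   # e.g., build a candidate model
    observe: list = field(default_factory=list)  # e.g., measure its accuracy
    analyze: list = field(default_factory=list)  # e.g., decide the next experiment
    done: set = field(default_factory=set)

    def complete(self, task):
        self.done.add(task)

    def is_finished(self):
        # The iteration ends when every task is done -- there is no fixed duration.
        return self.done >= set(self.create + self.observe + self.analyze)
```

The analyze step is where the team decides whether to “loop back” to an earlier life cycle phase or move forward.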
Step 3: Integrate the two frameworks
Let’s say the data science team or the data science team leader selects a life cycle framework and an appropriate data science coordination framework. Then the next question is: “How do we integrate these two frameworks?”
Unfortunately, there is not a standard approach to answer this question.
One way to achieve this integration is defining an iteration to be one “loop” through the life cycle (phases of a project). An alternative approach is to have an iteration be comprised of one phase in the project life cycle. Yet another approach is to use the life cycle as a vocabulary, such that it’s clear which phase (or phases) the current iteration is focused on progressing.
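One lightweight way to implement the “life cycle as vocabulary” approach is simply to tag each iteration with the phase or phases it advances. The sketch below uses CRISP-DM phase names; the function, field names and example iterations are purely illustrative:

```python
# A minimal sketch of the "life cycle as vocabulary" integration: each
# iteration records which life cycle phase(s) it is progressing, giving the
# team and stakeholders a shared vocabulary. Phase names are CRISP-DM's;
# everything else here is illustrative.
CRISP_DM_PHASES = [
    "Business Understanding", "Data Understanding", "Data Preparation",
    "Modeling", "Evaluation", "Deployment",
]

def tag_iteration(description, phases):
    """Attach life cycle phases to an iteration, validating the phase names."""
    unknown = [p for p in phases if p not in CRISP_DM_PHASES]
    if unknown:
        raise ValueError(f"Unknown life cycle phase(s): {unknown}")
    return {"description": description, "phases": phases}

# A hypothetical iteration backlog: one iteration can span multiple phases.
backlog = [
    tag_iteration("profile the raw customer data", ["Data Understanding"]),
    tag_iteration("clean data and fit a baseline model",
                  ["Data Preparation", "Modeling"]),
]
```

The same tagging idea works unchanged with TDSP stages or any other life cycle vocabulary the team adopts.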
The best path forward depends on the project and the project team. But in any event, it should be noted that these approaches could work independently of the team using CRISP-DM, TDSP or a different life cycle framework.
This article only scratches the surface of defining an appropriate data science process for a data science team.
Defining and using an effective agile data science process certainly takes more time, effort and knowledge than just reading this post. If you are interested in understanding this topic in more depth, you could explore becoming a Certified Data Science Team Lead through the Data Science Process Alliance.