In this article, I’ll explore how to create a well-defined data science process.
Specifically, I’ll explain how a team’s data science process comprises (1) the team’s data science life cycle framework (i.e., the team’s data science process workflow), (2) the team’s coordination framework, and (3) the integration of these two frameworks.
Read on for a quick overview of how to define a data science process.
A data science process is not just a data science life cycle
When I ask people who lead data science teams about their data science process, many will describe a data science life cycle (i.e., their data science process workflow – such as first obtaining data, then cleaning the data, and then creating a machine learning model). Others give a vague answer about “working as a team to get the work done”.
However, while defining a life cycle is certainly useful, defining a life cycle is not the same as defining a robust data science team process.
In other words, while a well-defined data science life cycle is certainly an important aspect of a team’s process, talking only about the team’s life cycle (i.e., the team’s data science workflow) misses a key aspect of the process: how the team should coordinate its work.
Defining a collaboration process
While most life cycle frameworks explicitly note that the team might need to “loop back” to a previous phase, these frameworks do not define when (or why or how) the team should loop back to a previous phase. So, if a data science team just uses a life cycle framework, the team itself would still need to define how / when to loop back to a previous phase.
That is why it is also important to define how the team prioritizes work and communicates information across the project team (which I refer to as the “data science collaboration process”). Without an effective way to communicate across the team, groups often hear that their stakeholders / clients think that:
- The model/insight generated is not useful (or they don’t trust the data and/or the model).
- The data science team is not productive (because the stakeholders do not understand what is required to do a full machine learning project).
- The data science team is not focused on the highest-priority tasks (because there is not a clear way for the stakeholders to coordinate and collaborate with the data science team).
In many ways, the process data science teams use is similar to how software teams were led 30 years ago – teams focus on what to do, but not how to do it.
So, to help a team define an effective data science process, the rest of this blog addresses these three key questions:
- What data science life cycle (data science workflow process) might a team use during a project?
- What framework could be used to help teams improve how they work together?
- How should a data science team integrate their data science life cycle framework with their data science coordination framework?
1. Select a data science life cycle
(data science process workflow)
CRISP-DM, which was designed in the 1990s, is the most commonly used framework for describing the steps in a data science project. It defines 6 phases of a project:
- Business Understanding: determine business objectives; assess situation; determine data mining goals; produce project plan
- Data Understanding: collect initial data; describe data; explore data; verify data quality
- Data Preparation (generally, the most time-consuming phase): select data; clean data; construct data; integrate data; format data
- Modeling: select modeling technique; generate test design; build model; assess model
- Evaluation: evaluate results; review process; determine next steps
- Deployment: plan deployment; plan monitoring and maintenance; produce final report; review project
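The phases and sub-tasks above can be pictured as an ordered checklist. The sketch below is purely illustrative (the phase and task names come from CRISP-DM, but the data structure and helper function are my own invention, not part of the framework):

```python
# Illustrative sketch: CRISP-DM's six phases and their sub-tasks as an
# ordered checklist. The names come from CRISP-DM; the code is hypothetical.
CRISP_DM_PHASES = {
    "Business Understanding": ["determine business objectives", "assess situation",
                               "determine data mining goals", "produce project plan"],
    "Data Understanding": ["collect initial data", "describe data",
                           "explore data", "verify data quality"],
    "Data Preparation": ["select data", "clean data", "construct data",
                         "integrate data", "format data"],
    "Modeling": ["select modeling technique", "generate test design",
                 "build model", "assess model"],
    "Evaluation": ["evaluate results", "review process", "determine next steps"],
    "Deployment": ["plan deployment", "plan monitoring and maintenance",
                   "produce final report", "review project"],
}

def next_phase(current):
    """Return the phase that follows `current`, or None after Deployment."""
    phases = list(CRISP_DM_PHASES)
    i = phases.index(current)
    return phases[i + 1] if i + 1 < len(phases) else None
```

Note that a structure like this only captures the forward sequence of phases; as discussed above, deciding when to loop back to an earlier phase is exactly what the life cycle framework leaves undefined.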
Another framework that defines a data science life cycle is TDSP (Team Data Science Process), which was introduced by Microsoft in 2016. It defines 5 stages of the data science life cycle (Business Understanding, Data Acquisition and Understanding, Modeling, Deployment, Customer Acceptance), 4 project roles (Group Manager, Team Lead, Project Lead, and Individual Contributor) and 10 artifacts to be completed within a specified project stage. In short, TDSP tries to modernize the CRISP-DM phases and introduce some additional structure (e.g., roles).
There are many other life cycle frameworks, but most of these (such as Domino Data Lab’s framework) are fairly similar in nature – that is, they describe the steps in a data science project.
2. Select a data science coordination framework
One approach teams use to help coordinate and prioritize their work is Kanban, which helps teams split the work into pieces (each piece is a task) and then pull work as capacity permits (rather than having work pushed into the process when requested). Kanban provides a set of principles that helps teams be more agile by reducing their work-in-progress and enabling teams to re-prioritize tasks as needed (based on the results of previous tasks). In short, Kanban’s two main principles are:
- Visualize the flow – A Kanban board visually represents work via tasks that flow across named columns of increasing work completion
- Minimize work-in-progress – Focus on completing tasks in progress, so that insight can be gained via completed tasks (to inform what might be useful future tasks)
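These two principles can be sketched as a minimal board with a work-in-progress limit. Everything here is an assumption for illustration (the column names, the default WIP limit of 3, and the method names are hypothetical, not part of any Kanban standard):

```python
# Minimal Kanban board sketch (illustrative only): tasks flow across named
# columns, and new work is pulled only when capacity permits.
class KanbanBoard:
    def __init__(self, wip_limit=3):
        # Columns of increasing work completion (principle 1: visualize the flow)
        self.columns = {"To Do": [], "In Progress": [], "Done": []}
        self.wip_limit = wip_limit

    def add_task(self, task):
        self.columns["To Do"].append(task)

    def pull_task(self):
        """Pull the next task only if under the WIP limit (principle 2)."""
        if not self.columns["To Do"]:
            return None
        if len(self.columns["In Progress"]) >= self.wip_limit:
            return None  # at the WIP limit: finish something before pulling more
        task = self.columns["To Do"].pop(0)
        self.columns["In Progress"].append(task)
        return task

    def complete_task(self, task):
        self.columns["In Progress"].remove(task)
        self.columns["Done"].append(task)
```

With a WIP limit of 1, for example, `pull_task` refuses to start a second task until the first is completed – forcing the team to finish work (and gain insight from it) before taking on more.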
However, while useful, Kanban does not define how a team should coordinate and prioritize what needs to be done. So, a team that uses Kanban still needs to define additional structure to help the team, for example, prioritize tasks.
Agile coordination frameworks
Scrum, which like CRISP-DM was defined in the 1990s, does define a coordination framework (i.e., how a team prioritizes tasks, and hence, helps the team decide when to “loop back”). In fact, Scrum is the most popular team coordination framework for software development projects, and so, many people naturally think of using Scrum for data science projects. For example, Scrum defines meetings, roles, artifacts and a process to execute iterative fixed-duration sprints. However, there are several challenges when using Scrum in a data science context – for example, it can be very difficult to estimate how long data science tasks will take, which makes defining what goes into a sprint very challenging.
Data Driven Scrum (DDS) is a newer framework that addresses many of the challenges encountered when using Scrum. DDS leverages some key aspects of the original Scrum framework (such as roles) but defines an iteration structure that is much more applicable to data science projects. For example, iterations are not time-boxed; rather, each iteration is defined by a small set of tasks (often an experiment or hypothesis) that moves through create, observe and analyze steps.
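One way to picture a DDS-style iteration – a small task set rather than a time box – is as an experiment that progresses through its create, observe and analyze steps. This is a hypothetical sketch (the step names are from DDS, but DDS itself does not prescribe any code, and the class design is mine):

```python
# Illustrative DDS-style iteration: defined by a small set of steps
# (create / observe / analyze), not by a fixed duration.
class Iteration:
    STEPS = ("create", "observe", "analyze")

    def __init__(self, hypothesis):
        self.hypothesis = hypothesis  # e.g., the experiment being run
        self.completed = []

    def complete_step(self, step):
        """Mark the next step done; steps must proceed in order."""
        expected = Iteration.STEPS[len(self.completed)]
        if step != expected:
            raise ValueError(f"expected step {expected!r}, got {step!r}")
        self.completed.append(step)

    @property
    def done(self):
        # The iteration ends when the task set is complete -
        # however long (or short) that takes.
        return len(self.completed) == len(Iteration.STEPS)
```

Because the iteration ends when its analyze step is done (not when a clock runs out), the team’s findings from one experiment can directly shape what the next iteration should try.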
3. Integrate the team’s life cycle and coordination framework
If a data science team (or the data science team leader) selects a life cycle framework as well as a coordination framework appropriate for data science, an obvious question arises: “how do we integrate these two frameworks?”
One way to achieve this integration is to define an iteration as one “loop” through the life cycle phases of a project. An alternative approach is to have each iteration comprise a single phase of the project life cycle.
Either of these approaches could work independent of the team using CRISP-DM, TDSP or any other life cycle framework.
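To make the two options concrete, here is a hypothetical sketch in which an iteration is either one full loop through all life cycle phases or a single phase (the phase names happen to be CRISP-DM’s, but any life cycle framework’s phases could be substituted; the function names are my own):

```python
# Illustrative sketch of the two integration approaches. Each iteration is
# represented as a list of the life cycle phases it covers.
PHASES = ["Business Understanding", "Data Understanding", "Data Preparation",
          "Modeling", "Evaluation", "Deployment"]

def iterations_as_full_loops(phases=PHASES):
    """Option 1: one iteration = one loop through every life cycle phase."""
    return [list(phases)]  # a single iteration containing all phases

def iterations_per_phase(phases=PHASES):
    """Option 2: each iteration covers exactly one life cycle phase."""
    return [[phase] for phase in phases]
```

Either way, the coordination framework (e.g., DDS) governs how each iteration is prioritized and reviewed, while the life cycle framework supplies the content of the iteration.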
This article just touches the surface of explaining what might be an appropriate data science process for a data science team. Defining and using an effective agile data science process certainly takes more time, effort and knowledge than just reading this post. If you are interested in understanding this topic in more depth, you could explore becoming a certified data science team lead.
While it does take some time and energy, defining a robust data science process is a worthwhile effort. I have seen firsthand that by addressing these three critical questions, one can lead a data science team more efficiently and effectively. This improvement is driven by the fact that the data science team will have a common vocabulary (within the team and with stakeholders) with respect to the work that needs to get done to, for example, implement a machine learning model. It will also provide a way to more easily discuss with stakeholders how to prioritize potential efforts as well as how to ensure the insights generated from the machine learning models are actionable by the client organization.
Possible next steps
A good next step is to select a life cycle framework and a coordination framework, decide how the two will be integrated, and then refine that combined process with your team across several projects.