A data science life cycle is an iterative set of steps you take to deliver a data science project or product. Because every data science project and team are different, every specific data science life cycle is different. However, most data science projects tend to flow through the same general life cycle.
With help from our friends at ACME, Inc, we’ll illustrate how a hypothetical project progresses through a typical data science life cycle framework.
A General Data Science Life Cycle
Some data science life cycles narrowly focus on just the data, modeling, and assessment steps. Others are more comprehensive and start with business understanding and end with deployment.
The one we’ll walk through is more extensive still: it also covers operations, and it emphasizes agility more than other life cycles.
This life cycle has five steps:
- Problem Definition
- Data Investigation and Cleaning
- Minimal Viable Model
- Deployment and Enhancements
- Data Science Ops
These are not linear steps. You will start with step one and then proceed to step two. However, from there, you should naturally flow among the steps as necessary.
Several small iterative steps are better than a few larger comprehensive phases.
Data Science Life Cycle Case Study at ACME
In the ACME sections of this post, we’ll follow a story that demonstrates this life cycle.
Imagine that you are the Director of the Data Science Center of Excellence at ACME, Inc — a large anvil manufacturing company. A Human Resources Director asks if you can help reduce the company’s spiraling recruitment costs.
Let’s see if you can. But first, let’s define the problem.
Data Science Life Cycle Training
Given the requests Jeff and I have had to provide training, we’ve helped launch the Data Science Process Alliance. Learn about the individual data science certifications or for a brief life cycle overview, read on…
I. Problem Definition
What does this Phase Accomplish?
Just like any good business or IT-focused life cycle, a good data science life cycle starts with “why”. If you’re asking “Why start with why?” (good for you), then read the top point in How to Lead Data Science Teams.
Generally, the project lead or product manager manages this phase. Regardless, this initial phase should:
- State clearly the problem to be solved and why
- Motivate everyone involved to push toward this why
- Define the potential value of the forthcoming project
- Identify the project risks including ethical considerations
- Identify the key stakeholders
- Align the stakeholders with the data science team
- Research related high-level information
- Assess the resources (people and infrastructure) you’ll likely need
- Develop and communicate a high-level, flexible project plan
- Identify the type of problem being solved*
- Get buy-in for the project
What’s Unique in this Life Cycle?
Not much. Does the above list sound generic? Yup. Kicking off a data science project should focus on many of the best practices from general project management. However, there are distinct nuances as described in 10 Questions to Ask Before Starting a Data Science Project.
I’ll highlight just one often overlooked point.
*Identify the type of problem being solved. Many assume that advanced data science methods are the solution. This is often not the case. Therefore a key question you should ask throughout the project (and especially in the early phases) is: “Is the problem best solved by machine learning or something else?” Check out Matt’s post on The Limitations of Machine Learning to dive deeper into this.
Starting the Data Science Life Cycle at ACME
Back to ACME. You:
- Gather the request from the HR Director and her staff
- Take opinions from other company leaders
- Get advice from some recruitment firms
- Read several relevant business journals
- Analyze some of the company’s managerial financial reports
Within a few days, you hypothesize that the underlying problem is not with recruitment costs directly. Rather, the best way to reduce recruitment costs is to reduce existing employee churn.
You get buy-in to develop a model that predicts who is likely to leave your company. With an early warning, managers can intervene, which will in turn lower employee churn, decrease recruitment costs, and provide many other benefits.
II. Data Investigation and Cleaning
Without data, you’ve got nothing. Therefore, the team needs to identify what data is needed to solve the underlying problem. Then determine how to get the data:
- Is the data internally available? -> Get access to it
- Is the data readily collectable? -> Start capturing it
- Is the data available for purchase? -> Buy it
Once you have the data, start exploring it. Your data scientists or business/data analysts will lead several activities such as:
- Document the data quality
- Clean the data
- Combine various data sets to create new views
- Load the data into the target location (often to a cloud platform)
- Visualize the data
- Present initial findings to stakeholders and solicit feedback*
What’s Unique in this Life Cycle?
Nearly all data science life cycle frameworks include the points above, except the last one, which is often left out. It’s key, though.
*By presenting initial findings in the form of descriptive statistics, interesting insights, a report, or a dashboard, stakeholders are able to provide further context on things that the data team might have overlooked.
Just as importantly, the stakeholders learn more about what is going on and can help reframe the business problem to focus on existing or new key points.
After all, whatever is defined in the “Problem Definition” phase isn’t golden. It’s a starting point. And your team should pivot as early as possible if necessary.
Don’t underestimate the time required for these steps (particularly data cleaning). But without this, you’ll fall victim to “garbage-in-garbage-out”.
However, don’t go overboard. If you spend too much time in this phase, you’re investing a lot of time toward a project without proven value.
Data Investigation at ACME
Fortunately, the company has most of the needed data available. Unfortunately, the data sets live in several disparate systems, many of which your team needs HR signoff to access. This process is frustratingly slow. So as not to waste time, your data scientist and data analyst identify five useful data sets they can already access.
Using Python notebooks, they:
- combine these data sets
- calculate various descriptive statistics
- search for outliers
- impute missing values (i.e. take educated guesses for null values)
- and ask what is unique about the employees who churn?
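The notebook steps above might look roughly like this. It is a minimal, dependency-free sketch: the records, the field names (`employee_id`, `tenure_years`, `salary`, `churned`), the 1.5 × IQR outlier rule, and median imputation are all illustrative assumptions, not ACME's actual data or method. A real notebook would more likely use pandas.

```python
from statistics import mean, median, quantiles

# Hypothetical extracts from two HR systems, keyed by employee_id.
profiles = [
    {"employee_id": 1, "tenure_years": 2.0, "churned": True},
    {"employee_id": 2, "tenure_years": 3.0, "churned": False},
    {"employee_id": 3, "tenure_years": None, "churned": False},
    {"employee_id": 4, "tenure_years": 2.5, "churned": True},
    {"employee_id": 5, "tenure_years": 4.0, "churned": False},
    {"employee_id": 6, "tenure_years": 40.0, "churned": False},
    {"employee_id": 7, "tenure_years": 3.5, "churned": False},
]
salaries = {1: 48000, 2: 61000, 3: 55000, 4: 47000, 5: 58000, 6: 90000, 7: 52000}

# 1. Combine the data sets on the shared key.
combined = [{**p, "salary": salaries.get(p["employee_id"])} for p in profiles]

# 2. Descriptive statistics on the known tenure values.
known = [r["tenure_years"] for r in combined if r["tenure_years"] is not None]
tenure_median = median(known)

# 3. Flag outliers with the 1.5 * IQR rule (one common convention).
q1, _, q3 = quantiles(known, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outlier_ids = [
    r["employee_id"]
    for r in combined
    if r["tenure_years"] is not None and not low <= r["tenure_years"] <= high
]

# 4. Impute missing tenure with the median (an educated guess).
for r in combined:
    if r["tenure_years"] is None:
        r["tenure_years"] = tenure_median

# 5. Ask what is different about the employees who churn.
def avg_tenure(churned):
    return mean(r["tenure_years"] for r in combined if r["churned"] == churned)

print("median tenure:", tenure_median)
print("outliers:", outlier_ids)
print("avg tenure, churned vs. stayed:", avg_tenure(True), avg_tenure(False))
```

Note that the flagged outlier still inflates the average tenure of the employees who stayed, which is exactly the kind of thing worth raising with stakeholders before modeling.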
Then, to get the HR stakeholders to understand what they’re looking at, your business analyst builds a Tableau dashboard.
Through conversations, the HR stakeholders and your team uncover some interesting insights. Most notably, turnover is particularly high in ACME’s Dynamite Division, and these positions have high recruitment costs.
Your team also suspects that the factors that lead to involuntary churn are very different from those behind voluntary churn. And by focusing on voluntary churn, they can simplify the model.
Therefore you and the stakeholders agree that the initial model should predict voluntary employee churn in the Dynamite Division.
III. Minimal Viable Model
All data science life cycle frameworks have some sort of modeling phase. However, I want to emphasize the importance of getting something useful out as quickly as possible. This concept borrows from the idea of a Minimal Viable Product.
What is a Minimal Viable Product?
Eric Ries popularized this concept in The Lean Startup:
“The minimum viable product is that version of a new product which allows a team to collect the maximum amount of validated learning about customers with the least effort.” - Eric Ries
Or more simply — don’t start out building a full-fledged product and then launch. Rather, get out something of value and receive feedback about whether it is on the right track. And if not, shift directions.
What is a Minimal Viable Model?
Extending this concept to modeling, I define:
“The minimal viable model is the version of a new model which allows a team to collect the maximum amount of validated learning about the model’s effectiveness with the least effort”
Breaking this down:
- Minimal: The model is narrowly focused. It is not the best possible model but is sufficient enough to make a measurable impact on a subset of the overall problem.
- Collect the maximum amount of validated learning about the model’s effectiveness: Develop a hypothesis and test it. This validated learning confirms or denies your team’s initial hypotheses. It has two main parts:
  - Is the model technically performing better than the baseline?
  - Is the model able to make a meaningful impact on the underlying business problem?
- Least effort: Full-fledged deployments are typically costly and time-consuming. Therefore, find the simplest way to get the model out.
How do you build the Minimal Viable Model?
There are too many modeling approaches to cover here but they can broadly fall into two phases.
1. “Lab” Validation: Does the model work in a controlled environment?
For example, split historical data into training, validation, and test sets. Train algorithms on the training data, measure their effectiveness on the validation set, and do a final check on the test data.
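A three-way split of this kind can be sketched in a few lines. The function name, the 60/20/20 proportions, and the fixed seed are my own assumptions for illustration. Libraries such as scikit-learn offer ready-made helpers for this.

```python
import random

def three_way_split(rows, train_frac=0.6, val_frac=0.2, seed=42):
    """Shuffle rows and split them into training, validation, and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]                 # fit candidate models here
    val = shuffled[n_train:n_train + n_val]    # compare and tune candidates here
    test = shuffled[n_train + n_val:]          # touch once, for the final check
    return train, val, test

rows = list(range(100))  # stand-in for 100 historical records
train, val, test = three_way_split(rows)
print(len(train), len(val), len(test))
```

The point of keeping the test set untouched until the end is that the validation set gets "used up" each time you pick a model based on it.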
After probably several attempts, comes the trickier part…
2. “In the Wild” Validation: Just because the data scientists validate a model on their computers or in the cloud, doesn’t mean it will work in the wild. Rather, see how the model actually performs in the real world.
You might see how the model performs on current data, without actually intervening in the underlying problem. Or you might do a simple deployment to the stakeholders. This could be a basic Shiny or Flask app, a lookup tool in Excel, or a daily report.
Often you split subjects into two groups — A test group (impacted by the model) and a control group (not impacted). Then measure whether those in the test group perform differently from those in the control group.
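One common way to check whether the two groups really differ is a two-proportion z-test. The sketch below assumes a yes/no outcome (such as "left the company") and uses made-up counts. It is one reasonable test among several, not a prescribed method.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(events_a, n_a, events_b, n_b):
    """Return (z, two-sided p-value) for the difference between two rates."""
    p_a, p_b = events_a / n_a, events_b / n_b
    pooled = (events_a + events_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical counts: 30 of 500 in the test group left vs. 55 of 500 controls.
z, p_value = two_proportion_z_test(30, 500, 55, 500)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A small p-value suggests the difference is unlikely to be chance alone. Whether the difference is large enough to matter to the business is a separate question.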
Regardless of how the model is developed, the idea is to vet out three things:
- Does the model perform better than baseline performance?
- Is it worthwhile to deploy the model in production?
- Are there any unintended consequences?
Possible next steps include:
- Shift the model’s focus onto a different topic (i.e. loop back to “Problem Definition”)
- Add in more or cleaner data sets (i.e. loop back to “Data Investigation and Cleaning”)
- Experiment with different learning algorithms (i.e. re-start the “Minimal Viable Model” phase)
- Deploy and enhance the model (i.e. proceed to the next phase)
- Or if things just aren’t looking good after several iterations, cancel the project and see if you can provide value in some other way.
The ACME Minimal Viable Model
At ACME, your data scientists get busy modeling voluntary employee churn in the Dynamite Division.
Although they got access to the additional requested data sets, they would need weeks to clean up and integrate these with the existing five data sets. To get something out quickly, they proceed just with the existing five sets.
They randomly split the data into three groups: a training group, a validation group, and a test group. Then, they train a model on the employees of the training group by using data up until six months ago. And then, they see if the model can accurately predict voluntary churn over the previous six months.
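The "pretend it is six months ago" setup can be sketched as follows. The field names (`hire_date`, `exit_date`), the 182-day horizon, and the sample employees are hypothetical, and for brevity the sketch does not separate voluntary from involuntary exits.

```python
from datetime import date, timedelta

def backtest_labels(employees, today, horizon_days=182):
    """Label who left within the horizon, among staff employed at the cutoff."""
    cutoff = today - timedelta(days=horizon_days)
    labels = {}
    for emp in employees:
        left = emp["exit_date"]
        employed_at_cutoff = emp["hire_date"] <= cutoff and (
            left is None or left > cutoff
        )
        if employed_at_cutoff:
            # True if the employee left between the cutoff and today.
            labels[emp["employee_id"]] = left is not None and left <= today
    return cutoff, labels

employees = [
    {"employee_id": 1, "hire_date": date(2018, 3, 1), "exit_date": None},
    {"employee_id": 2, "hire_date": date(2019, 6, 1), "exit_date": date(2023, 11, 15)},
    {"employee_id": 3, "hire_date": date(2017, 1, 9), "exit_date": date(2023, 2, 1)},
    {"employee_id": 4, "hire_date": date(2023, 10, 1), "exit_date": None},
]
cutoff, labels = backtest_labels(employees, today=date(2024, 1, 1))
print(cutoff, labels)
```

Model features would then be built only from data available on or before the cutoff, so the model never "sees" the period it is asked to predict.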
The best model performs significantly better (statistically and managerially) than the baseline model, which an HR analyst ran in Excel.
To test it in the wild, the data scientists take the most current data and score each current employee in the Dynamite Division with a likelihood to churn. Instead of waiting six months, they find sufficient evidence after just three months that the model is doing its job predicting voluntary churn.
The HR Director is impressed! You have now validated your model in practice. Now it’s time to deploy the model in production and to enhance its performance.
IV. Deployment and Enhancements
Many data science life cycles include “Deployment” or a similar term. This step creates the delivery mechanism you need to get the model out to the users or to another system. It’s key because:
“No machine learning model is valuable, unless it’s deployed to production.”Luigi Patruno at ML in Production
This step means a lot of different things for different projects. It could be as simple as getting your model output in a Tableau dashboard. Or as complex as scaling it to the cloud to millions of users.
Any short-cuts taken in the earlier minimal viable model phase are upgraded to production-grade systems.
Typically the more “engineering-focused” team members such as data engineers, cloud engineers, machine learning engineers, application developers, and quality assurance engineers execute this phase.
The “Enhancements” portion is not as common in other data science life cycle frameworks. But I’m trying to emphasize the importance of getting something basic out and then improving it.
Recall that the previous step delivered the “Minimal Viable Model”. While great as a starting point, the model probably isn’t as good as it should be. So use the time that the engineers need to deliver the model to improve the models. Conceptually, this “Enhancements” phase means to:
- Extend the model to similar use cases (i.e. a new “Problem Definition” phase)
- Add and clean data sets (i.e. a new “Data Investigation and Cleaning” phase)
- Try new modeling techniques (i.e. developing the next “Viable Model”)
Deployment and Enhancements at ACME
Together, you, the HR systems engineers, and the HR stakeholders identify that the best way to get the model out is to build an application programming interface (API). The API integrates with the HR systems to communicate the likelihood of each employee churning in their human resources profile.
Because an employee’s churn score typically does not fluctuate much hour-to-hour, you all agree to score each employee every night. Your cloud engineer gets to work to make this happen in coordination with a software lead from the HR systems team.
Meanwhile, the HR Director says that a company-wide model is more important than improving the model specific to just the Dynamite Division.
Therefore, the data scientists start experimenting with a few new modeling techniques to generalize the model to the broader universe of employees. Meanwhile, your data engineer starts to integrate new data sets that the scientists can use.
The results are promising and, after some additional software testing, the HR Director gives the approval to turn the APIs on for the human resources systems.
V. Data Science Ops
Most other data science life cycles end with a Deployment phase or even before that with Assessment.
However, as data science matures into mainstream operations, companies need to take a stronger product focus that includes plans to maintain the deployed systems long-term. There are three major overlapping facets to manage: software system maintenance, model and data management, and on-going stakeholder management.
A productized data science solution ultimately sits as part of a broader software system. And like all software systems, the solution needs to be maintained. Common practices include:
- Maintaining the various system environments
- Managing access control
- Triggering alert notifications for serious incidents
- Executing test scripts with every new deployment
- Meeting service level agreements (SLAs)
- Implementing security patches
To help bridge the development to production gap, organizations are rapidly adopting a DevOps culture which includes general principles that can also guide many aspects of data science operations.
However, software system maintenance is necessary but not sufficient for data science. There’s a broader (and trickier) set of considerations…
Model and Data Management
Data science product operations have additional considerations beyond standard software product maintenance:
- Monitor the Data: The data comes from the “real world” which is beyond your control and presents unique challenges. Therefore, validate that the incoming data sets are of expected format and that the data comes in acceptable ranges.
- Monitor Model Performance: Software functionality tends to be binary (it works or it doesn’t). Models, however, are probabilistic, so you often can’t say definitively whether the model “is working”. You can still get a good feel by monitoring model performance and checking for unacceptable swings in core metrics such as standard deviation or mean absolute percent error.
- Run A/B Tests: Models can drift to become worse than random noise. They can also (nearly) always be improved. Therefore, during operations, continue to routinely hold out small portions of your population as a control group to test performance against the running model. Occasionally, develop and deploy new test models to measure their performance against the incumbent production model.
- Ensure Proper Model Governance: Regulations in certain industries require companies to be able to explain why a model made certain decisions. And even if you’re not in one of these regulated industries, you will want to be able to trace the specific set of data and the specific model used to evaluate specific outcomes.
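The first two monitoring points above can be sketched as simple checks. The schema, field names, acceptable ranges, and the 25% tolerance below are assumptions for illustration. In practice you would set them with your stakeholders and likely lean on a dedicated validation or monitoring tool.

```python
# Hypothetical schema: field -> (expected type, (min, max) or None for no range).
EXPECTED_SCHEMA = {
    "employee_id": (int, None),
    "tenure_years": (float, (0.0, 60.0)),
    "churn_score": (float, (0.0, 1.0)),
}

def validate_record(record):
    """Return a list of problems found in one incoming record."""
    problems = []
    for field, (expected_type, bounds) in EXPECTED_SCHEMA.items():
        if field not in record:
            problems.append(f"{field}: missing")
            continue
        value = record[field]
        if not isinstance(value, expected_type):
            problems.append(f"{field}: expected {expected_type.__name__}")
            continue
        if bounds is not None and not bounds[0] <= value <= bounds[1]:
            problems.append(f"{field}: {value} outside {bounds}")
    return problems

def mape(actuals, predictions):
    """Mean absolute percent error, skipping zero actuals."""
    pairs = [(a, p) for a, p in zip(actuals, predictions) if a != 0]
    return 100 * sum(abs((a - p) / a) for a, p in pairs) / len(pairs)

def metric_alert(current, baseline, tolerance=0.25):
    """True when a metric moved more than `tolerance` (relative) from baseline."""
    return abs(current - baseline) > tolerance * abs(baseline)

good = {"employee_id": 7, "tenure_years": 3.5, "churn_score": 0.42}
bad = {"employee_id": 8, "tenure_years": -2.0, "churn_score": 1.7}
print(validate_record(good))
print(validate_record(bad))
print(metric_alert(current=14.0, baseline=10.0))
```

Checks like these are cheap to run on every scoring batch, and they tend to catch upstream data changes long before anyone notices a drop in model quality.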
On-going Stakeholder Management
I’m not aware of another published data science life cycle that calls this out specifically. However, in reality, on-going stakeholder management is critical to your product’s success.
Continue to educate your stakeholders and set expectations that the model isn’t magic. To drive adoption, communicate realistic benefits and if needed, provide training to end users. Likewise, warn stakeholders of the risks and shortcomings of the models and how to mitigate these.
Data Science Ops at ACME
Model deployed! Mission accomplished?
You know better. But many stakeholders do not which is why you continually educate them about the model and its implications.
Meanwhile, your data scientists toil to generalize the model across all departments. However, the results come up short and you don’t release this model yet as they work to improve it.
Fortunately, the model specific to the Dynamite team does its job and accurately predicts voluntary churn.
However, HR and managers struggle to find effective ways to mitigate the churn from the at-risk employees. HR asks if your team can identify effective intervention customized to each at-risk employee.
And thus, you kick off another iteration of the data science life cycle…
What are other popular Data Science Life Cycles?
The above generic life cycle is one of the dozens (hundreds?) you can find on-line. We’ll explore some of the more popular ones.
Data Mining Life Cycles
These three classic data mining processes have been thrown under the general umbrella of data science life cycles. All of them hail from the 90s. These tend to be more myopic. Specifically, the KDD Process and SEMMA focus on the data problem and not the business problem. Only CRISP-DM has a deployment phase. None of them have an operations phase.
- Knowledge Discovery in Databases (KDD) Process: This is the general process of discovering knowledge in data through data mining, or the extraction of patterns and information from large datasets using machine learning, statistics, and database systems.
- SEMMA: SAS developed Sample, Explore, Modify, Model, and Assess (SEMMA) to help guide users through tools in SAS Enterprise Miner for data mining problems.
- CRISP-DM: The CRoss-Industry Standard Process for Data Mining is the most popular methodology for data science and advanced analytics projects. It has six steps: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment. It has a broader focus than SEMMA and the KDD Process but likewise lacks the operational aspects of a data science product life cycle.
Modern Data Science Life Cycles
The below life cycles are more modern approaches that are specific to data science. Like the data mining processes, OSEMN is more focused on the core data problem. Most others, especially Domino’s, tend to focus on the fuller solution.
- OSEMN: Standing for Obtain, Scrub, Explore, Model, and iNterpret, OSEMN is a five-phase life cycle. Go to this Towards Data Science post to learn more.
- Microsoft TDSP: The Team Data Science Process combines many modern agile practices with a life cycle similar to CRISP-DM. It has five steps: Business Understanding, Data Acquisition and Understanding, Modeling, Deployment, and Customer Acceptance.
- Domino Data Lab Life Cycle: This life cycle is perhaps most similar to my generic life cycle, in part because it includes a final operations stage. Its six steps are: Ideation, Data Acquisition and Exploration, Research and Development, Validation, Delivery, and Monitoring.
- Lesser-Known Life Cycles: Jeff found several interesting but lesser-known life cycles described in various blog posts. See his post on Data Science Workflows to learn more.
There are numerous data science life cycles to choose from. Most communicate the same basic steps necessary to deliver a data science project but often have a distinct angle.
The angle of this life cycle stresses the need for agility and the broader data science product life cycle.
Regardless of the life cycle you use, combine it with a collaboration process so that your team can effectively coordinate with each other and stakeholders.
Good luck. This journey is a challenge. But it can be fun. Have a blast in your next data science project!