A data science process can make or break a team.
Indeed, we see time and time again that many of the reasons behind data science project failures are not technical in nature but rather stem from process-related issues. Simply throwing compute power and PhDs at the problem doesn’t work.
Rather, the right people, data, technology, and processes are all key ingredients in the recipe for successfully executing data science projects.
Today, we’ll focus on the process ingredient. Namely, a data science process is a set of guidelines that defines how a team should execute a project. These guidelines should cover both: 1) the steps in the project life cycle and 2) the protocols for coordinating work as a team.
What is the Most Common Data Science Process?
We asked you – our readers – this question in a poll in August and September 2020. CRISP-DM was by far the most common response.
This is consistent with similar surveys done by other organizations in years past.
What is CRISP-DM?
The CRoss Industry Standard Process for Data Mining (CRISP-DM) is a process model with six phases.
- Business understanding – What does the business need?
- Data understanding – What data do we have / need? Is it clean?
- Data preparation – How do we organize the data for modeling?
- Modeling – What modeling techniques should we apply?
- Evaluation – Which model best meets the business objectives?
- Deployment – How do stakeholders access the results?
Is CRISP-DM an Effective Data Science Process?
Although CRISP-DM sufficiently describes the data science project life cycle, it falls short of being an effective, comprehensive data science process. Specifically, CRISP-DM:
- Is documentation-heavy and misses a lot of Agile Principles
- Does not include a collaboration framework
- Excludes post-deployment processes
As such, let’s explore an alternative data science process that addresses these shortcomings.
A Comprehensive Data Science Process
This data science process builds on what works for CRISP-DM while expanding its focus to include modern Agile practices, effective team collaboration, and post-deployment maintenance. It combines a CRISP-DM-inspired life cycle of six phases (each with 3-5 steps) with an agile collaboration framework called Data Driven Scrum.
The Data Science Process – Life Cycle
I. Ideate
Have you ever beautifully delivered on a project, only to find out that your project outcome doesn’t solve any problems or create new opportunities? I know I’ve been there.
While we can’t prevent this entirely, we can greatly mitigate the risk with the following four steps:
- Identify project idea – Generate numerous ideas and home in on a specific problem or opportunity.
- Define project goals – Develop a high-level understanding of the business problem and define measurable business and technical goals for the prioritized idea. Read the 10 Data Science Project Metrics post for details.
- Develop high-level project plan – Identify the people/roles, technology, and data needed. Set up a project backlog and roadmap. And assess risks and mitigation plans. See the Data Science Project Checklist as a good starting point.
- Kickoff – Have a defined project go/no-go decision. If a “go”, then host a kickoff session with all relevant stakeholders to communicate the project purpose and plan.
II. Explore
The team needs to gain a deep understanding of the business problem and the data. Without this, you might blindly head down a path that will take you nowhere. Jump back to the Ideate phase if Explore doesn’t produce a promising outlook.
This is often the most extensive phase of a data science project.
- Get data – Do you have the data? Get access to it. Can you generate the data? Start capturing it. Can you buy the data? Negotiate to buy it. Whichever route is necessary, the objective is to get the appropriate quality and quantity of the right type of data.
- Explore data – Develop descriptive statistics to get a sense of the data’s “look and feel”. Assess whether the data is clean (it probably isn’t) and whether/how to prepare the data for use.
- Engineer dataset – Develop datasets ready for modelling. This involves data cleaning and the construction of new datasets. If the project requires a feed that updates, develop the initial stages of a data pipeline.
- Share and discuss data insights – Uncover interesting insights that might provide early project value to the stakeholders. Develop artifacts such as dashboards and slides to facilitate stakeholder interaction.
- Refine goals and plan – Based on these early conversations and potential discoveries, refine the project goals and plan.
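As a rough illustration of the Explore data step, here is a minimal Python sketch that profiles one numeric column: descriptive statistics plus a missing-value rate as a first cleanliness check. The function name `profile_column` and the sample `monthly_spend` values are hypothetical, and a real project would more likely lean on a dataframe library than on the standard library used here.

```python
import math
import statistics


def profile_column(values):
    """Summarize one numeric column, treating None and NaN as missing."""
    present = [
        v for v in values
        if v is not None and not (isinstance(v, float) and math.isnan(v))
    ]
    missing = len(values) - len(present)
    summary = {
        "count": len(values),
        "missing": missing,
        "missing_rate": missing / len(values) if values else 0.0,
    }
    if present:
        summary.update(
            mean=statistics.mean(present),
            median=statistics.median(present),
            min=min(present),
            max=max(present),
        )
    if len(present) > 1:
        # Sample standard deviation needs at least two observations.
        summary["stdev"] = statistics.stdev(present)
    return summary


# Hypothetical sample: monthly spend per customer, with two gaps.
monthly_spend = [120.0, 80.0, None, 100.0, float("nan"), 100.0]
report = profile_column(monthly_spend)
print(report)
```

A high missing rate or an implausible minimum or maximum is usually the cue to move on to the Engineer dataset step before any modeling starts.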
III. Model
This is the core machine learning life cycle. Ironically, for some projects, this might also be the shortest phase. Data scientists often jump back to II. Explore multiple times before advancing beyond this phase.
- Engineer features – Identify, define, and refine features that can be used by the model. This may overlap with Engineer dataset (Phase II, Step 3).
- Train offline model – Split the data into training, validation, and test sets. Train the model on the training data, use the validation set to tune and compare candidates, and hold back the test set for the final performance assessment.
- Evaluate offline model – Assess the model’s performance based on pre-defined evaluation criteria. Consider standard metrics like R² or RMSE, model processing time and costs, and business metrics. Discuss the business impact with stakeholders to assess whether the offline model performs well enough to proceed toward validation.
- Select top performing model(s) – Select the most appropriate model based on the model performance, project constraints, and project goals.
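To make the train and evaluate steps concrete, here is a minimal sketch of a three-way split together with the two standard metrics named above. The 60/20/20 split fractions, the seed, and the sample predictions are assumptions chosen for illustration, not recommendations.

```python
import math
import random


def split_dataset(rows, train_frac=0.6, val_frac=0.2, seed=42):
    """Shuffle rows and split them into training, validation, and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains is the test set
    return train, val, test


def rmse(actual, predicted):
    """Root mean squared error: typical size of a prediction error."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(actual, predicted):
    """Share of the variance in `actual` explained by `predicted`."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


rows = list(range(100))  # stand-in for 100 labeled records
train, val, test = split_dataset(rows)
print(len(train), len(val), len(test))

# Hypothetical test-set targets and model predictions.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 9.0]
print(rmse(actual, predicted), r_squared(actual, predicted))
```

A purely random split like this one suits independent records. Time-ordered data usually calls for a chronological split instead, so that the test set really is "the future".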
If this is the first time through the broader life cycle loop, you may need to set up a minimally viable deployment mechanism – typically a simplified version of Step 1 from the Deploy phase. Then proceed to Validate.
IV. Validate
Simply achieving high accuracy on an offline test set in the Model phase isn’t enough. Rather you want to validate whether the model performs on live data and whether the overall system satisfies the business needs. Therefore, the Validate phase releases the model “out into the wild” – typically as a limited test. Here, we follow the scientific method.
- Design experiment – Define your hypothesis and how you will test it. A common setup is to randomly assign users/systems into an experimental group whose journeys are defined in part from the model and a holdout group who are not impacted by the model.
- Run experiment – Per Tom Petty, “The waiting is the hardest part”. But be patient. Monitor the preliminary results to ensure the experiment is running as expected, but don’t focus too much on interpretation until the experiment runs its course.
- Evaluate performance – Analyze both the technical model performance and the impact on business metrics. Ensure the experimental results are valid.
- Determine next steps – Did the model successfully drive business outcomes? What additional unintended factors cropped up as a result of the experiment? Often the first pass won’t meet expectations, and you may need to loop back to prior phases.
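The experiment setup described above can be sketched in a few lines of Python. This is only one possible design: it assumes a binary outcome such as conversion, assigns users at random to an experimental or holdout group, and compares the two rates with a two-proportion z-test. The function names, the 50/50 share, and the example counts are all hypothetical.

```python
import math
import random


def assign_groups(user_ids, treatment_share=0.5, seed=7):
    """Randomly assign each user to the experimental or the holdout group."""
    rng = random.Random(seed)
    groups = {"experiment": [], "holdout": []}
    for uid in user_ids:
        key = "experiment" if rng.random() < treatment_share else "holdout"
        groups[key].append(uid)
    return groups


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Return (z, two-sided p-value) for the difference in two rates."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_err
    # Two-sided tail probability of a standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


groups = assign_groups(range(2000))
print(len(groups["experiment"]), len(groups["holdout"]))

# Hypothetical outcome: 120 of 1000 conversions with the model, 90 of 1000 without.
z, p_value = two_proportion_z_test(120, 1000, 90, 1000)
print(round(z, 2), round(p_value, 3))
```

Fixing the hypothesis, the metric, and the sample size before the experiment starts is what keeps the later evaluation honest. Peeking at the p-value every day and stopping at the first "significant" result inflates the false-positive rate.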
V. Deploy
Most models need to be deployed in production to provide value. You should already know many of the system development requirements from the earlier planning. Depending on the project urgency and staff availability, you might execute much of the Deploy phase in parallel with the prior two phases.
- Build core deployment pipeline – Develop a production-grade system that executes the full end-to-end machine learning system. This may include data capture, data processing, feature engineering, modeling, and services to make the model scores readily available.
- Build ML operations systems – As if the above wasn’t enough, data sometimes will not load, the data will drift, the model performance will degrade, and sometimes the serving systems just crash. As such, build the necessary set of general software and ML-specific systems to help you keep the model and supporting systems running.
- Facilitate change management – This step is absent from the other defined data science process life cycles but is often critical. Humans will be impacted in some way by most models. And change is hard. As such, educate your stakeholders on how the model will impact their jobs or lives and help them get comfortable adapting to these changes.
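One of the ML-specific systems mentioned in the operations step is a data drift check. The post does not prescribe a method, so the sketch below swaps in the Population Stability Index (PSI), a commonly used drift score, purely as an example. The bin count, the `eps` floor for empty bins, and the sample values are assumptions.

```python
import math


def population_stability_index(baseline, current, bins=10, eps=1e-6):
    """Compare the distribution of `current` against `baseline` for one feature."""
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for a constant baseline

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clamp outliers into the edge bins
            counts[idx] += 1
        # Floor each share at eps so the logarithm below is always defined.
        return [max(c / len(values), eps) for c in counts]

    expected = proportions(baseline)
    actual = proportions(current)
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))


# Hypothetical feature values at training time and in production.
training_values = [i / 100 for i in range(100)]
production_values = [v + 0.5 for v in training_values]

print(population_stability_index(training_values, training_values))
print(population_stability_index(training_values, production_values))
```

A frequently quoted rule of thumb treats a PSI above roughly 0.25 as a substantial shift, but the alert threshold is something each team should calibrate on its own data.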
VI. Operate
Many data science process life cycles close at a deployment or even earlier at a validation phase. However, some extend beyond the core project and include a monitoring phase. Yet, “monitoring” is a passive term and sometimes you need to actively intervene to improve the model performance or broader system. As such, I label this phase as “operate”.
- Monitor model and systems – Observe through your model monitoring systems how the model and the overall supporting ML operational systems are performing. Push notifications should alert you when anything has gone awry. You should also occasionally actively study performance for more subtle issues. Audit model outcomes to ensure compliance with regulations and to avoid possible ethical violations.
- Maintain software systems – Just like any software system, you will need to make security updates, patch buggy systems, improve laggy service performance, and occasionally upgrade systems to take advantage of newer technologies.
- Retrain model as needed – Your model is most performant right after you deploy it. It will need tweaking over time. You could retrain on a set cadence or whenever the model accuracy dips below a certain threshold. Consider shadow deployments to safely test a challenger model against the champion. Be ready to roll back a model if it is misbehaving.
Read the ML Operations post for more detail.
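The two retraining triggers described above, a set cadence and an accuracy threshold, can be combined in a small policy check. This is a minimal sketch: the `RetrainPolicy` class, the 30-day cadence, and the 0.85 accuracy floor are hypothetical values, not recommendations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrainPolicy:
    max_days_between_trainings: int = 30  # hypothetical cadence
    min_accuracy: float = 0.85            # hypothetical accuracy floor


def retrain_reasons(days_since_training, live_accuracy, policy=RetrainPolicy()):
    """Return the triggers that fired; an empty list means no retraining is due."""
    reasons = []
    if days_since_training >= policy.max_days_between_trainings:
        reasons.append("cadence")
    if live_accuracy < policy.min_accuracy:
        reasons.append("accuracy")
    return reasons


print(retrain_reasons(days_since_training=10, live_accuracy=0.91))
print(retrain_reasons(days_since_training=45, live_accuracy=0.80))
```

Returning the reasons rather than a bare boolean makes it easier to log why a retraining job started, which helps when auditing model changes later.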
The Data Science Process – Coordination Framework
Most well-known data science processes cover just a life cycle like the above. That might suffice for individual or academic projects. However, data science is increasingly becoming a team sport. And a modern data science team needs to define both its project life cycle and its coordination framework. Added together, the life cycle and coordination framework create a comprehensive data science process that can guide the team toward successful project execution.
What is a Coordination Framework?
Whereas the life cycle defines the steps necessary to complete a project, the coordination framework defines how the team coordinates these various steps. Specific aspects include:
- Principles and values – What overarching themes does the team subscribe to?
- Team Roles – Who is responsible for which tasks?
- Meetings – Who meets, at what cadence, for what purpose, and for how long?
- Artifacts – How does the team represent items such as backlogs, Kanban boards, and roadmaps?
- Cadences – How frequently does the team cycle through a product increment or project phase?
- Other protocols – What are the “rules of engagement” that the team follows for specific situations? Examples may be Git procedures, conflict resolution tactics, or prioritization practices.
What are some Example Frameworks?
There are several. However, an effective data science process is built on top of Agile principles that focus on small, incremental deliverables and quick planning cycles that can flex to the ever-changing business landscape. Three common agile collaboration frameworks for data science teams include:
- Kanban – Simple, lightweight process centered on a highly visible board which describes the current flow of work
- Scrum – A popular software development framework based on fixed-time product releases
- Data Driven Scrum – A variant of Scrum designed specifically for data science with capability-based iterations
What is Data Driven Scrum?
While other collaboration options can work well, this data science process combines the life cycle from above with Data Driven Scrum because it is specifically designed as an agile data science process. It is a variant of Scrum with flexible-length and possibly overlapping iterations. It defines 4 principles, 3 artifacts, 3 roles, and 5 events.
4 Principles
DDS generally follows the principles of the Agile Manifesto. It also explicitly defines four principles.
- Agile is a sequence of iterative experimentation and adaptation cycles
- Each iteration has an idea or experiment with defined evaluation metrics to assess the results
- Each iteration should go from an initial idea, through implementation, and to the analysis
- The iteration closes with the end of the empirical process
3 Artifacts
- Product Backlog Item – Specific requests to create, observe, and analyze (usually potential questions or hypotheses to answer)
- Item Backlog – Prioritized wishlist of Product Backlog Items
- Task Board – A visual Kanban representation of upcoming, in-progress, and recently completed work items
3 Roles
- Product Owner – The “voice of the client” who defines product increments, brainstorms and prioritizes the Product Backlog Items
- Process Expert – Facilitates the overall DDS process
- Data Science Team Members – Members such as data scientists and software engineers who work to create artifacts (e.g. models) to answer the questions / experiments
5 Events
- The Iteration – Collection of one or more Product Backlog Items combined into a single experiment or question
- Backlog Item Selection – Develop the new iteration plan
- Daily Meeting – 15-minute “standup” to plan today’s work
- Iteration Review – Occurs on a regular basis (e.g. weekly) to review and demo the work “done”
- Retrospective – Process improvement session. Occurs on a regular cadence (e.g. monthly)
The Data Driven Scrum Process
Wrap Up
To fully execute projects, a data science team should combine both a data science process life cycle and an agile collaboration framework. There are several ways to successfully do this. This post defined a life cycle that leveraged a variant of CRISP-DM and Data Driven Scrum.
However, there’s no one-size-fits-all approach. Rather, the data science process your team implements should be unique to your team, broader environment, and the types of projects you’re executing. Consider two extreme examples. A 12-member team at a highly regulated pharmaceutical company might use a process that looks very different from that of a small marketing tech startup team with two data scientists. Yet, both should have some common foundational elements that effectively combine the data science life cycle and an agile collaboration framework.
Learn More
The content in this article is inspired by multiple sources:
- The CRISP-DM Guide
- Domino Data Labs Process
- Microsoft’s Team Data Science Process
- Post on effective data science processes