What is a Data Science Life Cycle?
A data science life cycle defines the phases (or steps) in a data science project. Using a well-defined data science life cycle is useful in that it provides a common vocabulary (and shared mental model) of the work to be done to do a data science project.
Commonly Used Data Science Life Cycles
The most commonly used data science project life cycle is CRISP-DM, which was defined in the 1990’s and defines six project phases (Business understanding, Data understanding, Data preparation, Modeling, Evaluation, and Deployment).
A more recent framework, the Team Data Science Prcoess (TDSP) framework, describes five higher-level project phases (and added customer acceptance).
Jeff covers other lessor-known frameworks in his post on data science workflows.
In this post, we focus on Domino’s life cycle.
What is the Domino Data Science Life Cycle?
Data science life cycles fall into two broad categories.
Data Mining Life Cycles: Stemming from the 1990s are the traditional data mining life cycles such as KDDS, SEMMA, and CRISP-DM that tend to focus narrowly on the core data and modeling aspects of the project life cycle.
Modern Data Science Life Cycles: More recently, organizations have started to define their own frameworks that are more modern takes to the general project life cycle in that they:
- integrate agile principles and practices
- acknowledge the multiple team roles of a data science project
- extend the core data and modeling phases to first focus on the business problem and to finish with deployment or even operations (note of the data mining processes, CRISP is the only one that touches on these topics).
The Domino Data Science Life Cycle is a modern life cycle approach. Domino Data Lab, a Silicon Valley vendor that provides a data science platform, crafted its data science project life cycle framework in a 2017 whitepaper. The paper wraps its life cycle around goals, challenges, diagnoses, system recommendations, and role definitions. We’ll focus on the core life cycle.
Data Science Life Cycle Training
Overall Life Cycle Principles
Domino’s data science life cycle is founded on three guiding principles:
- “Expect and embrace iteration” but “prevent iterations from meaningfully delaying projects, or distracting them from the goal at hand”
- “Enable compounding collaboration” by creating components that are reusable in other projects
- “Anticipate auditability needs” and “preserve all relevant artifacts associated with the development and deployment of a model”
Six Stages of the Methodology
The core life cycle itself splits a project into six iterative stages that mirror those of CRISP-DM.
The initial phase puts the “problem first, not data first” by defining the underlying business problem and conducting business analysis activities such as current state process mapping, project ROI analysis, and upfront documentation.
It also incorporates common agile practices including developing a stakeholder-driven backlog and creating deliverable mockups. IT and engineering are looped in early and models might be baselined with synthetic data. The phase ends with a project kick-off.
Ideation mirrors the business understanding phase from CRISP-DM.
II: Data Acquisition and Exploration
Data science teams should identify data sources with help from stakeholders who can provide leads based on their intuition. Decisions are made to capture data or buy data from vendors. Exploratory data analysis is conducted, and the data is prepared for both the current project modeling and as re-usable components for future projects.
This phase incorporates many elements from the data understanding and data preparation phases of CRISP-DM.
III: Research and Development
Similar to the core modeling phase of CRISP-DM or any other data science process, this phase iterates through hypothesis generations, experimentation, and insight delivery.
True to agile principles, Domino recommends starting with simple models, setting a cadence for insight deliveries, tracking business KPIs, and establishing standard hardware and software configurations.
This phase focuses on both business and technical validations and loosely mirrors the evaluation phase from CRISP-DM. True to its principle to “enable compounding collaboration”, Domino stresses the importance of ensuring reproducibility of results, automated validation checks, and documentation. The main goal of this phase is to “ultimately receiving sign-off from stakeholders”.
This is when models become products. Deployment, A/B testing, test infrastructure, and user acceptance testing, similar to those of any software project, are in this phase. Domino recommends additional considerations such as preserving links between deliverable artifacts, flagging dependencies, and developing a monitoring and training plan.
The deployment phase of CRISP-DM is split between this phase and the last one.
Given models’ non-deterministic nature, Domino recommends monitoring techniques that extend beyond standard software monitoring practices. For example, consider using control groups in production models so that you can continually monitor model performance and value creation to the organization. Moreover, automatic monitoring of acceptable output ranges can help identify model issues before they become too pervasive.
Evaluation and Comparison
Domino Life Cycle vs. CRISP-DM
Domino’s model is not as prescriptive as CRISP-DM in defining a technical methodology that spells out each individual step but rather is more informative to guide a team toward better performance.
It incorporates a team-based approach that overcomes one of the major shortcomings of CRISP-DM that implicitly assumes the project is executed by an individual or small team.
Moreover, its modern view (2017) provides several guidelines that were not conceptualized in CRISP-DM (1999). Most notably, it leverages several agile practices such as short iterative deliveries, close stakeholder management, and a product backlog that are commonplace today.
Domino Life Cycle vs. TDSP
Perhaps the most well-known modern data science life cycle is Microsoft’s Team Data Science Process.
Compared to Microsoft, Domino does not provide the sets of re-usable templates and various artifacts that are on Microsoft’s Git Hub pages. However, this may not be a major factor as most teams would have to customize Microsoft’s artifacts anyway.
More importantly, Domino takes a more comprehensive approach in dedicating the end of its life cycle toward operations focus which is mentioned but not extensively detailed in Microsoft’s life cycle.
Domino’s life cycle should not be viewed as mutually exclusive with CRISP-DM or Microsoft’s TDSP; rather its “best practices” approach with “a la carte” elements could augment these or other methodologies as opposed to replace them.
Ad hoc teams or other teams with broken project management processes could use Domino’s approach as a good starting point to conceptualize a comprehensive modern project management methodology that effectively integrates data science, software engineering, and agile approaches.
However, the guide provides several straight forward recommendations that any team could benefit from.
And whether you adopt a Domino-inspired life cycle or another life cycle, be sure to include a collaboration framework to help guide effective team communication.
Lessons from 20 Data Science Team: I interviewed Mac Steele from Domino’s product team who authored much of the whitepaper. Learn his insights in this post.
Data Science Process Alliance: Jeff and I have had a lot of requests for training which is why we’ve helped launch the DSPA which offers Certifications and Consulting Services to improve data science outcomes. Selection and training for data science workflows is a major emphasis.
<Previous: Microsoft Team Data Science Process