What is a Data Science Life Cycle?
A data science life cycle defines the phases (or steps) in a data science project. Using a well-defined data science life cycle is useful in that it provides a common vocabulary (and shared mental model) of the work to be done to do a data science project.
Commonly Used Data Science Life Cycles
The most commonly used data science project life cycle is CRISP-DM, which was defined in the 1990’s and defines six project phases (Business understanding, Data understanding, Data preparation, Modeling, Evaluation, and Deployment).
A more recent framework, the Team Data Science Prcoess (TDSP) framework, describes five higher-level project phases (and added customer acceptance).
Jeff covers other lessor-known frameworks in his post on data science workflows.
In this post, we focus on Domino’s life cycle.
Data Science Life Cycle Training
What is the Domino Life Cycle?
Overall Life Cycle Principles
The methodology is founded on three guiding principles:
- “Expect and embrace iteration” but “prevent iterations from meaningfully delaying projects, or distracting them from the goal at hand”
- “Enable compounding collaboration” by creating components that are reusable in other projects
- “Anticipate auditability needs” and “preserve all relevant artifacts associated with the development and deployment of a model”
Six Stages of the Methodology
Domino’s life cycle splits a project into six iterative stages that mirror those of CRISP-DM.
The initial phase puts the “problem first, not data first” by defining the underlying business problem and conducting business analysis activities such as current state process mapping, project ROI analysis, and upfront documentation. It also incorporates common agile practices including developing a stakeholder-driven backlog and creating deliverable mockups. IT and engineering are looped in early and models might be baselined with synthetic data. The phase ends with a project kick-off. Ideation mirrors the business understanding phase from CRISP-DM.
II: Data Acquisition and Exploration
Data science teams should identify data sources with help from stakeholders who can provide leads based on their intuition. Decisions are made to capture data or buy data from vendors. Exploratory data analysis is conducted, and the data is prepared for both the current project modeling and as re-usable components for future projects. This phase incorporates many elements from the data understanding and data preparation phases of CRISP-DM.
III: Research and Development
Similar to the core modeling phase of CRISP-DM or any other data science process, this phase iterates through hypothesis generations, experimentation, and insight delivery. Domino recommends starting with simple models, setting a cadence for insight deliveries, tracking business KPIs, and establishing standard hardware and software configurations.
This phase focuses on both business and technical validations and loosely mirrors the evaluation phase from CRISP-DM. True to its principle to “enable compounding collaboration”, Domino stresses the importance of ensuring reproducibility of results, automated validation checks, and documentation. The main goal of this phase is to “ultimately receiving sign-off from stakeholders”.
This is when models become products. Deployment, A/B testing, test infrastructure, and user acceptance testing, similar to those of any software project, are in this phase. Domino recommends additional considerations such as preserving links between deliverable artifacts, flagging dependencies, and developing a monitoring and training plan. The deployment phase of CRISP-DM is split between this phase and the last one.
Given models’ non-deterministic nature, Domino recommends monitoring techniques that extend beyond standard software monitoring practices. For example, consider using control groups in production models so that you can continually monitor model performance and value creation to the organization. Moreover, automatic monitoring of acceptable output ranges can help identify model issues before they become too pervasive.
Evaluation and Comparison
Presented in a practical 25-page whitepaper as a series of coordinated best practices, Domino’s data science life cycle is significantly easier to read than CRISP-DM’s small-print 76 page paper. Domino’s model is not as prescriptive as CRISP-DM in defining a technical methodology that spells out each individual step but rather is more informative to guide a team toward better performance. It incorporates a team-based approach that overcomes one of the major shortcomings of CRISP-DM that implicitly assumes the project is executed by an individual or small team. Moreover, its modern view (2017) provides several guidelines that were not conceptualized in CRISP-DM (1999). Most notably, it leverages several agile practices such as short iterative deliveries, close stakeholder management, and a product backlog that are commonplace today.
Domino’s life cycle should not be viewed as mutually exclusive with CRISP-DM or Microsoft’s TDSP; rather its “best practices” approach with “a la carte” elements could augment these or other methodologies as opposed to replace them. However, ad hoc teams or other teams with broken project management processes could use Domino’s approach as a good starting point to conceptualize a comprehensive modern project management methodology that effectively integrates data science, software engineering, and agile approaches.
Lessons from 20 Data Science Team: I interviewed Mac Steele from Domino’s product team who authored much of the whitepaper. Learn his insights in this post.
Data Science Process Alliance: Jeff and I have had a lot of requests for training which is why we’ve helped launch the DSPA which offers Certifications and Consulting Services to improve data science outcomes. Selection and training for data science workflows is a major emphasis.
<Previous: Microsoft Team Data Science Process