The first two posts in this four-part series explored traditional and agile project management approaches, while the previous post examined two general hybrid approaches that combine the two. None of the seven approaches explored in those posts is new, nor is any of them specific to data science.
So are there new emerging approaches that are data science native?
Yes. There are several. Many of these are variants of CRISP-DM and share its shortcoming of being a process methodology rather than a full-fledged project management approach. However, three stand out.
Emerging Data Science Native Approaches
Microsoft’s Team Data Science Process (TDSP), Domino Data Lab’s Data Science Life Cycle, and the Data Science Process Alliance’s Data Driven Scrum (DDS) are approaches that are both data science native and agile. There are pros and cons specific to each approach but they share the following:
- Data Science Native: Unlike other somewhat modern approaches to project management, these are built for data science and do not attempt to shoe-horn the data science process into something it is not.
- Agile: Agility is key to ensure that the data scientists and stakeholders can iteratively map a solution to the underlying problem.
- Not always necessary: All of these approaches assume that a team is executing a mid- or large-sized project with an intent to deliver production-level code to stakeholders. If these conditions don’t apply, then other approaches might be more appropriate.
- Not recognized: It took a decade or two for Scrum to gain respect as a leading agile framework. Likewise, it might take time for these emerging approaches to be accepted, especially at organizations that manage data science as software.
Team Data Science Process (TDSP)
“TDSP helps improve team collaboration and learning. It contains a distillation of the best practices and structures from Microsoft and others in the industry that facilitate the successful implementation of data science initiatives.”
What is it?
Microsoft introduced the Team Data Science Process (TDSP) in 2016 as “an agile, iterative data science methodology to deliver predictive analytics solutions and intelligent applications efficiently” (Microsoft, 2020).
The life cycle includes five phases:
- business understanding
- data acquisition and understanding
- modeling
- deployment
- customer acceptance
Sound similar to CRISP-DM?
Although TDSP condenses four CRISP phases into two and adds in customer acceptance, the biggest difference comes from TDSP’s more full-fledged and modern approach. It recommends infrastructure, resources, tools, and utilities that leverage modern cloud-based systems and practices like Git version control. Moreover, it defines six project roles and uses common agile practices such as sprints and a backlog with features, user stories, tasks, and bugs.
Pros
- Comprehensive: More than just a process – complete with re-usable templates on GitHub, role definitions, and more.
- Optional Inclusion of Scrum: TDSP can, optionally, be used in conjunction with Scrum (where a sprint goes through all the phases).
- Maintained: Microsoft seems to update its guide and repository every few months.
Cons
- Some Inconsistency: Microsoft sometimes seems to forget to extend its updates to all of its documentation.
- Steep learning curve: Some teams find the comprehensive framework complicated to learn and feel that it imposes too much structure (e.g., the detailed management roles and specific document templates).
- Microsoft Specific Aspects: While much of the framework is independent of the Microsoft technical stack, there are other parts of the framework, especially the aspects focusing on infrastructure, that specifically mention Microsoft products.
The bottom line
For those using Scrum with TDSP’s development process, the fixed-length sprints can be challenging and frustrating for many data scientists. However, TDSP is a solid option for teams that can work with fixed-length sprint cadences and are looking for a comprehensive, modern, and agile approach.
Domino Data Science Lifecycle
“You should have IT, data science, and business involved from the beginning”
-Mac Steele, (former) Director of Product at Domino Data Lab
What is it?
As a modern approach combining CRISP-DM elements with agile approaches, Domino’s Data Science Lifecycle is conceptually similar to Microsoft’s TDSP. However, it has a more extensive process flow but fewer supporting resources. A 25-page whitepaper defines this approach, starting with organizational goals and ending with recommendations on how to scale a solution. The meat of it defines a project framework with six phases:
- Ideation: Defines the problem, scopes the project, and has a go/no go decision point
- Data Acquisition and Prep: Identifies existing data sets, explores the need to acquire new ones, and prepares the data sets for modeling
- Research and Development: Hypothesis testing and modeling
- Validation: Both business and technical validation
- Delivery: Deployment, A/B testing, and user acceptance testing
- Monitoring: Systems and model monitoring
It also names several team roles:
- Data Scientist
- Data Infrastructure Engineer
- Data Product Manager
- Business Stakeholder
- Data Storyteller
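As a rough illustration, the six phases listed earlier in this section, together with the go/no go decision at ideation, can be sketched as a small state machine. This is only a sketch: the names `Phase` and `next_phase` are hypothetical, and the rule that a failed validation loops back to research and development is an assumption made here for illustration, not something the whitepaper is quoted as prescribing.

```python
from enum import Enum


class Phase(Enum):
    """The six lifecycle phases, in the order given in the text."""
    IDEATION = 1
    DATA_ACQUISITION_AND_PREP = 2
    RESEARCH_AND_DEVELOPMENT = 3
    VALIDATION = 4
    DELIVERY = 5
    MONITORING = 6


def next_phase(current, approved=True):
    """Return the phase that follows `current`, or None if the flow ends.

    `approved` is a hypothetical flag for the outcome of a decision point.
    """
    if current is Phase.IDEATION and not approved:
        # "No go" at ideation: the project stops before any data work.
        return None
    if current is Phase.VALIDATION and not approved:
        # Assumption for illustration: failed validation returns to R&D.
        return Phase.RESEARCH_AND_DEVELOPMENT
    if current is Phase.MONITORING:
        # Monitoring is the last phase in the list.
        return None
    return Phase(current.value + 1)
```

For example, `next_phase(Phase.IDEATION, approved=False)` returns `None`, while `next_phase(Phase.IDEATION)` moves the project on to data acquisition and prep.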
Pros
- Based on “what works”: Domino built this process based on experiences with over 20 data science teams.
- Broad team definition: The team is not just a technical team but also involves business stakeholders and a product manager. Every relevant stakeholder (technical or business) participates in pre-project ideation.
- Flexible: Intended as an “a la carte” guide whose practices can be mixed and matched with other approaches.
- No fixed cadence: As Mac Steele, the former Director of Product at Domino, observed, data science doesn’t “magically happen in two-week blocks.”
Cons
- Less comprehensive: Compared to its cousin, TDSP, Domino’s process lacks reproducible templates and detailed definitions.
- Not updated: Domino defined this process in a one-off guide and has not updated it since.
- Team coordination not defined: While the process suggests iterating through its phases many times, it is not clear how the team should decide what is “in” an iteration or how to structure the dialog with the business stakeholder.
The bottom line
Ideal for teams that are not looking for a comprehensive approach but prefer to build up their own approach from common-sense best practices (especially in terms of the phases of a project).
Data Driven Scrum (DDS)
Disclaimer: Jeff co-founded this process so it’s hard to be unbiased in the evaluation.
What is it?
Scrum is the most common agile approach but tends to be unpopular among data scientists for various reasons. One key reason is that Scrum defines fixed-time sprints that aim to deliver potentially shippable increments. Unfortunately, this can short-circuit the experimental nature of data science.
Data Driven Scrum (DDS) counters this by offering a familiar Scrum-like framework with some major exceptions:
- Iterations are variable-length, capacity-focused, and can overlap.
- Instead of delivering potentially shippable backlog items, the iterations focus on data science native concepts (e.g., experiments, questions to be answered). Each of these items is broken down into tasks to create something, observe the results, and analyze those observations.
- Backlog item selection occurs in a more continuous manner.
Beyond that, DDS is nearly an instantiation of Scrum. Teams of 3-9 members work off a backlog to deliver increments. A product owner manages the backlog, and a process master (Scrum master) facilitates the overall process.
DDS also defines four familiar meetings: daily meetings (daily standups), iteration reviews, backlog item selection (sprint planning), and retrospectives. Backlog item selection occurs whenever the team has capacity to start a new iteration (i.e., like the pull-based system of Kanban). All other meetings occur at regular intervals and are independent of iteration cadences.
Likewise, a Kanban board and WIP limits provide a visual way to communicate workflow and identify bottlenecks.
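To make the pull-based selection concrete, here is a minimal sketch of how capacity, rather than a calendar, could trigger the start of a new iteration. The `Team` and `Iteration` classes, the `wip_limit` parameter, and the idea of advancing one task per `tick` are all hypothetical simplifications for illustration; DDS itself does not define any such code.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Iteration:
    """One iteration: a single backlog item split into three task types."""
    item: str
    tasks: list = field(default_factory=lambda: ["create", "observe", "analyze"])

    def work(self):
        """Complete the next task; return True once no tasks remain."""
        self.tasks.pop(0)
        return not self.tasks


class Team:
    def __init__(self, backlog, wip_limit):
        self.backlog = deque(backlog)  # ordered by the product owner
        self.wip_limit = wip_limit     # hypothetical cap on open iterations
        self.in_progress = []
        self.done = []

    def select_items(self):
        """Pull new items only while there is spare capacity."""
        while self.backlog and len(self.in_progress) < self.wip_limit:
            self.in_progress.append(Iteration(self.backlog.popleft()))

    def tick(self):
        """Advance each open iteration by one task, then pull new work."""
        still_open = []
        for iteration in self.in_progress:
            if iteration.work():
                self.done.append(iteration.item)
            else:
                still_open.append(iteration)
        self.in_progress = still_open
        self.select_items()


team = Team(["experiment A", "experiment B", "question C"], wip_limit=2)
team.select_items()  # starts A and B; C waits for capacity
```

In this sketch, "question C" is pulled from the backlog only once one of the first two iterations finishes, which mirrors selection being driven by capacity instead of a sprint boundary.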
Pros
- Capacity-based iterations: Acknowledges the benefits of defined iterations but skips the inflexibility of hard deadlines to deliver potentially shippable increments.
- Fits with Scrum organizations: The similarities with Scrum extend beyond just the process name, which makes DDS attractive to Scrum-friendly organizations.
- Flexible to various lifecycles: Not all data science projects follow the same CRISP-DM-like lifecycle. By avoiding a lifecycle definition, DDS can flex to different types of data science projects.
- Decoupling iterations from reviews and retrospectives: Divorcing reviews and retrospectives from the completion of an iteration enables short, frequent iterations while still maintaining a regular schedule for these ceremonies.
Cons
- Not comprehensive: On the other hand, the lack of a defined lifecycle can be a drawback for teams that are looking for defined steps like those found in CRISP-DM, TDSP, or the Domino Data Science Lifecycle.
- Decoupling iterations from reviews and retrospectives (alternate view): This could make reviews and retrospectives seem stale when they occur.
The bottom line
Ideal for data science teams operating in a Scrum-based organization. Also a viable option for other data science teams who would like a defined structure without time-boxed iterations.