The Need for Data Science Methodologies and Frameworks
The field of data science has matured greatly in the past decade. Yet teams often struggle to apply an appropriate data science methodology and team-based collaboration framework. Consider the following three issues:
- Production Models: A one-off model often does not provide sustained value. Rather, organizations often need a sustainable, productized system that delivers model results over a longer period of time. This necessitates a solid data science methodology that expands beyond model development and into machine learning operations.
- Team Approach: Data science is increasingly a team sport. The back-office unicorn data scientist working in isolation is not the norm. Rather, many projects have a diverse team spanning multiple roles. Thus, teams need a modern team-based approach to coordinate their work.
- Agile Approach: Data science is a highly iterative process, especially once you extend beyond the classroom and into the real world, with changing market conditions, technological shifts, and ever-evolving business needs. Long and inflexible upfront planning processes won’t work, which gives rise to agile approaches. But how can you effectively apply agility to data science?
Ad Hoc Approaches
Ad hoc processes focus on delivering a specific implementation without concern for broader impact or repeatable processes. In short, you just “wing it”. This approach may work well for one-off, smaller, and low-impact projects. Think of a toy side project or an academic exercise. Yet the appropriate use cases for ad hoc in the real world are becoming less frequent. Unfortunately, many people still just resort to ad hoc.
Approach | Description | Strengths | Challenges | Best For… |
---|---|---|---|---|
Ad Hoc | Just do it! | • Quick to start | • Not scalable • High rework risk • Difficult for teams | • Small one-off project by one person |
Data Science Methodologies
A data science life cycle (also known as a data science methodology) describes the step-by-step approach you take to deliver a project. Data scientists (even if they have not explicitly studied various methodologies) intuitively understand these steps. Documenting them can help increase repeatability and prevent you from forgetting a step. This is increasingly important in the world of distributed teams that extend beyond data science to areas such as legal or business. There are dozens of different defined data science methodologies. This guide explores the most well-known ones.
Approach | Description | Strengths | Challenges | Best For… |
---|---|---|---|---|
Waterfall | Plan your work. Work your plan | • Easily understood • Matches traditional corporate culture | • Inflexible • Delays testing • Documentation heavy | • Avoid for data science |
KDD | 5 Phases from Selection to Evaluation | • Decent explanation of a core technical data mining project | • Outdated • Ignores teams • Many of the same shortcomings as Waterfall • Ignores biz understanding & deployment | • “Toy” projects with a well-defined scope that don’t need to be productized |
SEMMA | 5 Phases from Sample to Assess | • Decent explanation of a core technical data mining project | • Outdated • Ignores teams • Many of the same shortcomings as Waterfall • Ignores biz understanding & deployment | • “Toy” projects with a well-defined scope that don’t need to be productized |
CRISP-DM | 6 Phases from Business Understanding to Deployment | • Well-known • More comprehensive than KDD, SEMMA • Defined guide | • Outdated • Ignores teams • Many of the same shortcomings as Waterfall | • Teams looking for an established practice |
TDSP | Combines CRISP-DM and Scrum practices | • Comprehensive open-source documentation • Includes Agile concepts • Strong team focus | | • Teams looking to “modernize” CRISP-DM |
Domino | Combines CRISP-DM and Agile practices | • Visual roadmap with clear flow and decision points • Includes practical tips | • More of a concept as opposed to a fully vetted approach | • Teams looking to “modernize” CRISP-DM |
Others | Lesser-known life cycles | • Each includes a novel viewpoint | • Not well-known or vetted | • Good “food for thought” |
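To make the "phases" in the table concrete, here is a minimal sketch that lists the commonly cited phase names for KDD, SEMMA, and CRISP-DM and steps through them in order. The `LIFE_CYCLES` mapping and the `next_phase` helper are hypothetical names introduced only for illustration, and the strictly linear walk is a simplification: in practice, teams loop back between phases.

```python
from typing import Optional

# Commonly cited phase names for three classic life cycles.
# This mapping is illustrative only; it is not an official API of any tool.
LIFE_CYCLES = {
    "KDD": [
        "Selection", "Preprocessing", "Transformation",
        "Data Mining", "Interpretation/Evaluation",
    ],
    "SEMMA": ["Sample", "Explore", "Modify", "Model", "Assess"],
    "CRISP-DM": [
        "Business Understanding", "Data Understanding", "Data Preparation",
        "Modeling", "Evaluation", "Deployment",
    ],
}


def next_phase(life_cycle: str, current: str) -> Optional[str]:
    """Return the phase that follows `current`, or None after the last phase."""
    phases = LIFE_CYCLES[life_cycle]
    idx = phases.index(current)  # raises ValueError for an unknown phase
    return phases[idx + 1] if idx + 1 < len(phases) else None


if __name__ == "__main__":
    for name, phases in LIFE_CYCLES.items():
        print(f"{name}: {len(phases)} phases, {phases[0]} -> {phases[-1]}")
```

Note that only CRISP-DM starts with business understanding and ends with deployment, which is the gap the table flags for KDD and SEMMA.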
Agile Coordination Frameworks
Agility has taken over the software engineering world. Yet it gets mixed reviews for data science.
However, Agile and data science should go hand-in-hand. Don’t focus too much on the specific approach, but rather start with the fundamental principles your team aspires to. From there, build a framework on top of them that defines how you can sustain team collaboration while also being flexible enough to shift the project’s focus.
Here are three agile frameworks that you can consider. Kanban is borrowed from manufacturing, Scrum comes from software, and Data Driven Scrum was designed specifically for data science.
Approach | Description | Strengths | Challenges | Best For… |
---|---|---|---|---|
Kanban | Visualize flow. Minimize work-in-progress. | • Simple • Combines well with other frameworks • Maximizes throughput • Minimizes waste | • Least definitive • Lots of ambiguity | • Starting with a solid core set of principles and building a framework on top of it |
Scrum | Well-known Agile approach focused on fixed-length iterations | • Quick, incremental value focus • Well-defined feedback loop • Strong team focus | • Time-boxing can be restrictive • Often poorly implemented • Management might get in the way | • Agile teams who need discipline provided by fixed time cycles • Radical innovation cultures |
Data Driven Scrum | Agile framework specifically designed for data science teams | • Most of the same benefits as Scrum and Kanban • Caters to experimentation • Relaxes Scrum pain points | • Not as vetted as Scrum • Adds challenges of managing concurrent iterations | • Teams with strong experimental culture • Data science teams that struggled with Scrum |
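As one way to picture Kanban's "minimize work-in-progress" rule from the table above, the sketch below models a board whose columns refuse new items once they hit a limit. The `KanbanBoard` class, the `WipLimitExceeded` exception, and the column names are all hypothetical and exist only for this illustration. Real teams usually enforce the limit by convention or through their tracking tool.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


class WipLimitExceeded(Exception):
    """Raised when adding an item would exceed a column's WIP limit."""


@dataclass
class KanbanBoard:
    # Column name -> maximum number of items allowed (None means no limit).
    wip_limits: Dict[str, Optional[int]]
    columns: Dict[str, List[str]] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Create an empty list for every configured column.
        for name in self.wip_limits:
            self.columns.setdefault(name, [])

    def _check_capacity(self, column: str) -> None:
        limit = self.wip_limits[column]
        if limit is not None and len(self.columns[column]) >= limit:
            raise WipLimitExceeded(f"{column!r} is at its WIP limit of {limit}")

    def add(self, item: str, column: str) -> None:
        self._check_capacity(column)
        self.columns[column].append(item)

    def move(self, item: str, src: str, dst: str) -> None:
        if item not in self.columns[src]:
            raise ValueError(f"{item!r} is not in column {src!r}")
        # Check the destination first so a rejected move leaves the board unchanged.
        self._check_capacity(dst)
        self.columns[src].remove(item)
        self.columns[dst].append(item)


if __name__ == "__main__":
    board = KanbanBoard({"Backlog": None, "In Progress": 2, "Done": None})
    for task in ["clean data", "train baseline", "build dashboard"]:
        board.add(task, "Backlog")
    board.move("clean data", "Backlog", "In Progress")
    board.move("train baseline", "Backlog", "In Progress")
    try:
        board.move("build dashboard", "Backlog", "In Progress")
    except WipLimitExceeded as err:
        print(err)  # the third item must wait until something is finished
```

The limit forces the team to finish work in progress before pulling new work. This is how Kanban aims to maximize throughput and minimize waste.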
Hybrid Approaches
You can mix and match various approaches to design a comprehensive methodology that best suits your team, projects, and organizational needs.
This guide highlights two such hybrid approaches — each serving different use cases.
Approach | Description | Strengths | Challenges | Best For… |
---|---|---|---|---|
Waterfall-Agile | Attempts to combine the best of Agile and Waterfall | • Allows for some flexibility while catering to broader constraints | • “Best of both worlds” can water down the advantages of either | • Highly regulated projects that require rigid administrative processes |
Research & Development | Combines open-research phases followed by structured development | • Gives flexibility for open-ended research • Adds structure when needed to coordinate deliverables | • Difficult to monitor • Can suffer from ad hoc chaos • High trust and discipline required | • Mature teams who don’t need heavy oversight • Research-focused teams needing freedom |