Data Science Methodologies and Frameworks Guide

The Need for Data Science Methodologies and Frameworks

The field of data science has matured greatly in the past decade. And yet, teams often struggle to apply an appropriate data science methodology and team-based collaboration framework. Consider the following three issues:

  • Production Models: A one-off model often does not provide sustained value. Rather, organizations often need a sustainable, productized system that delivers model results over a longer period of time. This requires a solid data science methodology that extends beyond model development and into machine learning operations.
  • Team Approach: Data science is increasingly a team sport. The back-office unicorn data scientist working in isolation is not the norm. Rather, many projects have a diverse team with multiple roles. Thus, teams need a modern team-based approach to coordinate their work.
  • Agile Approach: Data science is a highly iterative process, especially once you extend beyond the classroom and into the real world, with its changing market conditions, technological shifts, and ever-evolving business needs. Long and inflexible upfront planning processes won’t work, which gives rise to agile approaches. But how can you effectively apply agility to data science?

Ad Hoc Approaches

Ad hoc processes focus on delivering a specific implementation without concern for broader impact or repeatable processes. In short, you can just “wing it”.

This approach may work well for one-off, smaller, and low-impact projects. Think of a toy side project or an academic exercise.

Yet the appropriate use cases for ad hoc approaches in the real world are becoming less frequent. Unfortunately, many people still resort to them.

Approach: Ad Hoc
Description: Just do it!
Strengths:
  • Quick to start
Challenges:
  • Not scalable
  • High rework risk
  • Difficult for teams
Best for:
  • Small one-off project by one person

Data Science Methodologies

A data science life cycle (also known as a data science methodology) describes the step-by-step approach you take to deliver a project. Data scientists intuitively understand these steps, even if they have not explicitly studied the various methodologies. Documenting the steps can help increase repeatability and prevent you from forgetting one. This is increasingly important in a world of distributed teams that extend beyond data science to areas such as legal or business.

There are dozens of different defined data science methodologies. This guide explores the most well-known.

Approach: Waterfall
Description: Plan your work. Work your plan.
Strengths:
  • Easily understood
  • Matches traditional corporate culture
Challenges:
  • Inflexible
  • Delays testing
  • Documentation heavy
Best for:
  • Avoid for data science

Approach: KDD
Description: 5 phases from Selection to Evaluation
Strengths:
  • Decent explanation of a core data mining technical project
Challenges:
  • Outdated
  • Ignores teams
  • Many of the same shortcomings as Waterfall
  • Ignores business understanding and deployment
Best for:
  • “Toy” projects with a well-defined scope that don’t need to be productized

Approach: SEMMA
Description: 5 phases from Sample to Assess
Strengths:
  • Decent explanation of a core data mining technical project
Challenges:
  • Outdated
  • Ignores teams
  • Many of the same shortcomings as Waterfall
  • Ignores business understanding and deployment
Best for:
  • “Toy” projects with a well-defined scope that don’t need to be productized

Approach: CRISP-DM
Description: 6 phases from Business Understanding to Deployment
Strengths:
  • Well-known
  • More comprehensive than KDD and SEMMA
  • Defined guide
Challenges:
  • Outdated
  • Ignores teams
  • Many of the same shortcomings as Waterfall
Best for:
  • Teams looking for an established practice

Approach: TDSP
Description: Combines CRISP-DM and Scrum practices
Strengths:
  • Comprehensive open-source documentation
  • Includes Agile concepts
  • Strong team focus
Challenges:
  • (none listed)
Best for:
  • Teams looking to “modernize” CRISP-DM

Approach: Domino
Description: Combines CRISP-DM and Agile practices
Strengths:
  • Visual roadmap with clear flow and decision points
  • Includes practical tips
Challenges:
  • More of a concept than a fully vetted approach
Best for:
  • Teams looking to “modernize” CRISP-DM

Approach: Others
Description: Lesser-known life cycles
Strengths:
  • Each includes a novel viewpoint
Challenges:
  • Not well-known or vetted
Best for:
  • Good “food for thought”
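
One lightweight way to document a life cycle so that no step is forgotten is to write its phases down as data and check progress against them. The sketch below is only an illustration, not part of any of the methodologies: the phase names are the ones commonly published for KDD, SEMMA, and CRISP-DM, while the `LIFE_CYCLES` dictionary and the `missing_phases` helper are hypothetical names introduced here.

```python
# Hypothetical checklist helper: phase names follow the commonly published
# definitions of each life cycle; the structure itself is an assumption.
LIFE_CYCLES = {
    "KDD": [
        "Selection",
        "Preprocessing",
        "Transformation",
        "Data Mining",
        "Interpretation/Evaluation",
    ],
    "SEMMA": ["Sample", "Explore", "Modify", "Model", "Assess"],
    "CRISP-DM": [
        "Business Understanding",
        "Data Understanding",
        "Data Preparation",
        "Modeling",
        "Evaluation",
        "Deployment",
    ],
}


def missing_phases(life_cycle, completed):
    """Return the phases of `life_cycle` not yet completed, in life cycle order."""
    done = set(completed)
    return [phase for phase in LIFE_CYCLES[life_cycle] if phase not in done]


if __name__ == "__main__":
    # A project that jumped straight from business understanding to modeling.
    print(missing_phases("CRISP-DM", ["Business Understanding", "Modeling"]))
```

In practice these life cycles are iterative, so a real checklist would likely allow phases to be revisited rather than ticked off once.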

Agile Coordination Frameworks

Agility has taken over the software engineering world. Yet it gets mixed reviews for data science.

However, Agile and data science should go hand-in-hand. Don’t focus too much on the specific approach, but rather start with the fundamental principles your team aspires to. From there, build a framework on top of them that defines how you can sustain team collaboration while also being flexible enough to shift the project’s focus.

Here are three agile frameworks that you can consider. Kanban is borrowed from manufacturing, Scrum comes from software, and Data Driven Scrum was designed specifically for data science.

Approach: Kanban
Description: Visualize flow. Minimize work-in-progress.
Strengths:
  • Simple
  • Combines well with other frameworks
  • Maximizes throughput
  • Minimizes waste
Challenges:
  • Least definitive
  • Lots of ambiguity
Best for:
  • Starting with a solid core set of principles and building a framework on top of it

Approach: Scrum
Description: Well-known Agile approach focused on fixed-length iterations
Strengths:
  • Quick, incremental value focus
  • Well-defined feedback loop
  • Strong team focus
Challenges:
  • Time-boxing can be restrictive
  • Often poorly implemented
  • Management might get in the way
Best for:
  • Agile teams who need discipline provided by fixed time cycles
  • Radical innovation cultures

Approach: Data Driven Scrum
Description: Agile framework specifically designed for data science teams
Strengths:
  • Most of same benefits of Scrum and Kanban
  • Caters to experimentation
  • Relaxes Scrum pain points
Challenges:
  • Not as vetted as Scrum
  • Adds challenges of managing concurrent iterations
Best for:
  • Teams with strong experimental culture
  • Data science teams that struggled with Scrum
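
Kanban's two ideas listed above, visualizing flow and minimizing work-in-progress, can be illustrated with a tiny board model. This is a minimal sketch under stated assumptions: the `KanbanBoard` class, the `WipLimitExceeded` exception, and the column names are all hypothetical, and real Kanban tools offer far more than this.

```python
from dataclasses import dataclass, field


class WipLimitExceeded(Exception):
    """Raised when a card would push a column past its work-in-progress limit."""


@dataclass
class KanbanBoard:
    # Column name -> maximum number of cards allowed (None means unlimited).
    wip_limits: dict
    columns: dict = field(init=False)

    def __post_init__(self):
        # Start every column empty, in the order the limits were given.
        self.columns = {name: [] for name in self.wip_limits}

    def add(self, card, column):
        """Place a card in a column, refusing if the column is at its limit."""
        limit = self.wip_limits[column]
        if limit is not None and len(self.columns[column]) >= limit:
            raise WipLimitExceeded(f"{column!r} is at its limit of {limit}")
        self.columns[column].append(card)

    def move(self, card, source, target):
        """Move a card between columns; the card stays put if the target is full."""
        if card not in self.columns[source]:
            raise ValueError(f"{card!r} is not in {source!r}")
        self.add(card, target)  # raises before the card leaves its source column
        self.columns[source].remove(card)


if __name__ == "__main__":
    board = KanbanBoard({"To Do": None, "In Progress": 2, "Done": None})
    for task in ["clean data", "train baseline", "write report"]:
        board.add(task, "To Do")
    board.move("clean data", "To Do", "In Progress")
    board.move("train baseline", "To Do", "In Progress")
    try:
        board.move("write report", "To Do", "In Progress")
    except WipLimitExceeded as exc:
        print("Blocked:", exc)
```

The limit forces the team to finish something before starting new work, which is the mechanism behind the throughput and waste points in the list above.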

Hybrid Approaches

The reality is that you can mix and match various approaches to design a comprehensive methodology that best suits your team, projects, and organizational needs.

This guide highlights two such hybrid approaches — each serving different use cases.

Approach: Waterfall-Agile
Description: Attempts to combine best of Agile and waterfall
Strengths:
  • Allows for some flexibility while catering to broader constraints
Challenges:
  • “Best of both worlds” can water down the advantages from either
Best for:
  • Highly-regulated projects that require rigid administrative processes

Approach: Research & Development
Description: Combines open-research phases followed by structured development
Strengths:
  • Gives flexibility for open-ended research
  • Adds structure when needed to coordinate deliverables
Challenges:
  • Difficult to monitor
  • Can suffer from ad hoc chaos
  • High trust and discipline required
Best for:
  • Mature teams who don’t need heavy oversight
  • Research-focused teams needing freedom

Training and Certification

Team Lead Courses

We cover data science life cycles and agile coordination frameworks in depth as part of the Data Science Team Lead courses.

By understanding and applying these concepts, you’ll be able to improve your team processes and deliver data science outcomes.

Advance your career by earning the Team Lead certification.