What is Waterfall?
Waterfall, also referred to as the classic life cycle or traditional project management, originated in manufacturing and construction and was applied to software engineering projects starting in the 1960s. A waterfall project flows through defined phases such as those shown in the diagram to the right. Some waterfall models use variations of these phases, which might include conception, initiation, communication, planning, analysis, construction, development, testing, or deployment.
Regardless of the terminologies and phases used, waterfall processes start with an initial phase and then cascade sequentially in a forward linear pattern toward the final phase. Revisiting prior phases is counter to a true waterfall approach and, rightfully or not, might be viewed as a consequence of poor planning. Work breakdown structures (WBS) are often mapped out in Gantt charts with each row representing a task and each column a time period (below).
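The row-per-task, column-per-period layout and the strictly sequential cascade can be sketched in a few lines of Python. This is a minimal illustration only: the phase names and durations are hypothetical, and `Task`, `waterfall_schedule`, and `render_gantt` are names made up for this example rather than part of any project-management tool.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    start: int     # first week the task occupies (0-indexed)
    duration: int  # length in weeks

    @property
    def end(self) -> int:
        return self.start + self.duration


def waterfall_schedule(phases):
    """Schedule phases strictly in sequence: each starts when the previous one ends."""
    tasks, cursor = [], 0
    for name, duration in phases:
        tasks.append(Task(name, cursor, duration))
        cursor += duration
    return tasks


def render_gantt(tasks):
    """Return a text chart: one row per task, one character column per week."""
    width = max(len(t.name) for t in tasks)
    rows = []
    for t in tasks:
        bar = " " * t.start + "#" * t.duration
        rows.append(f"{t.name:<{width}} |{bar}")
    return "\n".join(rows)


# Hypothetical phases and durations (in weeks), for illustration only.
PHASES = [
    ("Planning", 2),
    ("Analysis", 3),
    ("Construction", 5),
    ("Testing", 2),
    ("Deployment", 1),
]

schedule = waterfall_schedule(PHASES)
print(render_gantt(schedule))
```

Because every phase starts only when its predecessor ends, lengthening any one duration in `PHASES` pushes all later rows to the right, which is the kind of slippage a Gantt chart makes visible.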
Does Waterfall work for Data Science?
The short answer is no. Waterfall is most effective when the technology is well understood, the product market is stable, and requirements are unlikely to change much during the course of the project (Pressman & Maxim, 2015). Because data science projects rarely meet any of these criteria, a waterfall approach is rarely appropriate for data science.
However, some practitioners find that a structured waterfall approach works well for certain phases of a data science project, such as planning or validation (see Bimodal: Waterfall-Agile Mix).
Strengths
- Waterfall provides a clearly-defined and easily-understood structure for a project.
- Gantt charts facilitate communication, give confidence to stakeholders, and highlight time slippages relative to plan (Clark, 1922).
- In theory, by planning and documenting extensively upfront, the timing and details of every step are well-known.
- Locking in requirements early prevents “scope creep” from overloading the project.
- During the implementation phase, typically the most extensive part of a project, the team focuses on well-defined responsibilities, project managers avoid conflicts with plenty of lead-time, and project sponsors objectively judge the project’s progress relative to the plan (Pressman & Maxim, 2015).
Weaknesses and Challenges
Waterfall has come under fire, especially from the software engineering community, for not being able to map plans to the realities of ever-changing business needs. “The trouble is, once that beautifully elegant plan meets reality, it falls apart” (Sutherland, 2014). This is especially true for projects like data science where waterfall may lead to:
- A False Sense of Clear Business Requirements: Rarely is the business able to articulate all its needs prior to the start of a project.
- A False Comfort of Understanding the Problem Space: Even if business requirements are clear, data scientists need to be able to explore their problem space before being able to determine if the problem is solvable and, if so, how they should execute their work.
- Delayed Value Delivery: Extensive up-front planning delays the start of the execution phase, and the entire project is delivered only at the end. Consequently, stakeholders gain no value until the project is complete.
- Low-Quality Feedback: Without access to functioning pieces of the product, users are unable to provide effective feedback.
- Inaccurate Timelines: The time required to complete certain tasks in data science is unknown, rendering the project timeline inaccurate.
- Increased Cost of Errors: Delaying verification and validation until after the development phase makes poor feature design and bugs more expensive and difficult to resolve.
- Higher Overhead: Extensive documentation is time-consuming and does not directly add value.
Avoid waterfall and be on the lookout for the “‘pull of waterfall’ where supposedly agile projects take on the characteristics of the waterfall” (Jurney, 2017). For example, the CRISP-DM methodology is often unknowingly implemented as a waterfall process that focuses on documentation and on delivering large horizontal slices of work (e.g. thoroughly conduct data engineering first, comprehensive modeling second, in-depth analysis third). This defers business value, in contrast to lining up thin vertical slices of value delivery (e.g. deliver a basic model as quickly as possible and refine it iteratively in future cycles).
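The difference between horizontal and vertical slices can be made concrete with a rough sketch. The effort figures and slice count below are hypothetical, and the model is deliberately idealized: it assumes the work divides evenly across slices with no extra overhead, which real projects will not match exactly.

```python
# Hypothetical effort (in weeks) per stage for the whole project.
STAGE_EFFORT = {"data engineering": 6, "modeling": 4, "analysis": 2}
NUM_SLICES = 4  # assumed number of thin vertical slices


def horizontal_delivery_weeks(stage_effort):
    """Each stage is finished in full before the next; value arrives once, at the end."""
    return [sum(stage_effort.values())]


def vertical_delivery_weeks(stage_effort, num_slices):
    """Each slice does an equal share of every stage; value arrives after every slice."""
    slice_length = sum(stage_effort.values()) / num_slices
    return [slice_length * (i + 1) for i in range(num_slices)]


horizontal = horizontal_delivery_weeks(STAGE_EFFORT)
vertical = vertical_delivery_weeks(STAGE_EFFORT, NUM_SLICES)

print("Horizontal slices deliver at week(s):", horizontal)
print("Vertical slices deliver at week(s):  ", vertical)
```

Under these assumptions the total effort is identical in both cases. What changes is when stakeholders first see something usable, and how many chances they get to give feedback along the way.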
If you are committed to delivering through a phased approach, break the project into a series of smaller waterfall projects as per an incremental model. More effective feedback loops can be built into incremental models by “adding an explicit opportunity for feedback after each increment or iteration” as per an iterative model (Hoatle & Wilson, 2017). As explained in the following section, a series of short delivery iterations with feedback would be classified as agile project management, which better accommodates the uncertain nature of most data science projects.
While a pure waterfall approach should be avoided, elements of waterfall can be mixed with agile approaches to form a bimodal approach.
Traditional Approaches Similar to Waterfall: There are several data science methodologies that are inspired by waterfall and feature a step-by-step approach to data mining projects. These include:
- KDD and Data Mining: The general approach to knowledge discovery in databases (KDD). These approaches are represented with 5 to 7 sequential phases and tend to focus on the core technical processes.
- SEMMA: A specific KDD approach with 5 sequential phases developed by SAS Institute.
- CRISP-DM: A six-phase sequential approach that is the most comprehensive of the waterfall-inspired data science approaches. It remains the most popular approach for data science projects.
But to cater to the ever-shifting realities of team-based data science projects, you should consider more modern approaches to managing your data science process.