“Data science is not a bunch of technologies, but it is a process. It is a process that must be managed as a project. However, CEOs might think that technology is the solution.”
-Carol Choksy, Indiana University
Data science is emerging from the intersection of several up-and-coming technologies, but a critical issue that holds this nascent field back is not technical. Rather, like most endeavors, data science success is dependent on the effective execution of a project.
Yet, industry lacks an understanding of how to manage a data science project: Is it software? Is it research? Or maybe, simply magic?
This post series assesses 10 ways to manage data science projects across four posts.
ad hoc (adjective)
1a: concerned with a particular end or purpose
b: formed or used for specific or immediate problems or needs
2: fashioned from whatever is immediately available: IMPROVISED
What is it?
If you don’t have an established way to manage your project, you’re likely using an ad hoc approach. Project planning might be minimal or completely absent. Without concern for a broader project plan, decisions are made on the fly — often by the most senior data scientist.
How common is ad hoc for data science?
Teams do not explicitly aim to manage projects in an ad hoc manner. However, without established methodologies for managing data science projects, teams generally suffer from a low level of project management maturity and often resort to ad hoc practices. This is particularly common for small and new teams.
Pros
- Flexible: Not bound to a rigid process
- Low overhead: Administrative burdens could be minimal, especially to launch a project
Cons
- Poorly planned: Often leads to rework as key considerations are missed
- Poorly coordinated: Could lead to chaos and unproductive behaviors
- Not reproducible: Generally not documented
- Not scalable: Often just in the lead data scientist’s head
The bottom line
Avoid heavy reliance on ad hoc approaches, except possibly for low-impact, low-risk, one-off projects coordinated by an individual or small team. The use cases that fit this description are diminishing.
“The Gantt chart, because of its presentation of facts in their relation to time, is the most notable contribution to the art of management made in this generation.”
-Wallace Clark in The Gantt Chart (1922)
“[…] Everything related to a project carefully laid out on those massive Gantt charts, every task measured out precisely in hours highlighted in pretty colors flowing down the page like a waterfall. Those charts were beautiful in their precision. They were also complete fabrications.”
-Jeff Sutherland in Doing Twice the Work in Half the Time (2014)
“We’ve all kind of abandoned Gantt charts…we sometimes fall back to some of that stuff in the absence of anything else, but I never see it go well.”
What is it?
Originating from manufacturing and construction, waterfall is a rigid, predictive project management approach that maps the project into distinct phases. The project work flows through these defined phases such as what is laid out below:
Regardless of the terminologies and phases used, waterfall projects start with an initial phase and then cascade sequentially in a forward linear pattern toward the final phase. Revisiting prior phases is counter to a true waterfall approach and, rightfully or not, might be viewed as a consequence of poor planning.
Waterfall has come under fire, especially from the software engineering community, for not being able to map plans to the realities of ever-changing business needs.
How common is waterfall for data science?
I’ve not met a data science team that explicitly states that they follow a waterfall process. However, data science teams often drift back toward waterfall without realizing it. Some teams combine other approaches with waterfall, which makes sense for specific use cases.
Pros
- Thorough: Comprehensive upfront planning can uncover important considerations that might otherwise remain hidden until late in the project.
- Easy to understand: A well laid-out plan with defined timelines might be easier for an executive to sign off on than a flexible roadmap.
Cons
- Poor feedback loop: Stakeholders need something more tangible than a project plan to provide meaningful feedback.
- High overhead: Administratively cumbersome with excessive documentation
- Constrains experimentation: Model experimentation doesn’t fall neatly on a Gantt chart.
- Not realistic: Can you confidently say what task a data scientist will complete in Week 10?
The bottom line
Avoid when possible. Data science projects are inherently iterative and should not be shoehorned into rigid, pre-defined approaches that heavily emphasize upfront planning and cumbersome change control processes. However, some elements of waterfall might be appropriate for specific use cases, such as meeting regulatory needs.
“You may well observe that there is nothing special here and that’s largely true. From today’s data science perspective this seems like common sense. This is exactly the point. The common process is so logical that it has become embedded into all our education, training, and practice.”
-One of CRISP-DM’s founders
What is it?
KDD: Knowledge Discovery in Databases (KDD) is the general process of discovering knowledge in data through data mining, that is, the extraction of patterns and information from large datasets using machine learning, statistics, and database systems. KDD approaches tend to focus on the steps needed to execute data mining rather than on describing a comprehensive project management approach.
CRISP-DM: Formed in 1996 to standardize a data mining process across industries, CRoss-Industry Standard Process for Data Mining (CRISP-DM) is the most well-known KDD approach. CRISP-DM describes six iterative phases: business understanding, data understanding, data preparation, modeling, evaluation, and deployment. Each phase has its own defined tasks and set of deliverables such as documentation and reports. Despite its popularity, CRISP-DM has not been revised since 2000.
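The iterative flow between CRISP-DM's phases can be sketched in a few lines of Python. This is an illustration only, not part of CRISP-DM itself: the phase names come from the CRISP-DM process model, while the `TRANSITIONS` table and the `is_valid_path` helper are hypothetical names introduced here, and the allowed moves are an assumption based on the arrows in the commonly reproduced CRISP-DM diagram.

```python
from enum import Enum


class Phase(Enum):
    BUSINESS_UNDERSTANDING = "Business Understanding"
    DATA_UNDERSTANDING = "Data Understanding"
    DATA_PREPARATION = "Data Preparation"
    MODELING = "Modeling"
    EVALUATION = "Evaluation"
    DEPLOYMENT = "Deployment"


# Assumed moves between phases, based on the commonly reproduced CRISP-DM
# diagram. The backward moves are what make the process iterative.
TRANSITIONS = {
    Phase.BUSINESS_UNDERSTANDING: {Phase.DATA_UNDERSTANDING},
    Phase.DATA_UNDERSTANDING: {Phase.BUSINESS_UNDERSTANDING, Phase.DATA_PREPARATION},
    Phase.DATA_PREPARATION: {Phase.MODELING},
    Phase.MODELING: {Phase.DATA_PREPARATION, Phase.EVALUATION},
    Phase.EVALUATION: {Phase.BUSINESS_UNDERSTANDING, Phase.DEPLOYMENT},
    Phase.DEPLOYMENT: set(),
}


def is_valid_path(path):
    """Return True if every consecutive pair of phases is an allowed move."""
    return all(nxt in TRANSITIONS[cur] for cur, nxt in zip(path, path[1:]))
```

Under these assumptions, a path that loops from evaluation back to business understanding is valid, whereas jumping straight from business understanding to modeling is not. A strict waterfall plan, by contrast, would allow only the forward moves.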
How common is CRISP-DM for data science?
A 2014 poll suggests CRISP-DM is the most popular approach for managing data science projects. I’m not aware of more updated information but anecdotally, CRISP-DM is still the most recognized methodology for data science.
Pros
- Common sense: Data scientists naturally follow a CRISP-DM-like process.
- Cyclical: Supports the iterative nature of data science
- Strong start: Explicitly starts with business understanding, an often-overlooked step
Cons
- Documentation heavy: The full-fledged CRISP-DM approach requires a lot of time-consuming documentation (although most teams seem to skip much of it).
- Could lead to slow starts: The process lends itself to building in horizontal layers, completing one phase for the whole project before starting the next, which could delay business value delivery.
- Not really a project management approach: Does not address how to coordinate a project as a team
- Stale: CRISP-DM has not been updated since its early days and is criticized for not addressing big data and cloud-based systems.
The bottom line
CRISP-DM is good for what it is: a natural description of workflow in data mining projects. I reference it frequently to my project stakeholders to explain the data science life cycle. However, like other KDD approaches, CRISP-DM provides a task-focused approach and fails to address team and communication issues. Thus, CRISP-DM should be combined with other approaches.