During the past month, we conducted a poll to see which project management framework teams use to execute their data science projects.
Based on our survey of 109 respondents, CRISP-DM was the most commonly used data science process framework, used by about half of the respondents. It was followed by Scrum, Kanban, and “my own/my organization’s”. The results of our survey are shown in the chart below.
A quick review of the different alternatives
- CRISP-DM: The CRoss Industry Standard Process for Data Mining (CRISP-DM) is a process model with six phases (business understanding, data understanding, data preparation, modeling, evaluation and deployment) that naturally describes the data science life cycle.
- Scrum: Scrum is the most commonly used framework for software development projects and the de facto agile project management framework. It divides a project into a series of mini-projects, each of a consistent, fixed length, called a sprint. Scrum also defines meetings and roles to help guide a team in executing a project.
- Kanban: Kanban’s two key principles are to (1) visualize the flow and (2) minimize work-in-progress. In short, by limiting the number of tasks in progress at the same time, Kanban enables agility.
- My Own: There is a custom framework, either defined by the team lead, or more broadly across the organization, which is not based on one of the standard methodologies.
- SEMMA: Developed by SAS, SEMMA defines 5 phases of a project (Sample, Explore, Modify, Model, and Assess). Although designed to help guide users through tools in SAS Enterprise Miner for data mining problems, SEMMA is often considered to be a general data mining methodology.
- TDSP (Team Data Science Process): Launched by Microsoft in 2016, TDSP defines 5 stages of the data science life cycle (Business Understanding, Data Acquisition & Understanding, Modeling, Deployment, and Customer Acceptance).
- KDD: Knowledge Discovery in Databases (KDD) is the general process of discovering knowledge in data through data mining, or the extraction of patterns and information from large datasets using machine learning, statistics, and database systems.
- Other: This was often a response such as a combination of Scrum and Kanban
- None: The data science team does not use a well-defined framework
Comparing to previous surveys
The last survey of which methodology teams use for data science projects was conducted by KDnuggets in 2014. KDnuggets also ran polls in 2010, 2007 and 2002. Comparing those earlier results with our most recent survey, one of the most interesting findings is that the percentage of people using CRISP-DM has not changed significantly during the past 20 years (see the full results in the chart below).
Note that our survey choices were similar to the options offered in the previous KDnuggets polls, except that we added two frameworks (Scrum and Kanban) and removed three options (Other, not domain-specific; KDD Process; My organization’s).
Note on poll respondents
People coming to our site (www.datascience-pm.com) are probably more interested in data science process, and in methodologies for executing data science projects, than the typical data science team. With this in mind, this poll is likely not a representative sample of the percentage of data science teams that use a methodology.
A more accurate overall industry rate was likely captured by a 2020 Corinium Execute survey, which reported that only 48% of data science organizations have established standardized processes.
Exploring Google Keywords
To validate our results, we entered keywords for data science frameworks (e.g. “crisp dm”) into Google’s Keyword Planner tool, which also suggested related keywords (e.g. “crisp dm data science” or “crispdm”) and provided their average monthly search volumes. We then summed these volumes for each general term.
Before summing, we removed clearly irrelevant searches like “tdsp electrical charges” and “semma botha aagatha”. We also focused on just the USA, as the search intent behind these terms becomes even more ambiguous internationally (e.g. “semma” is a hospital in the Dominican Republic).
Given that a searcher’s intent behind “crisp dm” will almost certainly be data science or data mining, while a search for “tdsp” or “semma” could be about an electric bill or a popular Indian movie, CRISP-DM probably has an even more commanding lead over the other search terms than what is shown in the chart.
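As a rough illustration of the aggregation described in this section, here is a minimal Python sketch. The search volumes are made-up placeholders, not the figures Keyword Planner returned, and the function name, the pattern table, and the substring matching rule are assumptions introduced only for this example.

```python
from collections import defaultdict

# Hypothetical (keyword, average monthly searches) rows.
# The volumes are placeholders for illustration, not real Keyword Planner data.
SUGGESTIONS = [
    ("crisp dm", 1000),
    ("crisp dm data science", 200),
    ("crispdm", 50),
    ("tdsp", 300),
    ("tdsp electrical charges", 400),
    ("semma", 500),
    ("semma botha aagatha", 600),
]

# Searches judged clearly irrelevant to data science and removed by hand.
IRRELEVANT = {"tdsp electrical charges", "semma botha aagatha"}

# Assumed mapping from each general term to substrings that identify
# its related keywords.
FRAMEWORK_PATTERNS = {
    "CRISP-DM": ("crisp dm", "crispdm"),
    "TDSP": ("tdsp",),
    "SEMMA": ("semma",),
}


def total_volume_by_framework(rows, irrelevant, patterns):
    """Sum monthly search volume per framework, skipping irrelevant keywords."""
    totals = defaultdict(int)
    for keyword, volume in rows:
        key = keyword.lower().strip()
        if key in irrelevant:
            continue
        # Credit the volume to the first framework whose pattern matches.
        for framework, needles in patterns.items():
            if any(needle in key for needle in needles):
                totals[framework] += volume
                break
    return dict(totals)


totals = total_volume_by_framework(SUGGESTIONS, IRRELEVANT, FRAMEWORK_PATTERNS)
# Print frameworks from highest to lowest total volume.
for framework, volume in sorted(totals.items(), key=lambda item: -item[1]):
    print(framework, volume)
```

The exclusion list here is maintained by hand, which mirrors the manual judgment about which searches are irrelevant. Ambiguous terms such as plain “tdsp” or “semma” still count in full, so their totals are best read as upper bounds.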
CRISP-DM comes out as the most popular
The Google search volumes are consistent with our poll: CRISP-DM is the most popular framework, with Scrum and Kanban being two other popular frameworks.
Note that CRISP-DM is more focused on the data science workflow (the steps to execute a data science project), whereas Scrum and Kanban are more focused on how a team collaborates.
Hence, combining CRISP-DM with Scrum (or CRISP-DM with Kanban) is possible and, indeed, often occurs.
Implications for teams
Does this mean that every team should start (or continue) to use CRISP-DM for their data science projects? Not necessarily – our work with data science teams suggests that:
- CRISP-DM does not fully address some of the most important team execution challenges (e.g., when to loop back, how to prioritize efforts)
- Most teams do not fully follow CRISP-DM
- Teams are not aware of alternatives to CRISP-DM
- Many teams that have tried to use Scrum have struggled to use it effectively