Similarities of Data Science and Software Engineering Projects
In many ways, data science looks like software engineering.
Both require significant coding to address an underlying business problem or opportunity, which typically requires frequent stakeholder interaction. Furthermore, when a production data science model is required, just as for traditional software systems, there is a requirement to include the appropriate documentation, automated testing, and production support for the data pipelines.
Given this similarity, it is no surprise that many think the strengths and weaknesses of the different software development methodologies also apply to data science projects. Indeed, many senior project managers assume that data science teams can be managed with the same process that their software development teams have been using successfully.
The Challenges of Managing Data Science Projects like Software Development Projects
However, data science differs from software development in some key ways. For example, data science focuses on exploration and discovery (such as “finding insight in the data”), while software engineering typically focuses on implementing a solution that addresses a specific requirement. While both can have ambiguous requirements, data science requirements tend to be more ambiguous due to the exploratory nature of the task, while software project requirements are often ambiguous due to unclear customer needs.
Other differences include the need to manage client expectations (since, for example, an analysis might generate no insight), the possibility that the most useful insight has not yet been identified within the dataset, the difficulty of defining a specific schedule (e.g., due to the typically open-ended nature of the questions being addressed), the difficulty of validating the results of a data science analysis, and, finally, the challenge of identifying additional data that might (or might not) prove useful, which can often only be known after the data has been acquired and analyzed.
You will likely encounter the challenges these differences cause if you try to manage a data science project as a software engineering project. Consequently, applying a project management methodology designed for software engineering to data science will likely fall short of expectations. Below we elaborate on some of the key differences between software engineering and data science.
Generally, you know upfront whether a software engineering project is feasible. Yet, “in data science, if a customer has a wish, even an experienced data scientist may not know whether it’s possible” (Godsey, 2017). Rather, the data science journey uses research, proofs of concept, and trial and error to map the problem to the solution space; in some projects, the question of solvability may not be answered until the late project phases. A project management environment that does not understand and respect this ambiguity might push data scientists into premature commitments that they cannot meet.
When is a data science project complete? Because a model is rarely 100% accurate and changes over time, it is never truly complete. Most data science outcomes leave significant room for improvement, but a perfectionist mindset can prevent project closure and tie up resources that are better spent elsewhere. In fact, it is often not clear when the data science team has reached its most effective model. As such, a common challenge in a data science project is knowing when the team has delivered a “good enough” solution and further analysis is not worth the effort. Software teams counter such issues by agreeing to a definitive list of requirements that must be checked off at the end of the project or each iteration. Yet this is not feasible in many data science projects: if a team struggles to identify whether a problem is solvable, how can it agree to metric-related acceptance criteria?
Much of the data science process involves tasks like data cleaning, exploratory data analysis, and experimentation that have unknown scope and complexity. Consequently, data scientists struggle to estimate turnaround time for many deliverables. Russell Jurney in Agile Data Science 2.0 argues that “Even relatively new fields such as software engineering, where estimates are often off by 100% or more are more certain than the scientific process”. Likewise, Tyler Foxworthy, CEO and Chief Scientist at Vertex Intelligence, explains that “You can’t put a time box on [data science work] because you can’t schedule insight” (Foxworthy, 2017). Shoehorning exploratory tasks into the strict time-boxed methodologies popular in software engineering might therefore short-circuit the creative analytical process and frustrate data scientists. It also leads to inaccurate project schedules.
Knowing if Your Solution Is Working
Functionality tends to be binary in software. As explained by Mark Clerkin, Data Scientist in Residence at High Alpha, “In software, I can tell if it’s functioning. It either works or it doesn’t.” But in data science, “How do you know you have the right answer?” (Clerkin, 2017). The answer might not be obvious, or even knowable, but reaching it will likely involve statistical significance tests, A/B testing, and domain expertise, which are more common in research and development than in software engineering.
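To make the contrast concrete, here is a minimal sketch of one kind of significance test that an A/B comparison might use: a two-proportion z-test, written with only the Python standard library. The function name, the choice of test, and the conversion counts are all hypothetical illustrations, not something taken from the sources quoted above. A real project would pick a test that suits its data and design.

```python
import math


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Return (z, two-sided p-value) for H0: both variants share one success rate.

    Hypothetical helper for illustration; uses the pooled-variance
    normal approximation, which assumes reasonably large samples.
    """
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("standard error is zero; the test is undefined")
    z = (p_b - p_a) / se
    # Two-sided tail probability of a standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Made-up counts: 200 of 2000 conversions for the current model (A),
# 260 of 2000 for the candidate model (B).
z, p = two_proportion_z_test(200, 2000, 260, 2000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

Even when the p-value is small, the result is a statement about probability, not a pass/fail check. Someone with domain expertise still has to judge whether the difference is large enough to matter.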
Tracking Project Progress
Traditional software project management measures progress by comparing completed to planned work, and agile teams measure progress with each incremental product delivery. But neither concept correlates well with data science. The plan of nearly every data science project changes continually throughout the project lifecycle, which undermines the value of variance analysis. Measuring incremental product delivery makes more sense in some data science situations; however, true progress toward data science solutions often comes without product delivery. Rather, it might come through activities like reading journal articles, experimenting with a new library package, or even delivering a failed model that becomes the “genesis of the next breakthrough” (Domino Data Lab, 2018).
Planning for Post-Deployment
While software tends to behave predictably (nobody needs to “retrain” software), data science models “change as the world changes around them” (Domino Data Lab, 2018). A team might be ill-equipped to manage unexpected model performance after deployment if it does not plan for this during the project phases.
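One way a team might plan for this is to monitor a deployed model's recent accuracy against the level measured at release. The sketch below is a simplified illustration under stated assumptions: the class name, the tolerance, and the window size are hypothetical, and it assumes that true outcomes eventually become available for each prediction. That assumption does not hold in every project.

```python
from collections import deque


class AccuracyMonitor:
    """Flag a deployed model whose recent accuracy falls below its baseline.

    Hypothetical helper: `baseline_accuracy` is whatever the team measured
    at release; `tolerance` and `window` are placeholder values.
    """

    def __init__(self, baseline_accuracy, tolerance=0.05, window=100):
        self.baseline = baseline_accuracy
        self.tolerance = tolerance
        # Keep only the most recent `window` outcomes.
        self.outcomes = deque(maxlen=window)

    def record(self, prediction, actual):
        """Store whether one prediction matched the observed outcome."""
        self.outcomes.append(prediction == actual)

    def rolling_accuracy(self):
        """Accuracy over the stored window, or None if nothing is recorded."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_attention(self):
        """True once the window is full and accuracy has dropped too far."""
        window_full = len(self.outcomes) == self.outcomes.maxlen
        return window_full and (
            self.rolling_accuracy() < self.baseline - self.tolerance
        )


monitor = AccuracyMonitor(baseline_accuracy=0.90, tolerance=0.05, window=10)
for _ in range(10):
    monitor.record(prediction=1, actual=1)  # model still matches reality
print(monitor.needs_attention())
for _ in range(10):
    monitor.record(prediction=1, actual=0)  # the world has changed
print(monitor.needs_attention())
```

A flag like this only signals that something has shifted. Deciding whether to retrain, collect new data, or retire the model remains a project-level judgment.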
We’ve explored the high failure rate for data science projects and the shortcomings of prevailing attempts to manage them through ad hoc or software engineering project management practices. Yet our outlook remains positive, as organizations are beginning to respect data science project management as a distinct subject that is critical for the maturation of the data science field.