Agile Data Science

There are three key concepts that should be followed within an agile data science effort: use iterations, keep each iteration as small as possible, and get feedback on each iteration.

In other words, while there are several alternative data science workflow frameworks (sometimes known as data science life cycle frameworks), to achieve agility, teams should execute a data science project by:

  1. Using iterations: Iterations are a foundational element of many agile frameworks, and an iterative approach helps to achieve agility. In data science, an iteration can be thought of as an experiment.
  2. Keeping each iteration as small as possible: Each iteration should yield insight, even if that insight is simply that a certain variable does not help generate actionable insight. This insight should then be used to help define and prioritize future tasks.
  3. Getting feedback on each iteration: One of the keys to prioritizing future work is to discuss the results of completed work with the user/client. In this way, discussions about what might be useful actionable insight can help drive the definition and prioritization of future experiments.

Furthermore, many believe that teams should use an agile approach for data science. For example, Gartner insists that “work that is more exploratory, is less known and demands quick results […] demands agile”. By its nature, data science is an ambiguous, non-linear process that tends to lack clear up-front understanding and requirements. In a discussion on the data science process, John Akred explains that data science needs “management techniques that accommodate and foster and help the non-linear processes succeed instead of attempt to force them into linearity”.

Taking a slightly different perspective, management consultant Daniel Mezick suggests that, rather than asking whether agile works for data science, a better question is: “Are you trying to deliver continuously or very frequently?” If yes, then agile makes sense for your project. If not, then you could still benefit from certain aspects of agile, but not necessarily from the entirety of an agile framework. Pressman & Maxim pose the question about agility for data science slightly differently, positing: “No one is against agility. The real question is: What is the best way to achieve it?”

So, while one recent study found that somewhere between 25% and 50% of data science teams currently use an agile approach, this percentage will likely increase in the future, as many have noted the importance of agility when doing a data science project (e.g., Vamsi Nellutla, Darío Martínez, and Victor Borda).

Agile Manifesto

The 2001 Agile Manifesto, shown below, provides the foundation for agile practitioners. In short, using an agile data science approach does not define what should be done, just how it should be done. This is in contrast to an approach such as CRISP-DM, which focuses on the steps/phases of a data science project (but not on how the team should iterate and get feedback during the project). Furthermore, as discussed below, there are many ways to achieve an agile data science process.

“We are uncovering better ways of developing
software by doing it and helping others do it.
Through this work we have come to value:

Individuals and interactions over processes and tools
Working software over comprehensive documentation
Customer collaboration over contract negotiation
Responding to change over following a plan

That is, while there is value in the items on the right, we value the items on the left more”

Approaches to Achieve Agile Data Science

Unfortunately, it is often not clear how a data science team should “do” agile. In other words, what process framework should a team use to achieve “agile data science”? One approach that has been suggested is to use Scrum. In fact, Silicon Valley Data Science, which was a data science consultancy (all of its key data scientists were hired by Apple), noted the use of a scrum-like process. In addition, Microsoft’s Team Data Science Process (TDSP) incorporates aspects of Scrum (e.g., sprints, a product backlog) in a complete framework (e.g., a hardware/software stack for doing data science).
 
While these discussions did not mention challenges that might arise when using Scrum in a data science context, many others have found that Scrum is problematic in that context. For example, as noted by Eugene Yan, it is hard to estimate task duration (which makes doing a sprint difficult), and tasks or their priorities might change as insight starts to be generated (which might mean part of a sprint should be canceled). Jeffrey Humpherys specifically described this situation, noting that sprint deadlines make it difficult to explore the data and investigate potential new ideas.
 
Others, such as Chang Lee and Edwin Thoen, have observed these challenges with Scrum and suggested that Kanban (which is task-focused but does not have time-boxed sprints) might be more appropriate. This point of view was further supported by an experiment in which some data science teams used Scrum and others used Kanban. The results of the experiment and the observations of many practicing data scientists suggest that time-boxed sprints are a key challenge when using Scrum. This was recently summed up by Yonatan Hadar, who stated that “data science research is hard, how can you give a time estimation when you are not sure that your problem is solvable?”
 
Hence, others (e.g., Ori Cohen and Shay Palachy) have noted the need for agility within a data science project, but also the need to work through the phases of a project (like the phases described by CRISP-DM). Similarly, Elder Research describes its agile data science framework as leveraging the CRISP-DM approach in an iterative fashion.
 
So, while there is no commonly agreed-upon set of phases or steps, there is agreement on the need for agility as well as on thinking through the steps required within a project. It is also now possible to take certified data science process training, or even to become a certified data science team lead.
 

Agile Data Science Benefits

Agile approaches can provide numerous benefits for data science teams (Pressman & Maxim, 2015; Sutherland & Schwaber, 2017; Kent et al., 2001):

  • More Relevant Features: By defining requirements just before development, the features are more likely to meet the most current needs.
  • Quicker Delivery of Customer Value: By delivering incremental product features, users gain value before the project’s completion.
  • More Realistic Feedback: By soliciting feedback on the functional product, the agile team can more accurately assess whether their deliverables are of value and adjust future deliverables according to that feedback.
  • Cut Losses from Building Wrong Features: If stakeholders provide feedback that a product feature is no longer useful, agile teams can learn this sooner, cut their losses, and divert efforts elsewhere.
  • Cut Losses from Infeasible Features: Likewise, if data scientists are tasked with upfront discovery and analysis instead of developing an entire model, they are more likely to realize if they are working toward a dead-end that is not technically feasible.
  • Improved Communication: Most agile approaches promote close coordination and communication among team members and with stakeholders.
 

Agile Data Science Weaknesses

  • Less straightforward than waterfall: Agile can lead to process confusion and poor implementation. Without defined cost estimates and timelines, teams might struggle to justify agile projects to executive sponsors who want to know how much they need to invest and when the product will be delivered.
  • Organizational communication: Communication with the broader organization can be challenging, as others may not be used to agile, may not accept it, or may be confused by it. Not surprisingly, 70% of agile practitioners in the State of Scrum Report reported tension between their teams and the rest of the organization.
  • “Made for software”: A common complaint from interviewees is that most existing agile approaches were designed by the software industry for software projects. While often considered similar to software engineering, data science has its own unique challenges that existing agile approaches may not address. Applying existing approaches might inhibit the exploratory nature of data science, lack the rigor to deal with the messiness of big data, rely on software engineering testing techniques that are not suitable for data science, and fail to help evaluate whether the results are “good enough” to make a difference.
  • Perceived planning issues: Poor project planning was another common complaint from the interviews we conducted, as several practitioners believe that agile skips over planning and that a more rigid process for requirements gathering and definition is needed.
  • Not regulatory friendly: A specific complaint from a project manager of data-intensive projects at Eli Lilly, a large pharmaceutical company, is that agile testing practices are not practical for FDA regulatory compliance.

A Brief History of Agile

During the 1980s, in response to the shortcomings of Waterfall, some companies began to embrace speed, flexibility, and overlapping processes over rigid, linear, and distinct project phases (Takeuchi & Nonaka, 1986).

Since the 1990s, organizations, largely from the software industry, have formalized such approaches, focusing on rapid incremental delivery and continuous customer feedback rather than on linear plans that feature extensive documentation and detailed upfront planning. Not every agile approach follows all the principles equally, but approaches that generally follow the manifesto and principles are considered to be agile (Mezick, 2017).

Agile is widely adopted at technology companies and has become so important that the Project Management Institute, which focuses primarily on traditional project management, now also offers an Agile Certified Practitioner certification and added significant agile content to the Project Management Body of Knowledge in its September 2017 release.


References