In the previous part of this four-part post series, we looked at traditional ways to manage a data science project. To recap:
- Ad hoc approaches should only be considered for smaller, one-off projects.
- Waterfall approaches might be partially appropriate for specific use cases but generally should be avoided.
- CRISP-DM fits the data science project life cycle but lacks key considerations for a full-fledged project management approach.
Today we explore managing data science projects using two agile approaches — Scrum and Kanban.
“Agile is a method designed to help you solve problems for which you can’t anticipate the process you go through. It will be iterative – you may hit roadblocks you may need to steer around.”
-John Akred, (former) CTO at Silicon Valley Data Science
Frustrated with the prevailing rigid, excessively planned, and overly documented project methodologies, thought leaders — largely from the software industry — started defining alternative adaptive approaches to manage non-linear and non-deterministic products and projects. The core tenets of agile are inscribed in the Agile Manifesto (agilemanifesto.org):
Manifesto for Agile Software Development
We are uncovering better ways of developing
software by doing it and helping others do it.
Through this work we have come to value:
Individuals and interactions over processes and tools
Working software over comprehensive documentation
Customer collaboration over contract negotiation
Responding to change over following a plan
That is, while there is value in the items on the right, we value the items on the left more.
Contrary to a common misconception, agile is not just Scrum or any set methodology but rather is a philosophy that focuses on concepts such as rapid iteration, short feedback loops, self-organizing teams, and adapting to changing needs. Approaches that generally follow the Manifesto and its twelve principles are agile.
Scrum (n): A framework within which people can address complex adaptive problems, while productively and creatively delivering products of the highest possible value
–The Scrum Guide (2017)
“A lot of parts of agile (scrum) are great for data science but the part that concerns me is assigning time chunks or points to projects and then them not getting completed and you get buried in the […]”
– Mark Clerkin, (former) VP of Data Science at High Alpha
What is it?
Scrum is a highly collaborative agile framework that promotes transparency, inspection, and adaptation to manage complex work.
Scrum divides a larger body of work into a series of fixed-length sprints. These mini-project cycles kick off with a sprint planning meeting in which the product owner (the person who defines what needs to be built) explains the top priorities from the product backlog (the product wishlist). The development team (the people who deliver work increments) forecasts what it can deliver by the end of the sprint and plans how it will accomplish the sprint goal.
During a sprint, the team sets daily plans at daily standups. At the sprint’s end, they demonstrate potentially shippable product increments at a sprint review and make plans for team and process improvement at a sprint retrospective. Throughout the whole process, a scrum master acts as a servant leader to support the product owner, development team, and broader organization.
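The sprint cadence above can be sketched as a simple loop. This is a toy illustration, not a Scrum prescription: the item names, the two-week sprint length, and the fixed capacity are all assumptions chosen for the example.

```python
from datetime import date, timedelta

# Illustrative assumptions: a two-week sprint and a fixed forecast
# of two backlog items per sprint. Real teams forecast capacity
# fresh at every sprint planning meeting.
SPRINT_LENGTH = timedelta(days=14)
TEAM_CAPACITY = 2

def run_sprint(product_backlog, start_date):
    # Sprint planning: the development team forecasts what it can
    # deliver from the top of the prioritized product backlog.
    sprint_backlog = product_backlog[:TEAM_CAPACITY]
    remaining_backlog = product_backlog[TEAM_CAPACITY:]
    # (Daily standups would adjust the day-to-day plan here.)
    # Sprint review: completed items form the potentially shippable
    # increment; a retrospective then looks for process improvements.
    increment = sprint_backlog
    return increment, remaining_backlog, start_date + SPRINT_LENGTH
```

Running one cycle on a three-item backlog yields a two-item increment, one leftover item, and the next sprint's start date two weeks out.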
How common is Scrum for data science?
Scrum is the most common agile framework per the 14th Annual State of Agile Report (2020) and is implemented in a wide variety of industries, especially software. I’ve met several data science teams implementing Scrum, usually in a modified form. Google “data science scrum”, and you’ll also find that many data scientists use it — often unhappily.
Strengths
- Customer focused: Focuses on delivering value to end users, something that unfortunately gets sidetracked in many data science projects
- Adaptive and iterative: Does not assume that the solution is known upfront but allows the data scientists and product owner to discover the solution over the course of several sprints
- Empiricism: Emphasizes the role of proven evidence to improve processes and product delivery — something data scientists naturally respect
Weaknesses
- Often poorly implemented: Many data scientists are forced to use Scrum in ways that run counter to its underlying principles (e.g., not having a scrum master or having a functional manager who calls all the shots).
- Time-boxing: Many steps in the data science workflow require an unknown amount of time and do not line up into pre-defined time periods. Trying to force a time-box might short-circuit necessary experimentation.
- Potentially releasable increments: Data scientists might be making meaningful progress without being able to deliver potentially releasable work at the end of a sprint.
The bottom line
Proceed with caution. Some elements of Scrum do not fit nicely with the open-ended nature of data science work. If you do implement it, do it right: hire a dedicated scrum master, allow the team to self-organize, and provide training.
What is it?
Originating as a Japanese inventory management system, Kanban has been heavily adopted by software teams as a continuous workflow management system. Emphasizing the visual status of work, Kanban’s central artifact is a status board that shows where each work item (Kanban card) is in its life cycle to completion. Work-in-progress (WIP) limits help identify bottlenecks and reduce work in progress. Cycle time (how quickly a work item is completed) is a key metric Kanban teams try to minimize.
Relative to Scrum, Kanban is less prescriptive in that it does not define set sprint cadences, team roles, or meetings. Note that Kanban and Scrum are not mutually exclusive. In fact, Scrum development teams often manage their sprint backlog with a Kanban board.
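The mechanics above — columns, WIP limits, and cycle time — can be sketched in a few dozen lines. The column names and limits below are illustrative assumptions; Kanban itself prescribes neither.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List, Optional

@dataclass
class Card:
    """One work item on the board."""
    title: str
    started: Optional[datetime] = None
    finished: Optional[datetime] = None

    def cycle_time(self) -> Optional[timedelta]:
        # Cycle time: elapsed time from start of work to completion,
        # the key metric Kanban teams try to minimize.
        if self.started and self.finished:
            return self.finished - self.started
        return None

class KanbanBoard:
    def __init__(self, wip_limits: Dict[str, int]):
        # One list of cards per column; limits cap work in progress.
        self.wip_limits = wip_limits
        self.columns: Dict[str, List[Card]] = {c: [] for c in wip_limits}

    def add(self, card: Card, column: str) -> bool:
        if len(self.columns[column]) >= self.wip_limits[column]:
            return False  # column is full: a visible bottleneck
        self.columns[column].append(card)
        return True

    def move(self, card: Card, src: str, dst: str) -> bool:
        # A move is refused when the destination is at its WIP limit,
        # surfacing the bottleneck instead of silently absorbing work.
        if len(self.columns[dst]) >= self.wip_limits[dst]:
            return False
        self.columns[src].remove(card)
        self.columns[dst].append(card)
        return True
```

With a WIP limit of 2 on “in progress”, a third card cannot move in until one of the first two moves out — which is exactly how the limit makes a bottleneck visible.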
How common is Kanban for data science?
Strengths
- Flexible: Perhaps the most lightweight process (other than ad hoc)
- Improves workflow: The highly visual nature and WIP limits help identify bottlenecks and allow teams to improve cycle times.
- Easy to implement: No extensive training or team-role reorganization required
- Easier question: Data scientists are more receptive to answering “What will you do next?” than “What will you get done by the end of the sprint?”
Weaknesses
- Not full-fledged: While the lack of defined customer touch points, team roles, and meetings provides flexibility, it also forces teams to supplement Kanban with additional project management practices.
- Not deadline focused: Without the motivation to hit constantly approaching deadlines, work might drag on longer than needed.
- Questionable value add: One senior data scientist I interviewed complained that Kanban just didn’t improve outcomes — understandable, given the first two points.
- Unclear columns: Software teams often use status columns such as “to do”, “in progress”, “in test”, and “complete”, but it’s unclear which columns make the most sense for data science.
The bottom line
Kanban is a great project management approach both for mature data science teams that don’t need an extensive approach and for unstructured teams that want to “upgrade” from ad hoc. However, most teams will need additional practices to support the complexity of data science projects.