What do flight attendants, surgical teams, and successful data science project managers have in common?
They all use a checklist.
Why? Although every surgery, project, and flight is different, each has a repeatable pattern. And checklists can help remind us of each step and major consideration.
Don’t reinvent the wheel. Start with a base data science project checklist or charter template and customize from there.
Data Science Project Checklist Overview
Review the Checklist Structure
This data science project checklist is organized in three sections:
- Define a clear problem/opportunity statement
- Define a solution
- Define the approach
First, starting with a clear Why defines the core objective. The solution comes second, as the set of deliverables that map to the market or stakeholder needs. Finally, the implementation plan sets up the resources, roadmap, and various considerations that will help you accomplish the project goal.
Customize It
Every organization, team, and project is different and therefore has a separate set of planning considerations. Thus, don’t use this checklist as-is. Rather, treat it as a template with some items that should always be included (like “what are we trying to solve”) and others that are circumstantial (like SLAs). Meanwhile, ask yourself “what else should be included” beyond this base checklist.
Be Flexible
You could define everything up-front or just wing it. But really, you should land somewhere in between these two extremes. Fully defining every detail in a checklist before working on the solution means you’ll never start. Yet, the other extreme doesn’t scale, and you’ll likely hit avoidable roadblocks or wind up delivering something that doesn’t actually solve anything.
Thus, develop a checklist that is somewhat standardized but is still flexible. The goal isn’t to fill out the checklist itself but rather to build a sustainable process.
1. Define a Clear Problem or Opportunity
Define the Problem
How often do we start a project, just to realize that we are fundamentally trying to solve the wrong thing? Resist the urge to simply start modeling. Rather, take advice from mathematician John Tukey…
In other words, define the right question first and get to an appropriate answer. You can always iterate on the approximate answer with the right goal over time. But it’s much more difficult to recover a project if it is locked into the wrong target. This leads to the first checklist items:
- What is the underlying problem we are trying to solve? / What is the underlying opportunity that we are trying to develop?
- This is usually different from the presenting problem. True needs usually differ from the presenting asks.
- List out the high-level questions that the team will investigate during the project.
Set the Goals
How do you know whether your intended product delivers value? Consider…
- How will success be measured? This metrics post details specific suggestions but broadly there are business, model, and system metrics.
- Business impact metrics – Typically these are already existing metrics like financial metrics, subscriber churn, or mean-time-to-failure. Of all the metrics, these will be most important as they are the closest measure of true value.
- Data science metrics – Might not need to be defined upfront but identify some possible model performance metrics such as specificity, sensitivity, or lift goals.
- System metrics – Might be covered in the SLA section below.
- Are there specific goals?
- Be specific if you can. Example: “Decrease Gold Star voluntary 3-month unsubscription rate from 25% to under 20% by the end of the year”
- However, this can be tricky as you don’t want to pigeonhole yourself into specific goals that might not be achievable. As an alternative, you could look into something like the break-even point (e.g. the project will pay for itself if it decreases the Gold Star voluntary unsubscription rate by 5 percentage points).
- What should the team not do?
- What are the anti-goals? (i.e. things you want to specifically avoid doing)
- What items are beyond scope? (often this list is a good starting point for follow-up projects)
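The break-even framing mentioned above comes down to a little arithmetic. The sketch below is a minimal illustration, not a costing method: the function name, the project cost, the subscriber count, and the value per retained subscriber are all hypothetical placeholders chosen for the example.

```python
def break_even_reduction_pp(project_cost, subscribers, value_per_retained_subscriber):
    """Return the churn reduction (in percentage points) at which the project pays for itself.

    All three inputs are hypothetical planning figures supplied by the caller.
    """
    if project_cost < 0:
        raise ValueError("project_cost must not be negative")
    if subscribers <= 0 or value_per_retained_subscriber <= 0:
        raise ValueError("subscribers and value_per_retained_subscriber must be positive")
    # Each percentage point of churn avoided retains 1% of the subscriber base.
    value_per_point = (subscribers / 100) * value_per_retained_subscriber
    return project_cost / value_per_point


# Hypothetical inputs: a project costing 500,000, 100,000 Gold Star subscribers,
# and 100 of value for every subscriber retained.
required = break_even_reduction_pp(500_000, 100_000, 100)
print(f"Break-even churn reduction: {required:.1f} percentage points")
```

With these made-up inputs the break-even point is 5 percentage points. Framing the goal this way lets the team state what the project must achieve to be worthwhile without promising a specific target.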
List the Assumptions and Hypotheses
Assumptions are what you take as given and generally beyond your control (e.g. We assume that the business will maintain the Platinum-Gold-Silver status structure through at least 2026). Hypotheses are the specific assumptions that you will test as part of the project (e.g. Customers who sign up for the Platinum 2-year subscription are less likely to churn).
- What do you assume to be true (and beyond the control of the project)?
- What do you hypothesize to be true (and will test as part of the project)?
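To show what "testing a hypothesis" might look like in practice, here is a minimal sketch of a one-sided two-proportion z-test applied to the Platinum churn example. The choice of test is an assumption (the checklist does not prescribe one), and the counts are hypothetical illustrations, not real results.

```python
import math


def two_proportion_z_test(churned_a, total_a, churned_b, total_b):
    """One-sided z-test of whether group A churns at a lower rate than group B.

    Returns (z, p_value), where a small p_value supports "A churns less than B".
    """
    p_a = churned_a / total_a
    p_b = churned_b / total_b
    pooled = (churned_a + churned_b) / (total_a + total_b)
    if pooled in (0, 1):
        raise ValueError("churn must vary within the combined sample")
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (p_a - p_b) / se
    # Standard normal CDF evaluated at z gives the lower-tail p-value.
    p_value = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return z, p_value


# Hypothetical counts: churned customers out of all customers in each group.
z, p = two_proportion_z_test(churned_a=120, total_a=1000,   # Platinum 2-year
                             churned_b=200, total_b=1000)   # all other plans
print(f"z = {z:.2f}, one-sided p-value = {p:.2g}")
```

An assumption, by contrast, would simply be written down and monitored rather than tested this way.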
Identify the Stakeholders
A stakeholder is anyone who is impacted by the project. Think broad. Broader societal stakeholders might not immediately be obvious. And look inward – Some projects might stop dead if you don’t get buy-in from legal, networking, or security shared service teams.
So, identify the stakeholders and include the key ones in the relevant problem definition and planning phases. Some stakeholders, like the project champion, should be involved in much of the project planning. Meanwhile, others like security might not care about the project goals but require that you abide by their security protocols.
- Who are the stakeholders?
- What are their needs?
- What is their authority over the project?
- How receptive are they to the project?
- Should you train them on the data science life cycle or agile practices?
2. Define the Solution
Stay focused on the core needs (defined above) and map possible solution deliverables that fill these needs.
Set Up the Backlog
Start by defining small, incremental deliverables that help 1) validate whether the intended solution will solve the underlying business needs and 2) uncover whether/how the problem can be solved.
To serve both the market validation and technical validation, an effective product manager will define a product backlog, which is a fungible “wish list” of deliverables. Avoid the horizontal waterfall-style CRISP-DM phases and instead focus more on slim vertical agile deliverables (Related: Vertical v Horizontal Slicing).
Note that the end project goals should not shift often but the path to get there will. Thus, among all the items in this broader checklist, the product backlog typically requires the most ongoing refinement. It’s not static but rather should change with each new lesson learned. A good practice for the checklist is to be general and cross-reference a detailed backlog view such as one maintained in Jira, Rally, or Trello.
- What is the core deliverable?
- This is typically the output of a machine learning model
- How does this deliverable solve the underlying problem?
- How can you slice the deliverable into small units of value?
- Often these flow through a life cycle such as:
- Data reporting/business intelligence dashboard
- Proof of concept model
- A / B tests
- Beta deployment
- Production deployment
- Enhancements
- How does each deliverable help guide the product definition?
- Stack rank the most promising deliverables. Often focus on the most valuable deliverables relative to their effort.
- Combine a logical set of the most promising deliverables to define the Minimal Viable Product (i.e. the first major release).
- Avoid dependencies among deliverables. But if unavoidable, note these dependencies.
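Stack ranking by value relative to effort can be sketched in a few lines. This is one simple way to do it, assuming each deliverable has been given rough relative scores; the `Deliverable` class, the 1-10 scales, and the scores themselves are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Deliverable:
    name: str
    value: float   # relative business value (hypothetical 1-10 scale)
    effort: float  # relative effort (hypothetical 1-10 scale)

    @property
    def value_per_effort(self) -> float:
        return self.value / self.effort


def stack_rank(deliverables):
    """Order deliverables from the highest to the lowest value-to-effort ratio."""
    return sorted(deliverables, key=lambda d: d.value_per_effort, reverse=True)


# Hypothetical scores for deliverable types named in the life cycle above.
backlog = [
    Deliverable("Production deployment", value=9, effort=8),
    Deliverable("BI dashboard", value=5, effort=2),
    Deliverable("Proof of concept model", value=8, effort=4),
]

for rank, item in enumerate(stack_rank(backlog), start=1):
    print(rank, item.name, round(item.value_per_effort, 2))
```

A ratio like this is only a starting point for discussion; dependencies among deliverables may still force a different order.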
Define Consumption and Deployment
Typically, this work falls into later project phases. However, understanding these early in the project helps the team understand any necessary constraints for the model and its output.
Note the nuance between deployment and consumption. A basic rule of communication is that just because something was communicated or deployed does not mean that it was received or understood.
An overall business or systems process flow can help guide a deeper understanding of the model output consumption and overall value stream.
- How will the output be consumed?
- Human consumption vs into another system
- If for human consumption, how will the user interact with the model output?
- How will the deliverable be deployed?
- Real-time vs batch vs ad hoc?
- APIs needed?
- Available in the cloud vs on-prem?
Define Operations
A stand-alone one-off model does not scale. As such, consider all the machine learning operational systems needed to deliver sustainable value over time. For a deeper dive, look into the ML Operations post or the Operations section of the Life Cycle post.
- What are some supporting deliverables?
- Data monitoring system
- Model monitoring system
- IT monitoring systems
- What are the service level agreements?
- Typically detailed for contractual or mass-market products. For a non-critical, internal deliverable, SLAs might not be required.
3. Define the Approach
Now we dive into what most people think of in a project plan checklist. The biggest warning is not to get pulled into the trap of trying to define too much upfront. Rather, plan what is needed upfront and adjust based on the project progression. The level of upfront planning is circumstantial based on organizational culture, project size and complexity, and project risks.
Define Resource Plan
This section generally applies to most types of projects. Yet, the most unique aspect of data science projects is the emphasis on data planning. Indeed, a data science project manager needs to also manage the collection, privacy, and accuracy of the data. Without this focus, the project could fall into garbage-in-garbage-out or into sticky legal or ethical issues for data misuse.
Resource needs
- Staff
- Internal – individuals or more general skillsets
- Contractors or partner staff needed
- Hiring needs
- Anticipated hours or length of effort needed
- Broader team members such as Security or Networking or Legal
- Communication – points of contact, setting up shared drives, chat groups, meeting cadences, etc.
- Training
- For the project team
- For the stakeholders
- Technology
- Licenses vs open-source
- Cloud vs on-prem services needed
- Requests needed such as new accounts or environments
- Data
- What data do you need?
- Do you already have the data?
- Can you start collecting the data?
- Can you purchase the data?
- What metadata do you need (e.g. does the data need to be labeled?)
- Data privacy and environment
- Classification: PII, CPNI, HIPAA, etc
- Legal/ethical use: Can you legally or ethically use the intended data? What data can you not use?
- How clean is this data?
- Unless your team has already used the intended data source, you will need exploratory data analysis before you can judge how clean the data is.
- Data engineering
- What is the data Volume (size)?
- What is the data Variety (formats)?
- What is the data Velocity (rate of influx)?
- How can we process and store this data?
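A first pass at the "how clean is this data?" question can be as simple as counting missing values and duplicate rows. The sketch below uses only the Python standard library; the `profile_rows` helper, the column names, and the sample records are hypothetical, and a real exploratory analysis would go well beyond these two checks.

```python
from collections import Counter


def profile_rows(rows, columns):
    """Report the share of missing values per column and the number of duplicate rows."""
    if not rows:
        raise ValueError("rows must not be empty")
    total = len(rows)
    # Treat None and empty strings as missing; a column absent from a row also counts.
    missing = {
        col: sum(1 for row in rows if row.get(col) in (None, "")) / total
        for col in columns
    }
    # Count rows that exactly repeat an earlier row across the chosen columns.
    counts = Counter(tuple(row.get(col) for col in columns) for row in rows)
    duplicates = sum(n - 1 for n in counts.values())
    return {"rows": total, "missing_share": missing, "duplicate_rows": duplicates}


# Hypothetical sample records; a real check would read from the actual source.
sample = [
    {"customer_id": 1, "tier": "Gold", "tenure_months": 12},
    {"customer_id": 2, "tier": "", "tenure_months": 3},
    {"customer_id": 2, "tier": "", "tenure_months": 3},
    {"customer_id": 3, "tier": "Silver", "tenure_months": None},
]

print(profile_rows(sample, ["customer_id", "tier", "tenure_months"]))
```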
Investigate What’s Been Done
Don’t start from scratch unless you need to. Look at what has been done in the past that you can leverage to jumpstart your project. Even past failed projects might have created some useful artifacts or lessons learned.
- Prior artifacts
- What is the organization currently doing (if anything) to solve the underlying problem?
- Has my organization attempted something similar in the past?
- Have other organizations or researchers tried this?
- What were the results?
- What can we learn from these prior projects?
- What can we use from these prior projects?
Set up Project Roadmap
Timelines are tricky for any project – especially for research endeavors such as data science projects. Why? Because you cannot schedule insights. You can’t ask a data scientist to ensure the model converges in the third sprint.
There are two extremes for handling time uncertainty. One is to plan out all of the timelines, often with extensive documentation and Gantt charts. The other is to just accept that timelines are not realistic and to ignore this topic altogether.
The first is dangerous as it often communicates a false sense of certainty and short-circuits the data scientists’ need for research and experimentation. Yet, the latter might not be accepted by stakeholders.
How can you balance this? Communicate what is known and be honest about the rest.
For example, you might not know the date of a deliverable, but maybe you know that data collection will take four weeks, the data analysis will take a week, and the modeling phase typically takes anywhere from one to three months.
Or, even if you don’t know the timeline, you can probably communicate some of the logical next steps. For example: we’ll first explore existing data sets. Based on this assessment, we might need to start capturing new data or purchase new data. Eventually from these, we’ll engineer a new data set and proceed to modeling, etc.
A project roadmap is a great way to communicate the logical project flow.
- Roadmap and Timings
- What is the sequence of possible deliverables?
- What are the go/no-go decisions to continue the project?
- What are the external influencers on the timelines? (e.g. an app launch date or a marketing campaign)
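The partial-timeline example in this section (four weeks of data collection, one week of analysis, one to three months of modeling) can be communicated as a range rather than a single date. The sketch below is a rough illustration; it assumes the phases run one after another and uses a hypothetical four-weeks-per-month conversion.

```python
WEEKS_PER_MONTH = 4  # rough planning assumption, not a calendar-exact conversion

# (phase, minimum weeks, maximum weeks), taken from the example in the text
phases = [
    ("Data collection", 4, 4),
    ("Data analysis", 1, 1),
    ("Modeling", 1 * WEEKS_PER_MONTH, 3 * WEEKS_PER_MONTH),
]


def total_range(phases):
    """Sum best-case and worst-case durations, assuming the phases run sequentially."""
    low = sum(minimum for _, minimum, _ in phases)
    high = sum(maximum for _, _, maximum in phases)
    return low, high


low, high = total_range(phases)
print(f"Estimated duration: {low} to {high} weeks")
```

Reporting the range keeps the known parts of the schedule visible while staying honest that the modeling phase carries most of the uncertainty.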
Assess Risks
In addition to the “standard” set of risks that a software project manager might face, a data science project manager has to manage some unique risks – we’ll discuss just the data, modeling, and ethics risks but the checklist has additional considerations.
Data: Data scientists are frequently thrown a data set that is “known to be clean”, only to uncover fundamental data flaws that compromise modeling efforts.
Modeling: Generally a good software engineer can tell you before project execution whether or not the request is feasible (“Yes I can build this widget” or “No, that’s not possible”). However, even a tenured data scientist might not know whether they can tease out enough signal from noise in the data.
Ethics: Software issues tend to be more binary. For example, you know whether to accept a customer’s email field input as valid or not by checking to see if they respond from the email auto-notification verification. Items like ethnicity or gender don’t even apply there. But how do you know whether that same customer’s request for a loan should be approved or not? Even if you don’t collect information regarding ethnicity or gender, discriminatory biases could make their way into the model.
- What are the risks? For each risk, qualify the likelihood and impact. Set mitigation plans for anything of mid/high risk. Some categories are:
- Security (e.g. how do I ensure the data or model are not compromised?)
- Legal (e.g. will data privacy law changes impact our ability to use the intended data?)
- Ethics (see the Data Science Ethics post)
- Business (e.g. market changes, stakeholder turn-over)
- Resources (e.g. team member turn-over, shifts in funding allocation)
- Technical (e.g. data, modeling, computing availability)
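Qualifying each risk by likelihood and impact can be kept lightweight. The sketch below is one possible convention, not a standard: the three-level scale, the score thresholds, and the register entries are hypothetical.

```python
LEVELS = {"low": 1, "mid": 2, "high": 3}


def risk_rating(likelihood, impact):
    """Combine likelihood and impact into a low/mid/high rating.

    The score thresholds are an illustrative convention, not a standard.
    """
    score = LEVELS[likelihood] * LEVELS[impact]
    if score >= 6:
        return "high"
    if score >= 3:
        return "mid"
    return "low"


def needs_mitigation(risks):
    """Return the risks rated mid or high, which call for a mitigation plan."""
    return [r for r in risks if risk_rating(r["likelihood"], r["impact"]) != "low"]


# Hypothetical entries using the categories listed above.
register = [
    {"category": "Legal", "risk": "Privacy law change limits data use",
     "likelihood": "low", "impact": "high"},
    {"category": "Technical", "risk": "Insufficient signal in the data",
     "likelihood": "mid", "impact": "high"},
    {"category": "Resources", "risk": "Team member turn-over",
     "likelihood": "low", "impact": "mid"},
]

for r in needs_mitigation(register):
    print(r["category"], "-", r["risk"], "->", risk_rating(r["likelihood"], r["impact"]))
```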
Learn More
Planning a data science project is a daunting task and there is no “magic bullet” guarantee for success. But this data science project checklist or the related data science project charter can help your odds.
To explore further project planning resources, check out:
- Microsoft’s Team Data Science Process (TDSP) Project Charter
- The various deliverables defined in the Business Understanding phase of CRISP-DM
- The Essential Machine Learning Project Checklist from Patrick Guzman focuses on machine learning project execution steps.