There’s no single data science project documentation recipe. Rather, your documentation needs will vary by project, team, organization, and industry.
And it’s not just about producing data science model documentation. Instead, think more broadly and ask: What do I need to document, and why?
Once you’ve thought this through and have goals in place, you can then set a repeatable plan for how to document a data science project.
Guiding Principles
Let’s start with three guiding principles.
1. Document with a purpose
Before you build out your documentation, ask:
- Who will consume this documentation?
- Why do they need this documentation?
- How would they like to consume documentation?
Think broadly and don’t take a “one-size-fits-all” approach. Rather, create the artifacts that best serve each stakeholder group’s needs.
Stakeholder | Primary Interest | Common Artifacts |
---|---|---|
ML and data professionals | How the model works; what data is used | Code comments, model cards, README files, user stories |
Software engineers | How the system runs; service level agreements | Code comments, runbooks, README files, user stories |
Business stakeholders, product owners | Use cases; business impact / ROI | Slide decks, user stories, product roadmaps, cost-benefit analyses |
End users | How to use the system | User guides |
Impacted individuals | Key decisions that impact me | Touchpoints such as emails or push notifications |
Regulators | Regulatory compliance; data privacy | Compliance audits |
Project team | How can we efficiently deliver the project | Project plans, user stories, design documents |
Security professionals | Data privacy; system security | System audits, data usage reports |
Quality assurance professionals | System reliability | Code comments, test use cases, user stories |
2. Prioritize deliverables over documentation
At first glance, this might seem like an odd best practice for effective documentation.
But think about it. Your goal isn’t to deliver world-class documentation. Rather, you aspire to deliver valuable insights and modeling systems that benefit your internal stakeholders, end users, and broader society.
As such, don’t let the documentation drag you down. Time constraints will crop up. When they do, don’t compromise the quality of your models and systems. It is, however, generally acceptable to let some documentation quality slip (particularly if you clean it up shortly thereafter).
Prioritizing deliverables is a key tenet of the Agile Manifesto
3. Keep it simple … but not too simple
Related to the above point, if you document more than what’s needed, you’re taking time away from your model. Indeed, as you read through each of these best practices, skip items that don’t meet your specific needs.
To accomplish this:
- Cut out the fat and remove any outdated or unneeded documentation.
- Avoid redundancy across documents whenever possible.
- Don’t document more than you have to.
- Avoid extensive upfront requirements gathering.
- Be flexible and update the documentation as you go through the life cycle of the project.
Yet on the other hand, if your documentation is too light, you’ll accrue technical debt in the form of systems you don’t know how to maintain. Inefficiencies may slow your pace, regulations and policies might be inadvertently overlooked, and chaos may ensue.
In short, find the right balance.
Data Science Project Documentation
Keeping these principles in mind, let’s move on to documenting your plan.
4. Start your project with a clear purpose
One of the most frustrating and wasteful endeavors is to develop something that no one really needs. And yet, we all fall victim to this sometimes (I know I have, at least).
To help mitigate this risk, start any project with a clear purpose. To accomplish this:
- Document the customer’s business objectives.
- Define how your data science project will meet their needs.
- Set a vision for your project or product so that you can steer the team in the right direction.
- Define clear evaluation metrics so that you can objectively determine whether the project was successful.
- Conduct a cost-benefit analysis to help determine project go/no-go and prioritization against other potential projects.
- Document what you are not looking to accomplish (i.e., what lies beyond your project scope).
5. Develop a sufficient upfront project plan
The project plan encompasses many of the previously stated items such as the vision and purpose. It will also be more comprehensive, defining items such as…
- Resources – Staffing, computing requirements, cloud services, software, etc.
- Milestones – Projected deliverables and their approximate delivery dates
- Budget – How much effort will the project require? / How much funding can the project team spend?
- Risks and contingencies – What could go wrong? How will you mitigate these potential risks?
Don’t go overboard but plan enough upfront so that you can execute your project more efficiently. Generally, you should start with a solid understanding of the initial project work and desired end goal. Define key dependencies and risks that fall in between. Most of the middle of your project plan can just be placeholders that you’ll update as you proceed.
Also, don’t fall in love with your plan. It will be wrong. Accept that and be ready to flex the plan throughout the project to meet the evolving realities you encounter.
The project plan should scale with the size and complexity of a project. Even for a small, simple project, jotting down some basic considerations and a process list can help you conceptualize your approach. Meanwhile, larger endeavors should have more comprehensive plans.
A related key artifact is a project or product roadmap. This maps out how each of your intended deliverables will evolve into and fulfill your desired product vision. A good roadmap is lightweight and fits nicely on a slide or webpage.
A data science roadmap example that you can customize and re-use
An accompanying artifact to the product roadmap is the product backlog. This is a key artifact that Scrum and Data Driven Scrum teams use to keep track of deliverable ideas. A preferred format for each backlog item is the user story format.
6. Consider a Data Science Design Document
An alternative to the project plan is a data science design document, which Vincent likens to a lighthouse that guides you toward a specific destination. He outlines a data science design document with:
- Objectives
- The minimum viable product
- Research and explorations
- Milestones and results
We’ve covered most of these concepts already. But the new one to elaborate on is the Minimum Viable Product (MVP): the next version of your data science product that lets you learn the most about the problem space with the least effort. This could be, for example, a one-time offline model that predicts a subset of the overall problem space. From there, you can extend the model to a broader set of use cases and transition it into one that runs on an ongoing basis.
7. Write data science user stories
You should frequently generate ideas for deliverables and insights, and deliver on those with the highest value relative to the expected effort. A great way to organize these ideas is in user stories: short statements, often with accompanying details such as acceptance criteria or links to more thorough requirements.
A typical user story is stated from the lens of the stakeholder. It identifies who the stakeholder is, what they would like to receive, and why they would like the deliverable.
Example Data Science User Story
As someone who has been denied a credit card
I would like to receive a timely email that briefly describes the main reasons that I was denied
So that I can understand the reason behind the denial
This format provides numerous benefits:
- User stories are easy to understand.
- User stories force you to look at deliverables from the lens of the stakeholders.
- The short nature of the stories helps facilitate prioritization and follow-up conversations.
- User stories shift the focus toward conversations among the stakeholders and data science team.
Note that user stories might remove some of the burdens of detailed documentation but they won’t replace it. For example, software testers will need to develop a library of test use cases. Legal might need detailed documentation to comply with regulations. And your project contract might include a service-level agreement.
Data Science Model Documentation
We’ve covered guiding principles and some documentation to support the project plan. Let’s now focus on the data science model documentation.
8. Document the data
Proper data documentation can answer several questions such as:
- What data is being used for the model?
- Why was this data selected (and other data sets excluded)?
- How was the data obtained?
- What are known issues in the data?
- What does the data look like? (mean, median, mode, skewness, expected data volume, etc.)
- How did you alter the data? (transformations, imputations, other data cleaning techniques applied, etc.)
- Where is the data located?
- How frequently is the data refreshed?
- Is the data usage compliant with user agreements, data privacy best practices, and relevant regulations? (if not, don’t use it)
- What security protections do you have for data at rest and data in motion to ensure compliance and data privacy?
Data documentation will help in many ways. The last two questions help ensure that data is being used ethically and responsibly (Related: 10 Data Science Ethics Questions). Moreover, documented data issues, exploratory data analyses, and corrections can help you troubleshoot issues during the modeling phase. More broadly, the documentation will help others who might want to use the same data in the future. A data dictionary is a great way to encourage data reuse and enforce data standards.
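To keep answers like “What does the data look like?” current, you can generate part of this documentation directly from the data itself. Below is a minimal sketch using pandas; the file path, column handling, and output location are placeholders for your own project.

```python
import json

import pandas as pd

# Load the modeling dataset (the path is a placeholder for your own data source)
df = pd.read_csv("data/credit_applications.csv")

# Capture basic profile facts that answer "What does the data look like?"
profile = {
    "row_count": len(df),
    "columns": {
        col: {
            "dtype": str(df[col].dtype),
            "missing_pct": round(float(df[col].isna().mean() * 100), 2),
        }
        for col in df.columns
    },
    "numeric_summary": df.describe().to_dict(),  # mean, std, quartiles, etc.
}

# Save the profile alongside your hand-written data dictionary
with open("docs/data_profile.json", "w") as f:
    json.dump(profile, f, indent=2, default=str)
```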
9. Document your experimental design
Moving on to something that is close to each data scientist’s heart – the scientific method. This core process runs through the cycle of making a hypothesis, running an experiment, and measuring the results. Most data science projects will likewise flow through these steps – often looping through them several times. Document each one of the following before running the experiment (perhaps as accompanying detail in a user story).
- Your testable hypothesis
- The assumptions you made
- The target variable
- The control/test split
- The validation set
- (if relevant) The experimentation time window
At the end of the experiment (and possibly at occasional checkpoints during it), document the results – both from a statistical and a business impact perspective. Use this information to guide the design of potential follow-up experiments and project work.
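One lightweight way to capture these items before the experiment runs is a small, versioned record that lives next to the experiment code. The sketch below is illustrative; the field names, example values, and file path are assumptions, not a standard.

```python
import json
from dataclasses import asdict, dataclass

@dataclass
class ExperimentDesign:
    """Record the experimental design before running the experiment."""
    hypothesis: str          # the testable hypothesis
    assumptions: list        # assumptions you are making
    target_variable: str     # what the model predicts
    control_test_split: str  # how control and test groups are formed
    validation_set: str      # how the validation set is defined
    time_window: str = ""    # experimentation window, if relevant

design = ExperimentDesign(
    hypothesis="Adding payment-history features lifts AUC by at least 0.02",
    assumptions=["Payment history is available for 95%+ of applicants"],
    target_variable="default_within_12_months",
    control_test_split="80/20 random split, stratified by outcome",
    validation_set="Most recent quarter held out",
    time_window="2024-Q1",
)

# Store the design (and later the results) with the experiment artifacts
with open("docs/experiment_001_design.json", "w") as f:
    json.dump(asdict(design), f, indent=2)
```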
10. Document the algorithms
As part of your data science model documentation, you should document the algorithms used.
A great practice is to also include techniques that you attempted but decided not to use. This will help you look back and keep track of the decisions you made. It will also help you share knowledge with and educate other members of your team.
For many use cases, you might also want to document the biggest drivers of the model. Sometimes it’s even required by law. For example, credit card companies in the USA need to explain why an applicant was denied a credit card. In this scenario, you’ll need to detail why the model made each specific decision. Even if not legally required, documenting model drivers can help. For example, a retention team wants to know not just the likelihood of a customer churning, but also why they might churn.
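If you need to report the biggest drivers, most libraries expose them directly. Here is a minimal sketch with scikit-learn on toy data; your real project would use your own features, and tools such as SHAP can add per-decision explanations when individual outcomes must be justified.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy data stands in for your real feature matrix and target
X, y = make_classification(n_samples=500, n_features=8, random_state=42)
X = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(8)])

model = RandomForestClassifier(random_state=42).fit(X, y)

# Global drivers: rank features by importance for the model documentation
drivers = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)
print(drivers.head(10))  # include the top drivers in your model documentation
```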
Supporting Systems Documentation
11. Document the code
How does your code work? You can clearly explain that just after you’ve written it. But that might be tough a year from now when you tweak your data pipeline or retrain the model. It’ll be even tougher if you’re picking up code from someone who recently left your team.
You should always comment your code to help build a maintainable codebase. For Python data science project documentation, use # for single-line comments and triple-quoted docstrings ("""...""") for multi-line documentation, clarifying anything potentially ambiguous such as the purpose of a variable or a function.
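For example, here is what that can look like in practice (the threshold, function, and column names below are illustrative, and the function assumes a pandas DataFrame):

```python
# Share of missing values above which a column is dropped rather than imputed
MISSING_VALUE_THRESHOLD = 0.4

def impute_income(df):
    """Fill missing income with the median per employment category.

    The median is used instead of the mean because income is right-skewed;
    see the data documentation for the supporting analysis.
    """
    df["income"] = df.groupby("employment_category")["income"].transform(
        lambda s: s.fillna(s.median())
    )
    return df
```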
Wikis, README files, Word documents, or Google Docs can also be great ways to provide higher-level, project-wide documentation. However, if you go this route, be sure to update these documents with any sizeable change to the codebase.
12. Document the infrastructure
If you’re delivering one-off analyses, you could skip this. But production-grade models will need it. In fact, per a Google research paper, the vast majority of code in a machine learning system stems from the supporting infrastructure.
Infrastructure documentation will help with both preventative and corrective system maintenance.
Preventative Maintenance: Software grows old. New security threats arise all the time. Models and data will drift. Documenting in advance how best to maintain the system will help you keep it running smoothly. Consider documenting items such as…
- Cadence to check the model for re-training
- Calendar of key events like SSL certificate expiry or cloud resource budget planning cycles
- Software versions used
- How to scale your system to support increased product usage (could be planned in advance or automatic)
Corrective Maintenance: Your system will probably fail at some point. And if (when) it does, you’ll thank the person who put this documentation together for you so that you’re prepared with a response.
Artifact | Sample Questions |
---|---|
Service level agreement | What is the minimum system uptime threshold? What hours will the system be available? What is the response time mapped to the severity of issues? |
Alerts and notifications | What is considered a failure? (think in terms of the data, the model, and the software) What alarms are built into the system? How many times will the system re-attempt to run before an alarm is pushed? |
Runbook | Who will be notified when a failure occurs? How will they be notified? What should that person do? If the primary respondent is not available, who gets the escalation? |
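As a concrete illustration of the alerts-and-notifications row above, a scoring pipeline can define what counts as a failure, how many retries happen, and when the on-call responder gets paged. The thresholds and the `run_pipeline` / `notify_on_call` functions below are placeholders for your own pipeline entry point and notification mechanism.

```python
import logging
import time

logger = logging.getLogger("scoring_pipeline")

MAX_RETRIES = 3          # how many times to re-attempt before raising an alarm
MIN_ROW_COUNT = 10_000   # below this, treat the input data as a failure

def run_with_alerts(run_pipeline, notify_on_call):
    """Retry the pipeline, then alert the on-call responder per the runbook."""
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            row_count = run_pipeline()
            if row_count < MIN_ROW_COUNT:
                raise ValueError(f"Only {row_count} rows scored; expected >= {MIN_ROW_COUNT}")
            logger.info("Pipeline succeeded on attempt %d", attempt)
            return
        except Exception as exc:
            logger.warning("Attempt %d failed: %s", attempt, exc)
            time.sleep(60)  # back off before retrying
    notify_on_call(severity="high", message="Scoring pipeline failed after all retries")
```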
13. Build user documentation
Don’t forget your users! Be sure they know how to use your system.
If you have a user interface, a great practice is to put a help menu link in the upper right of your screen so that users can navigate to answers for questions such as:
- How do I control the visualization?
- What are the definitions for key measures and dimensions?
- When is the system available?
- Where do I report a bug or request a product feature?
Another common way to serve a model is via an API. In this case, write technical documentation so that the engineers consuming your API can build on top of it. Include items such as definitions, endpoints, parameters, data formats, and response times.
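Frameworks such as FastAPI can generate much of this API documentation from the code itself. Here is a minimal sketch; the endpoint name, request fields, and placeholder prediction are illustrative assumptions, not part of any specific project.

```python
from fastapi import FastAPI
from pydantic import BaseModel, Field

app = FastAPI(title="Credit Risk Scoring API", version="1.0")

class ScoreRequest(BaseModel):
    applicant_id: str = Field(..., description="Internal applicant identifier")
    annual_income: float = Field(..., ge=0, description="Annual income in USD")
    credit_utilization: float = Field(..., ge=0, le=1, description="Fraction of available credit in use")

class ScoreResponse(BaseModel):
    applicant_id: str
    default_probability: float = Field(..., description="Predicted probability of default")

@app.post("/v1/score", response_model=ScoreResponse, summary="Score a single credit applicant")
def score(request: ScoreRequest) -> ScoreResponse:
    """Return a default probability for one applicant (the model call is a placeholder)."""
    probability = 0.12  # placeholder for your trained model's prediction
    return ScoreResponse(applicant_id=request.applicant_id, default_probability=probability)
```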
Data Science Project Documentation Templates
14. Grab a pre-built template
There are a few templates that can help get you started.
- CRISP-DM: CRISP-DM is the most common data science life cycle and defines a series of documents you should develop throughout a data mining project. Warning – these documents tend toward a more traditional view of extensive documentation, and (given CRISP’s age) the CRISP-DM guide lacks modern deployment best practices. You can visit Patiegm’s Github page for a handy CRISP-DM documentation template.
- Microsoft’s TDSP: Microsoft takes a more modern documentation approach in its Team Data Science Process. The Charter and Exit Report are particularly useful, even if you do not use TDSP.
- Model Cards: In a 2019 research paper, Google introduced the concept of a Model Card to set a vision for standardized and transparent model reporting. Visit withgoogle.com for an overview.
- Data Science Checklist: Checklists are a great way to identify what needs to be done and to track the status of each task. As such, consider our data science project checklist.
15. Build your own data science documentation template
The reality is that your project, team, and organizational needs will deviate from the above templates. As such, use these as starting points toward creating your own data science documentation templates.
Congrats! You made it to the end. But your work is just getting started. Remember that these data science project documentation best practices do not apply to all circumstances. And your situation will likely require some additional practices not mentioned here. So to review:
- Know your audience
- Keep it simple but not too simple
- Document your plan
- Document your model
- Document your system
- Build and customize your own templates
Best of luck and reach out if you have some additional pointers you found useful.