A High Level Machine Learning Process
A high level view of the steps in the machine learning process was described in our post on machine learning life cycles. In short, this workflow includes problem exploration, data engineering, model engineering and ML Ops.
The Benefit of a More Detailed Machine Learning Process
While this high level workflow (which some people refer to as a life cycle) is helpful for providing an overall summary of the phases in a machine learning project, it does not provide an intuitive explanation of the work required to actually create a predictive model.
In other words, a more detailed machine learning process can provide a better non-technical view of the work required to build a machine learning model. This enables the entire team to have an intuitive understanding of the steps required to build a model, and hence how to prioritize the work to be done and how much time each step might take.
A More Detailed Machine Learning Process
This more detailed process keeps the same high level phases (problem exploration, data engineering, model engineering and ML Ops), but defines the key steps within each phase of the ML process. Below is a discussion of each step in the process.
Problem Exploration
First focus on how the model will be used. In the process, assess the desired model accuracy and explore other details, such as whether false positives are worse than false negatives. This phase also includes understanding what data might be available.
- Define Success: Define the problem to be solved; for example, what should be predicted. This helps determine what data will be needed. Also, make sure it is clear how success will be measured.
- Evaluate Data: Determine the relevant data sources. In other words, evaluate what data the team will need, how that data is collected, and where the data is stored.
Data Engineering
Design and build data pipelines. These pipelines acquire, clean, and transform data into a format that is more easily used to build a predictive model. Note that this data might come from multiple data sources, so merging the data is also a key aspect of data engineering. This is often where the most time is spent in an ML project. Short code sketches after the list below illustrate each step.
- Obtain Data: Assemble the data. This includes connecting to remote data stores and databases, which might hold data in different formats. For example, some data might be in CSV format, and other data could be available in JSON via web services.
- Scrub Data: Re-format particular attributes and correct errors in the data. Datasets often contain missing values, duplicates, or values of the wrong type or range, so cleaning can include removing duplicates, correcting errors, imputing missing values, normalizing attributes, and handling data type conversions.
- Explore / Validate Data: Get a basic understanding of the data. This exploratory analysis includes data profiling to obtain information about the content and structure of the data. The goal is to understand both the data attributes and the quality of the data.
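To make the Obtain Data step concrete, here is a minimal sketch using pandas that reads one dataset from a CSV file and another from a JSON web service, then merges them. The file name, URL, and join key are hypothetical; any real project would substitute its own sources.

```python
import pandas as pd

# Load tabular data stored as a CSV file (hypothetical file name).
customers = pd.read_csv("customers.csv")

# Load data exposed as JSON by a web service (hypothetical URL).
orders = pd.read_json("https://example.com/api/orders.json")

# Merge the two sources on a shared key so downstream steps see one table.
data = customers.merge(orders, on="customer_id", how="left")
```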
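Similarly, the Scrub Data step often reduces to a handful of pandas operations. The sketch below assumes hypothetical columns (age, signup_date, income) purely for illustration.

```python
import pandas as pd

data = pd.read_csv("merged_data.csv")  # placeholder input file

# Remove exact duplicate rows.
data = data.drop_duplicates()

# Impute missing numeric values with the column median.
data["age"] = data["age"].fillna(data["age"].median())

# Correct a data type: dates often arrive as plain strings.
data["signup_date"] = pd.to_datetime(data["signup_date"])

# Normalize a numeric attribute to the [0, 1] range.
income = data["income"]
data["income"] = (income - income.min()) / (income.max() - income.min())
```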
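Finally, a few profiling commands go a long way in the Explore / Validate step. This sketch assumes the same placeholder file and a hypothetical target column named label.

```python
import pandas as pd

data = pd.read_csv("merged_data.csv")  # placeholder input file

print(data.shape)                     # number of rows and columns
print(data.dtypes)                    # data type of each attribute
print(data.describe())                # summary statistics for numeric columns
print(data.isna().sum())              # missing values per column
print(data["label"].value_counts())   # class balance of the assumed target
```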
Model Engineering
This is the phase that most people associate with building a machine learning model. During this phase, data is used to train and evaluate the model. This is often an iterative task, where different models are tried and the chosen model is tuned. Code sketches after the list below illustrate these steps.
- Select & Train Model: The process of identifying an appropriate model, and then building / training the model (on training data). The goal of training is to answer a question or make a prediction correctly as often as possible.
- Test Model: Run the model on data that the model has not yet seen. In other words, perform model testing using data that was withheld from training (a holdout test set, sometimes called backtesting).
- Evaluate & Interpret Model: Objectively measure the performance of the model. Basic evaluation explores metrics such as accuracy and precision to determine if the model is usable, and which model is best for the specific problem being explored. This evaluation also includes an understanding of when the model makes mistakes. More generally, validating the trained model helps ensure the model meets the original organizational objectives before it is put into production.
- Tune Model: This step refers to hyperparameter tuning, which, depending on the model being used, can be more an art than a science. In short, models typically have hyperparameters (i.e., dials for tuning the model), which allow the model to achieve improved performance via refinement. Simple examples include the number of training steps and the initialization of certain values.
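To illustrate the select, train, test, and evaluate steps together, here is a minimal scikit-learn sketch. The feature and label columns are hypothetical, and logistic regression simply stands in for whatever model the team selects.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score
from sklearn.model_selection import train_test_split

data = pd.read_csv("merged_data.csv")  # placeholder input file
X = data[["age", "income"]]            # assumed feature columns
y = data["label"]                      # assumed binary target (0 or 1)

# Withhold a test set so the model is judged on data it has not seen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Select and train a candidate model on the training data.
model = LogisticRegression()
model.fit(X_train, y_train)

# Test the model on the withheld data.
predictions = model.predict(X_test)

# Evaluate: overall metrics, plus where the mistakes occur.
print("accuracy: ", accuracy_score(y_test, predictions))
print("precision:", precision_score(y_test, predictions))
print(confusion_matrix(y_test, predictions))  # false positives vs. false negatives
```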
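And for the tuning step, a grid search is one common approach. The sketch below uses synthetic data so it runs standalone; the parameter grid over the regularization strength C is an illustrative choice, not a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the project's training data.
X_train, y_train = make_classification(n_samples=500, random_state=42)

# Candidate values for one "dial" on the model: the regularization strength C.
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

# Try every candidate with 5-fold cross-validation and keep the best.
search = GridSearchCV(LogisticRegression(), param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print("best parameter setting:", search.best_params_)
print("best cross-validated score:", search.best_score_)
```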
ML Ops
Broadly defined, machine learning operations (ML Ops) spans a wide set of practices, systems, and responsibilities that data scientists, data engineers, cloud engineers, IT operations, and business stakeholders use to deploy, scale, and maintain machine learning solutions. Short sketches after the list below illustrate the deploy and monitor steps.
- Deploy Model: Package and put the model to use (i.e., into production). While this varies from one group to another, the team needs to understand the expected model performance, how the model will be monitored, and in general, key performance indicators (KPIs) of the model.
- Monitor Model: Maintain the model in production. This includes monitoring the KPIs and proactively working to ensure stable and robust predictions.
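As one simple illustration of the deploy step, a trained scikit-learn model can be packaged as a file artifact and loaded wherever predictions are served. This is a minimal sketch; real deployments typically wrap the loaded model in a service, batch job, or stream processor.

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Train a stand-in model on synthetic data.
X_train, y_train = make_classification(n_samples=500, random_state=42)
model = LogisticRegression().fit(X_train, y_train)

# Package the trained model as a file artifact (illustrative file name).
joblib.dump(model, "model.joblib")

# In production, load the artifact and serve predictions.
deployed_model = joblib.load("model.joblib")
print(deployed_model.predict(X_train[:5]))
```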
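Monitoring varies widely from team to team, but one simple check is to compare the share of positive predictions in production against the rate observed during validation. The baseline rate, threshold, and alerting logic below are placeholder assumptions.

```python
import numpy as np

baseline_positive_rate = 0.20  # assumed rate observed during validation
recent_predictions = np.array([0, 1, 0, 0, 1, 1, 0, 1])  # placeholder batch

recent_rate = recent_predictions.mean()
if abs(recent_rate - baseline_positive_rate) > 0.10:  # placeholder threshold
    print(f"ALERT: positive-prediction rate drifted to {recent_rate:.2f}")
else:
    print(f"OK: positive-prediction rate is {recent_rate:.2f}")
```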
The Machine Learning Process Coordination Framework
When most people describe the machine learning process, they focus only on the steps required to build a predictive model (i.e., the steps just discussed) or more generally, the machine learning life cycle. This might be appropriate if the work is being done by one person, such as a researcher doing some analysis.
However, creating and using predictive models is increasingly becoming a team sport. A modern data science team needs to define both the steps of the project and how to coordinate among the team members working on it.
For example, note that while the arrows in the diagram show a continuous flow, the team might need to go back to a previous phase / step. How does the team determine “when to move forward” and “when to take a step back”? This is where a coordination framework can be useful.
Together, the steps of the project combined with a coordination framework create a comprehensive process that can guide the team toward successful project execution.
What is a Coordination Framework?
Whereas the life cycle defines the steps necessary to complete a project, the coordination framework defines how the team coordinates these various steps.
The coordination framework within an effective machine learning process encourages Agile principles, specifically the use of small, incremental deliverables and quick planning cycles, to help support the ever-changing business landscape.
Three common agile coordination frameworks can work within the machine learning process:
- Kanban – Simple, lightweight process centered on a highly visible board that describes the current flow of work
- Scrum – A popular software development framework based on fixed-length, timeboxed iterations (sprints)
- Data Driven Scrum – A variant of Scrum designed specifically for data science with capability-based iterations
Wrap Up
To maximize the value of a data science team's effort in creating predictive models, the team should use an appropriate machine learning process. That process should include both the steps of the project and an agile coordination framework.
Learn More
Explore some of our other blog posts:
- The machine learning life cycle
- The Data Driven Scrum Guide
- Effective data science processes
- What is ML Ops?
- What is the data science process?