The SAS Institute developed SEMMA as a process for data mining. Its name is an acronym of its five steps: Sample, Explore, Modify, Model, and Assess. You can use the SEMMA data mining methodology to solve a wide range of business problems, including fraud identification, customer retention and turnover, database marketing, customer loyalty, bankruptcy forecasting, market segmentation, as well as risk, affinity, and portfolio analysis.
Businesses use the SEMMA methodology on their data mining and machine learning projects to achieve a competitive advantage, improve performance, and deliver more useful services to customers. The data we collect about our surroundings serve as the foundation for hypotheses and models of the world we live in.
Ultimately, data is accumulated in order to gain knowledge. But hoarding vast volumes of data is not the same as gathering valuable knowledge: data is worth little until it is sorted, studied, and analyzed.
Thus, SEMMA is designed as a data science methodology to help practitioners convert data into knowledge.
The 5 Stages Of SEMMA
SAS presents SEMMA as an organized, functional toolset associated with its SAS Enterprise Miner software. Although the SEMMA process can seem more ambiguous to those who do not use that tool, most practitioners regard it as a general data mining methodology rather than a specific tool.
The process breaks down into five stages:
- Sample: This step entails selecting a subset of appropriate size from the large dataset provided for the model’s construction. The goal of this initial stage is to identify the variables or factors (both dependent and independent) that influence the process. The sampled data is then partitioned into training and validation sets.
- Explore: During this step, univariate and multivariate analyses are conducted to study the relationships between data elements and to identify gaps in the data. Multivariate analysis studies the relationships between variables, while univariate analysis looks at each factor individually to understand its part in the overall scheme. All of the factors that may influence the study’s outcome are analyzed, with heavy reliance on data visualization.
- Modify: In this step, business logic is applied to the lessons learned from exploring the sampled data. In other words, the data is parsed and cleaned, refined and transformed where the exploration showed this to be necessary, and then passed on to the modeling stage.
- Model: With the variables refined and data cleaned, the modeling step applies a variety of data mining techniques in order to produce a projected model of how this data achieves the final, desired outcome of the process.
- Assess: In this final SEMMA stage, the model is evaluated for how useful and reliable it is for the studied topic. The model is tested against the data to estimate how well it performs.
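To make the five stages concrete, here is a minimal sketch in plain Python. It is not SAS Enterprise Miner code. The function names, the synthetic data, the 70/30 split, the mean imputation, and the choice of a simple least-squares line as the "model" are all assumptions made for illustration.

```python
import random
import statistics


def sample(rows, train_fraction=0.7, seed=0):
    """Sample: shuffle the rows and split them into training and validation sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def explore(rows):
    """Explore: univariate summary of x, plus a count of gaps (missing values)."""
    xs = [r["x"] for r in rows if r["x"] is not None]
    return {
        "mean_x": statistics.fmean(xs),
        "stdev_x": statistics.stdev(xs),
        "missing_x": len(rows) - len(xs),
    }


def modify(rows, fill_value):
    """Modify: clean the data by imputing missing x values with fill_value."""
    return [
        {"x": fill_value if r["x"] is None else r["x"], "y": r["y"]}
        for r in rows
    ]


def model(rows):
    """Model: fit y = intercept + slope * x by ordinary least squares."""
    xs = [r["x"] for r in rows]
    ys = [r["y"] for r in rows]
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return mean_y - slope * mean_x, slope


def assess(params, rows):
    """Assess: mean absolute error of the fitted line on the given rows."""
    intercept, slope = params
    return statistics.fmean(
        [abs(r["y"] - (intercept + slope * r["x"])) for r in rows]
    )


def run_semma(rows):
    """Run the five stages in order and return what each one produced."""
    train, valid = sample(rows)
    summary = explore(train)
    train = modify(train, summary["mean_x"])
    # Reuse the training mean so the validation set stays independent.
    valid = modify(valid, summary["mean_x"])
    params = model(train)
    return {
        "summary": summary,
        "params": params,
        "validation_mae": assess(params, valid),
    }


# Synthetic data (an assumption): y = 2x + 1, with two x values left missing.
rows = [{"x": float(i), "y": 2.0 * i + 1.0} for i in range(20)]
rows[3]["x"] = None
rows[11]["x"] = None
print(run_semma(rows))
```

Each function corresponds to one letter of the acronym, and `run_semma` simply chains them. In a real project the Explore stage would lean on visualization and the Model stage would compare several techniques.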
How Popular is SEMMA?
In four KDnuggets.com polls conducted between 2002 and 2014, respondents selected SEMMA 7% to 13% of the time. While that is significantly less often than CRISP-DM, it made SEMMA the second most commonly selected pre-defined framework.
We conducted a similar poll on this site in 2020. Only a single respondent selected SEMMA. This is not a like-for-like comparison with KDnuggets’ polls, as our audience likely has different demographics and our question and answer options were different.
However, anecdotally, we don’t encounter many practitioners who have even heard of SEMMA. And given its myopic focus (as discussed in the next section), SEMMA has likely lost ground to more modern and comprehensive data science methodologies.
SEMMA vs KDD Process vs CRISP-DM
At a high level, the parallels between KDD and SEMMA are easy to draw. The Sample stage is roughly comparable to KDD’s Selection, and the Explore stage serves the same basic function in SEMMA that Pre-processing serves in KDD.
The Modify stage, much like KDD’s Transformation, refines the sorted data from the stage before it. The Model stage is a loose equivalent of Data Mining (as defined by KDD), in the sense that it is where the collected, selected, and refined data is brought together and run through various techniques in order to test the derived knowledge and illustrate it more visually. Finally, the Assess step of SEMMA is a near-direct equivalent of KDD’s Evaluation phase, where the data mining or modeling results are tested for their efficacy and previously unknown findings are fed back to refine the cyclical process.
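The stage-by-stage correspondence described above can be summarized as a small lookup table. This is only an illustrative sketch: the names `SEMMA_TO_KDD` and `kdd_equivalent` are hypothetical, and the pairings are rough equivalents, not exact matches.

```python
# Rough SEMMA -> KDD correspondence, as described in the comparison above.
SEMMA_TO_KDD = {
    "Sample": "Selection",
    "Explore": "Pre-processing",
    "Modify": "Transformation",
    "Model": "Data Mining",
    "Assess": "Evaluation",
}


def kdd_equivalent(semma_stage):
    """Return the KDD stage that roughly matches a SEMMA stage name."""
    return SEMMA_TO_KDD[semma_stage.strip().capitalize()]


# Print the pairs in SEMMA order.
for semma_stage, kdd_stage in SEMMA_TO_KDD.items():
    print(f"{semma_stage:<8} ~ {kdd_stage}")
```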
SEMMA is a rather myopic approach toward data science projects. It does a good job of explaining the core technical steps of a machine learning life cycle. However, as data science projects enter mainstream organizations, a more comprehensive approach is needed. Some good starting points include: