The SAS Institute developed SEMMA as a data mining process. Its name is an acronym of its five steps: Sample, Explore, Modify, Model, and Assess. You can use the SEMMA data mining methodology to solve a wide range of business problems, including fraud identification, customer retention and turnover, database marketing, customer loyalty, bankruptcy forecasting, market segmentation, as well as risk, affinity, and portfolio analysis.
Why SEMMA?
Businesses use the SEMMA methodology on their data mining and machine learning projects to achieve a competitive advantage, improve performance, and deliver more useful services to customers. The data we collect about our surroundings serve as the foundation for hypotheses and models of the world we live in.
Ultimately, we accumulate data in order to gain knowledge, so data is not worth much until it is studied and analyzed. Hoarding vast volumes of data is not equivalent to gathering valuable knowledge. It is only when data is sorted and evaluated that we learn anything from it.
Thus, SEMMA is designed as a data science methodology to help practitioners convert data into knowledge.
The 5 Stages Of SEMMA
SAS presents SEMMA as an organized, functional toolset associated with its SAS Enterprise Miner software. While the SEMMA process can be more ambiguous to those not using that tool, most practitioners regard it as a functional data mining methodology rather than a specific tool.
The process breaks down into five stages:
- Sample: This step entails choosing a subset of appropriate size from the large dataset that has been provided for the model’s construction. The goal of this initial stage of the process is to identify the variables or factors (both dependent and independent) that influence the process. The collected information is then partitioned into a preparation (training) set and a validation set.
- Explore: During this step, univariate and multivariate analyses are conducted in order to study the relationships between data elements and to identify gaps in the data. While multivariate analysis studies the relationships between variables, univariate analysis looks at each factor individually to understand its part in the overall scheme. All of the factors that may influence the study’s outcome are analyzed, with heavy reliance on data visualization.
- Modify: In this step, the lessons learned in the exploration phase are applied, together with business logic, to the data collected in the sample phase. In other words, the data is parsed and cleaned, and refined and transformed where exploration showed that to be necessary, before being passed on to the modeling stage.
- Model: With the variables refined and data cleaned, the modeling step applies a variety of data mining techniques in order to produce a projected model of how this data achieves the final, desired outcome of the process.
- Assess: In this final SEMMA stage, the model is evaluated for how useful and reliable it is for the studied topic. The model can now be tested against data to estimate how well it performs.
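As a rough illustration of how the five stages chain together, here is a minimal Python sketch that uses only the standard library. Everything in it is an assumption made for illustration: the synthetic dataset from `make_data`, the function names, mean imputation in the Modify step, and a one-variable least-squares line as the model. SEMMA itself does not prescribe any of these choices, and the sketch does not use SAS Enterprise Miner.

```python
import math
import random
import statistics


def make_data(n=500, seed=42):
    """Hypothetical dataset: y depends linearly on x; every tenth x is missing."""
    rng = random.Random(seed)
    rows = []
    for i in range(n):
        x = rng.uniform(0, 10)
        y = 3 * x + 2 + rng.gauss(0, 1)
        rows.append((None if i % 10 == 0 else x, y))
    return rows


def sample(rows, n, train_fraction=0.7, seed=0):
    """Sample: draw n rows at random and split them into training and validation sets."""
    subset = random.Random(seed).sample(rows, n)
    cut = round(n * train_fraction)
    return subset[:cut], subset[cut:]


def explore(rows):
    """Explore: univariate summary of x plus the x-y correlation (multivariate)."""
    known = [(x, y) for x, y in rows if x is not None]
    xs = [x for x, _ in known]
    ys = [y for _, y in known]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in known)
    spread = math.sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return {"x_mean": mx, "x_missing": len(rows) - len(known), "xy_correlation": cov / spread}


def modify(rows, fill_value):
    """Modify: clean the data by replacing missing x values with fill_value."""
    return [(fill_value if x is None else x, y) for x, y in rows]


def model(rows):
    """Model: fit y = slope * x + intercept by ordinary least squares."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    slope = sum((x - mx) * (y - my) for x, y in rows) / sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx


def assess(params, rows):
    """Assess: root-mean-square error of the fitted line on held-out rows."""
    slope, intercept = params
    return math.sqrt(statistics.fmean((slope * x + intercept - y) ** 2 for x, y in rows))


def run_semma(rows, n=200):
    """Run the five stages in order on a list of (x, y) rows."""
    train, valid = sample(rows, n)
    summary = explore(train)
    train = modify(train, summary["x_mean"])
    valid = modify(valid, summary["x_mean"])  # reuse the training mean to avoid leakage
    params = model(train)
    return summary, params, assess(params, valid)


summary, params, rmse = run_semma(make_data())
print(summary)
print("slope, intercept:", params)
print("validation RMSE:", rmse)
```

Mean imputation keeps the sketch short, but rows whose x was missing will tend to have larger validation errors. That is the kind of finding the Assess stage can feed back into an earlier stage such as Modify.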
How Popular is SEMMA?
In four polls from KDnuggets.com spanning 2002 to 2014, respondents selected SEMMA 7–13% of the time. While this is significantly less than CRISP-DM, it makes SEMMA the second most commonly selected pre-defined framework.
Polls from KDnuggets.com
We conducted a similar poll on this site in 2020. Only a single respondent selected SEMMA. This is not a like-for-like comparison with KDnuggets’ polls, as our audience likely has different demographics and our question and answer options were different.
Anecdotally, however, we don’t encounter many practitioners who have even heard of SEMMA. Given its myopic focus, discussed in the next section, SEMMA has likely fallen out of favor as more modern and comprehensive data science methodologies have emerged.
SEMMA vs KDD Process vs CRISP-DM
The CRoss Industry Standard Process for Data Mining (CRISP-DM) and the Knowledge Discovery in Databases (KDD) Process are two data mining life cycles similar to SEMMA.
At a high level, the parallels between KDD and SEMMA are easy to draw. The Sample stage is roughly comparable to KDD’s Selection, and the Explore phase serves the same basic function as KDD’s Pre-processing.
The Modify stage, much like its KDD equivalent, Transformation, is responsible for refining the data from the stage before it. The Model phase is a loose equivalent of Data Mining (as defined by KDD) in the sense that it is when the collected, selected, and refined data is brought together and run through various tests in order to check the derived knowledge and illustrate it more visually. Finally, the Assess step of SEMMA is a near-direct equivalent of KDD’s Evaluation phase, where the data mining/modeling results are tested for their efficacy, and previously unknown findings are fed back to refine the cyclical process.
Learn More
SEMMA is a rather myopic approach toward data science projects. It does its job of explaining the core technical steps of a machine learning life cycle. However, as data science projects enter mainstream organizations, a more comprehensive approach is needed.
Some good starting points include: