What is SEMMA?
Businesses use data to gain a competitive advantage, improve performance, and deliver more useful services to customers. The data we collect about our surroundings serves as the foundation for hypotheses and models of the world we live in.
Ultimately, data is accumulated to build knowledge, but hoarding vast volumes of it is not the same as gathering valuable knowledge. Data is worth little until it is studied and analyzed: only when it is sorted and evaluated do we learn anything from it.
To that end, the SAS Institute developed SEMMA as a process for data mining. The acronym comes from its five steps: Sample, Explore, Modify, Model, and Assess. The process can be applied to a wide range of business problems, including fraud identification, customer retention and turnover, database marketing, customer loyalty, bankruptcy forecasting, market segmentation, as well as risk, affinity, and portfolio analysis.
The 5 Stages Of SEMMA
SAS presents SEMMA as an organized set of functional steps associated with its SAS Enterprise Miner product. Although the process is less clearly defined for those who do not use that tool, most regard SEMMA as a functional data mining methodology rather than a specific tool.
The process breaks down into five stages:
- Sample: This step entails selecting a subset of appropriate size from the large dataset provided for the model’s construction. The goal of this initial stage is to identify the variables or factors (both dependent and independent) that influence the process. The selected data is then partitioned into training and validation sets.
- Explore: During this step, univariate and multivariate analyses are conducted to study the relationships between data elements and to identify gaps in the data. Multivariate analysis studies the relationships between variables, while univariate analysis looks at each factor individually to understand its part in the overall scheme. Every factor that may influence the study’s outcome is analyzed, with heavy reliance on data visualization.
- Modify: In this step, the lessons learned during exploration are combined with business logic and applied to the sampled data. The data is parsed and cleaned, refined and transformed where exploration showed this to be necessary, and then passed on to the modeling stage.
- Model: With the variables refined and the data cleaned, the modeling step applies a variety of data mining techniques to produce a model that projects how this data leads to the desired outcome of the process.
- Assess: In this final SEMMA stage, the model is evaluated for how useful and reliable it is for the studied topic. The model can now be tested on data to estimate how well it performs.
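The five stages above can be sketched end to end in a few lines of Python. This is only an illustrative sketch using the standard library, not SAS Enterprise Miner: the synthetic dataset, the function names, the 70/30 split, the mean imputation, and the choice of a one-variable least-squares model are all assumptions made for the example.

```python
import math
import random
import statistics


def sample(rows, n, train_fraction=0.7, seed=0):
    """Sample: draw n rows at random, then split them into training and validation sets."""
    rng = random.Random(seed)
    subset = rng.sample(rows, n)
    cut = int(len(subset) * train_fraction)
    return subset[:cut], subset[cut:]


def explore(rows):
    """Explore: univariate summary of x, a missing-value count, and the x-y correlation."""
    complete = [(x, y) for x, y in rows if x is not None]
    xs = [x for x, _ in complete]
    ys = [y for _, y in complete]
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in complete)
    return {
        "missing_x": len(rows) - len(complete),  # gaps in the data
        "mean_x": mean_x,                        # univariate
        "stdev_x": statistics.stdev(xs),         # univariate
        "corr_xy": sxy / math.sqrt(sxx * syy),   # multivariate (Pearson)
    }


def modify(rows, fill_value):
    """Modify: clean the data by replacing missing x values with fill_value."""
    return [(fill_value if x is None else x, y) for x, y in rows]


def model(rows):
    """Model: fit y = slope * x + intercept by ordinary least squares."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def assess(params, rows):
    """Assess: mean absolute error of the fitted line on the validation rows."""
    slope, intercept = params
    return statistics.mean(abs(y - (slope * x + intercept)) for x, y in rows)


# Hypothetical dataset: y depends linearly on x, with noise and a few missing x values.
rng = random.Random(42)
data = []
for i in range(1000):
    x = rng.uniform(0, 10)
    y = 3.0 * x + 2.0 + rng.gauss(0, 1)
    data.append((None if i % 50 == 0 else x, y))

train, valid = sample(data, n=200)             # Sample
summary = explore(train)                       # Explore
train = modify(train, summary["mean_x"])       # Modify (fill value learned from training data)
valid = modify(valid, summary["mean_x"])
params = model(train)                          # Model
error = assess(params, valid)                  # Assess
print(summary)
print("slope, intercept:", params)
print("validation MAE:", error)
```

Note that the fill value used in the Modify step is computed from the training set only and then reused on the validation set, so the Assess step measures the model on data that did not shape it.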
SEMMA vs KDD Process vs CRISP-DM
At a high level, the parallels between KDD and SEMMA are easy to draw. The Sample stage is roughly comparable to KDD’s Selection, and SEMMA’s Explore stage serves the same basic function as KDD’s Pre-processing.
The Modify stage, much like its KDD equivalent, Transformation, is responsible for refining the data passed on from the stage before it. The Model stage is a loose equivalent of Data Mining (as defined by KDD), in the sense that it is where the collected, selected, and refined data is brought together and run through various techniques in order to derive knowledge and illustrate it more visually. Finally, the Assess step of SEMMA is a near-direct equivalent of KDD’s evaluation phase, where the data mining/modeling results are tested for their efficacy and previously unknown findings are fed back to refine the cyclical process.
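For quick reference, the stage-by-stage correspondence described above can be written as a simple lookup table. The dictionary name is hypothetical, and the pairs are the loose equivalences given in the text rather than an exact one-to-one mapping.

```python
# Loose SEMMA -> KDD stage correspondence, as described in the comparison above.
SEMMA_TO_KDD = {
    "Sample": "Selection",
    "Explore": "Pre-processing",
    "Modify": "Transformation",
    "Model": "Data Mining",
    "Assess": "Evaluation",
}

for semma_stage, kdd_stage in SEMMA_TO_KDD.items():
    print(f"{semma_stage:<8} ~ {kdd_stage}")
```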