IBM 15 User Manual

Understanding Data Mining

Classification nodes

The Auto Classifier node creates and compares a number of different models for
binary outcomes (yes or no, churn or do not churn, and so on), allowing you to
choose the best approach for a given analysis. A number of modeling algorithms are
supported, making it possible to select the methods you want to use, the specific
options for each, and the criteria for comparing the results. The node generates a set
of models based on the specified options and ranks the best candidates according to
the criteria you specify.

The Auto Numeric node estimates and compares models for continuous numeric
range outcomes using a number of different methods. The node works in the same
manner as the Auto Classifier node, allowing you to choose the algorithms to use
and to experiment with multiple combinations of options in a single modeling pass.
Supported algorithms include neural networks, C&R Tree, CHAID, linear regression,
generalized linear regression, and support vector machines (SVM). Models can be
compared based on correlation, relative error, or number of variables used.
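
The comparison criteria named above can be illustrated with a short Python sketch.
This is not the node's implementation. It assumes that "correlation" means the
Pearson correlation between observed and predicted values, and that "relative error"
means the ratio of the squared error of the model to the squared error of always
predicting the mean. Both definitions, and the function names, are assumptions made
for illustration.

```python
import math

def correlation(observed, predicted):
    """Pearson correlation between observed and predicted values (assumed definition)."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    # Assumes neither series is constant, so the denominator is nonzero.
    return cov / math.sqrt(var_o * var_p)

def relative_error(observed, predicted):
    """Model squared error divided by the squared error of the mean (assumed definition)."""
    mean_o = sum(observed) / len(observed)
    model_error = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    baseline_error = sum((o - mean_o) ** 2 for o in observed)
    return model_error / baseline_error

observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 1.5, 3.5, 3.5]  # hypothetical predictions from one candidate model
```

Under these assumed definitions, a higher correlation and a lower relative error both
indicate a better candidate model.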

The Classification and Regression (C&R) Tree node generates a decision tree that
allows you to predict or classify future observations. The method uses recursive
partitioning to split the training records into segments by minimizing the impurity
at each step, where a node in the tree is considered “pure” if 100% of cases in the
node fall into a specific category of the target field. Target and input fields can
be numeric ranges or categorical (nominal, ordinal, or flags); all splits are binary
(only two subgroups).
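
As a rough illustration of impurity-based binary splitting, the following Python
sketch scores candidate thresholds on one numeric input. It assumes the Gini index as
the impurity measure, which is one common choice; the text above does not say which
measure the node uses. The function names are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0.0 for a pure node, larger for mixed nodes."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_binary_split(values, labels):
    """Return (threshold, weighted_impurity) of the best split 'value <= threshold'.

    Returns (None, impurity_of_unsplit_node) if no split reduces impurity.
    """
    pairs = list(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))
    distinct = sorted(set(values))
    # Candidate thresholds are midpoints between adjacent distinct values.
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in pairs if x <= threshold]
        right = [y for x, y in pairs if x > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best
```

Recursive partitioning would apply the same search again to each of the two
resulting subgroups until a stopping rule is met.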

The QUEST node provides a binary classification method for building decision trees,
designed to reduce the processing time required for large C&R Tree analyses while
also reducing the tendency found in classification tree methods to favor inputs that
allow more splits. Input fields can be numeric ranges (continuous), but the target field
must be categorical. All splits are binary.

The CHAID node generates decision trees using chi-square statistics to identify
optimal splits. Unlike the C&R Tree and QUEST nodes, CHAID can generate
nonbinary trees, meaning that some splits have more than two branches. Target and
input fields can be numeric range (continuous) or categorical. Exhaustive CHAID is
a modification of CHAID that does a more thorough job of examining all possible
splits but takes longer to compute.
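
The chi-square statistic that underlies this kind of split selection can be sketched
in Python as follows. The sketch computes only the Pearson chi-square statistic for a
table of counts (input categories by target categories). It does not reproduce the
category merging or significance testing that a full CHAID procedure performs, and
the function name is hypothetical.

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table of observed counts.

    'table' is a list of rows; each row holds the counts for one input category
    across the target categories. Assumes every row and column total is nonzero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the input and the target were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat
```

A larger statistic indicates a stronger association between the input and the
target, so an input with a larger value is a stronger candidate for the split.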

The C5.0 node builds either a decision tree or a rule set. The model works by splitting
the sample based on the field that provides the maximum information gain at each
level. The target field must be categorical. Multiple splits into more than two
subgroups are allowed.
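
The idea of choosing the field with the maximum information gain can be sketched in
Python as shown below. This is a textbook entropy-based calculation for categorical
fields, not the C5.0 implementation, which may apply further refinements. The field
names and records are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of category labels."""
    n = len(labels)
    counts = Counter(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def information_gain(rows, field, target):
    """Reduction in target entropy obtained by splitting 'rows' on 'field'."""
    base = entropy([r[target] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[field], []).append(r[target])
    # One subgroup per distinct value, so splits can have more than two branches.
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder

def best_field(rows, fields, target):
    """Return the field with the largest information gain."""
    return max(fields, key=lambda f: information_gain(rows, f, target))

# Hypothetical records: 'plan' separates the outcomes perfectly, 'region' does not.
rows = [
    {"plan": "basic", "region": "north", "churn": "yes"},
    {"plan": "basic", "region": "south", "churn": "yes"},
    {"plan": "premium", "region": "north", "churn": "no"},
    {"plan": "premium", "region": "south", "churn": "no"},
]
```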

The Decision List node identifies subgroups, or segments, that show a higher or lower
likelihood of a given binary outcome relative to the overall population. For example,
you might look for customers who are unlikely to churn or are most likely to respond
favorably to a campaign. You can incorporate your business knowledge into the
model by adding your own custom segments and previewing alternative models side
by side to compare the results. Decision List models consist of a list of rules in which
each rule has a condition and an outcome. Rules are applied in order, and the first rule
that matches determines the outcome.
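
The first-match behavior described above can be sketched in Python as follows. The
rule conditions, field names, and the default outcome are hypothetical and serve
only to show how rule order affects the result.

```python
def apply_decision_list(rules, record, default=None):
    """Return the outcome of the first rule whose condition matches the record."""
    for condition, outcome in rules:
        if condition(record):
            return outcome
    # No rule matched; fall back to the default outcome.
    return default

# Hypothetical rules, each a (condition, outcome) pair, listed in priority order.
rules = [
    (lambda r: r["tenure_months"] >= 24, "stay"),
    (lambda r: r["complaints"] >= 3, "churn"),
]

# This record satisfies both conditions; the first rule wins.
long_term_complainer = {"tenure_months": 30, "complaints": 5}
new_complainer = {"tenure_months": 2, "complaints": 4}
new_quiet = {"tenure_months": 2, "complaints": 0}
```

Because evaluation stops at the first match, reordering the rules can change the
outcome for records that satisfy more than one condition.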

Linear regression models predict a continuous target based on linear relationships
between the target and one or more predictors.
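
For the simplest case of a single predictor, a linear relationship can be estimated
with ordinary least squares, as in the Python sketch below. This is a minimal
illustration of the idea and not the node's implementation; the function name and
data values are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    # Assumes the predictor is not constant, so sxx is nonzero.
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data that lies exactly on the line y = 1 + 2x.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
prediction = intercept + slope * 5  # predicted target for a new predictor value
```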