## Framework for Ensemble Learning

Using various methods, you can meld results from many weak learners into one high-quality ensemble predictor. These methods closely follow the same syntax, so you can try different methods with minor changes in your commands.

You can create an ensemble for classification by using `fitcensemble` or for regression by using `fitrensemble`.

To train an ensemble for classification using `fitcensemble`, use this syntax.

`ens = fitcensemble(X,Y,Name,Value)`

• `X` is the matrix of data. Each row contains one observation, and each column contains one predictor variable.

• `Y` is the vector of responses, with the same number of observations as the rows in `X`.

• `Name,Value` specify additional options using one or more name-value pair arguments. For example, you can specify the ensemble aggregation method with the `'Method'` argument, the number of ensemble learning cycles with the `'NumLearningCycles'` argument, and the type of weak learners with the `'Learners'` argument. For a complete list of name-value pair arguments, see the `fitcensemble` function page.
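As a minimal sketch of this syntax, the following trains a boosted classification ensemble on the `ionosphere` sample data set that ships with Statistics and Machine Learning Toolbox. The data set, the `'AdaBoostM1'` method, and the cycle count of 100 are illustrative assumptions, not recommendations from this page.

```matlab
% Load sample data: X is a numeric predictor matrix and Y is a cell array
% of character vectors holding the class labels 'b' and 'g'.
load ionosphere

rng(1)  % for reproducibility

ens = fitcensemble(X,Y, ...
    'Method','AdaBoostM1', ...      % ensemble aggregation method
    'NumLearningCycles',100, ...    % number of ensemble learning cycles
    'Learners','Tree');             % type of weak learner

% Predict labels for the first five observations
labels = predict(ens,X(1:5,:));
```

The predicted labels come back as a cell array of character vectors because `Y` was supplied in that form.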

This figure shows the information you need to create a classification ensemble.

Similarly, you can train an ensemble for regression by using `fitrensemble`, which follows the same syntax as `fitcensemble`. For details on the input arguments and name-value pair arguments, see the `fitrensemble` function page.

For all classification or nonlinear regression problems, follow these steps to create an ensemble:

### Prepare the Predictor Data

All supervised learning methods start with predictor data, usually called `X` in this documentation. `X` can be stored in a matrix or a table. Each row of `X` represents one observation, and each column of `X` represents one variable or predictor.

### Prepare the Response Data

You can use a wide variety of data types for the response data.

• For regression ensembles, `Y` must be a numeric vector with the same number of elements as the number of rows of `X`.

• For classification ensembles, `Y` can be a numeric vector, categorical vector, character array, string array, cell array of character vectors, or logical vector.

For example, suppose your response data consists of three observations in the following order: `true`, `false`, `true`. You could express `Y` as:

• `[1;0;1]` (numeric vector)

• `categorical({'true','false','true'})` (categorical vector)

• `[true;false;true]` (logical vector)

• `['true ';'false';'true ']` (character array, padded with spaces so each row has the same length)

• `["true","false","true"]` (string array)

• `{'true','false','true'}` (cell array of character vectors)

Use whichever data type is most convenient. Because you cannot represent missing values with logical entries, do not use logical entries when you have missing values in `Y`.
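To illustrate that the choice of label type is a matter of convenience, here is a small sketch that trains the same bagged ensemble with logical, cell-array, and categorical labels. The synthetic data, the variable names, and the settings (`'Bag'`, 10 cycles) are assumptions made for the example.

```matlab
rng(1)  % for reproducibility

% Two well-separated synthetic classes, 20 observations each
X = [randn(20,2); randn(20,2) + 3];

% The same labels expressed with three different data types
Ylogical = [false(20,1); true(20,1)];
Ycell    = cellstr(string(Ylogical));   % 'false' and 'true' as character vectors
Ycat     = categorical(Ycell);

ensLogical = fitcensemble(X,Ylogical,'Method','Bag','NumLearningCycles',10);
ensCell    = fitcensemble(X,Ycell,'Method','Bag','NumLearningCycles',10);
ensCat     = fitcensemble(X,Ycat,'Method','Bag','NumLearningCycles',10);

% Predicted labels are returned in the same data type as the training labels
pLogical = predict(ensLogical,X);
pCell    = predict(ensCell,X);
pCat     = predict(ensCat,X);
```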

`fitcensemble` and `fitrensemble` ignore missing values in `Y` when creating an ensemble. This table shows how to represent a missing entry for each data type.

| Data Type | Missing Entry |
| --- | --- |
| Numeric vector | `NaN` |
| Categorical vector | `<undefined>` |
| Character array | Row of spaces |
| String array | `<missing>` or `""` |
| Cell array of character vectors | `''` |
| Logical vector | (not possible to represent) |
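As a quick check of these representations, the sketch below builds a three-element response with a missing second entry in several data types and inspects each with the general-purpose `ismissing` function. The variable names are hypothetical, and `ismissing` is used here only for inspection. It is not part of the ensemble-fitting workflow described on this page.

```matlab
Ynum  = [1; NaN; 0];                          % numeric vector
Ycat  = categorical({'true'; ''; 'false'});   % '' becomes <undefined>
Ystr  = ["true"; missing; "false"];           % string array with <missing>
Ycell = {'true'; ''; 'false'};                % cell array of character vectors

% Each result is a logical column that flags the second entry
mNum  = ismissing(Ynum);
mCat  = ismissing(Ycat);
mStr  = ismissing(Ystr);
mCell = ismissing(Ycell);
```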

### Choose an Applicable Ensemble Aggregation Method

To create classification and regression ensembles with `fitcensemble` and `fitrensemble`, respectively, choose appropriate algorithms from this list.

• For classification with two classes:

• `'AdaBoostM1'`

• `'LogitBoost'`

• `'GentleBoost'`

• `'RobustBoost'` (requires Optimization Toolbox™)

• `'LPBoost'` (requires Optimization Toolbox)

• `'TotalBoost'` (requires Optimization Toolbox)

• `'RUSBoost'`

• `'Subspace'`

• `'Bag'`

• For classification with three or more classes:

• `'AdaBoostM2'`

• `'LPBoost'` (requires Optimization Toolbox)

• `'TotalBoost'` (requires Optimization Toolbox)

• `'RUSBoost'`

• `'Subspace'`

• `'Bag'`

• For regression:

• `'LSBoost'`

• `'Bag'`

For descriptions of the various algorithms, see Ensemble Algorithms.

This table lists characteristics of the various algorithms. In the table titles:

• Imbalance — Good for imbalanced data (one class has many more observations than the other)

• Stop — Algorithm self-terminates

• Sparse — Requires fewer weak learners than other ensemble algorithms

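| Algorithm | Regression | Binary Classification | Multiclass Classification | Class Imbalance | Stop | Sparse |
| --- | --- | --- | --- | --- | --- | --- |
| `Bag` | × | × | × | | | |
| `AdaBoostM1` | | × | | | | |
| `AdaBoostM2` | | | × | | | |
| `LogitBoost` | | × | | | | |
| `GentleBoost` | | × | | | | |
| `RobustBoost` | | × | | | | |
| `LPBoost` | | × | × | | × | × |
| `TotalBoost` | | × | × | | × | × |
| `RUSBoost` | | × | × | × | | |
| `LSBoost` | × | | | | | |
| `Subspace` | | × | × | | | |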

`RobustBoost`, `LPBoost`, and `TotalBoost` require an Optimization Toolbox license. Try `TotalBoost` before `LPBoost`, as `TotalBoost` can be more robust.

#### Suggestions for Choosing an Appropriate Ensemble Algorithm

• Regression — Your choices are `LSBoost` or `Bag`. See General Characteristics of Ensemble Algorithms for the main differences between boosting and bagging.

• Binary Classification — Try `AdaBoostM1` first, with these modifications:

| Data Characteristic | Recommended Algorithm |
| --- | --- |
| Many predictors | `Subspace` |
| Skewed data (many more observations of one class) | `RUSBoost` |
| Label noise (some training data has the wrong class) | `RobustBoost` |
| Many observations | Avoid `LPBoost` and `TotalBoost` |

• Multiclass Classification — Try `AdaBoostM2` first, with these modifications:

| Data Characteristic | Recommended Algorithm |
| --- | --- |
| Many predictors | `Subspace` |
| Skewed data (many more observations of one class) | `RUSBoost` |
| Many observations | Avoid `LPBoost` and `TotalBoost` |

For details of the algorithms, see Ensemble Algorithms.
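The "try this first" suggestions can be sketched as a simple rule that picks the starting method from the number of classes. This is only an illustration of the guidance above: the `fisheriris` sample data, the variable names, and the cycle count are assumptions, and the rule ignores the data characteristics listed in the tables.

```matlab
load fisheriris   % meas: numeric predictors, species: cell array of class labels

% Pick the suggested starting method from the number of distinct classes
numClasses = numel(unique(species));
if numClasses == 2
    method = 'AdaBoostM1';
else
    method = 'AdaBoostM2';
end

rng(1)  % for reproducibility
ens = fitcensemble(meas,species,'Method',method,'NumLearningCycles',50);
```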

#### General Characteristics of Ensemble Algorithms

• `Boost` algorithms generally use very shallow trees. This construction uses relatively little time or memory. However, for effective predictions, boosted trees might need more ensemble members than bagged trees. Therefore it is not always clear which class of algorithms is superior.

• `Bag` generally constructs deep trees. This construction is both time-consuming and memory-intensive, and it also leads to relatively slow predictions.

• `Bag` can estimate the generalization error without additional cross-validation. See `oobLoss`.

• Except for `Subspace`, all boosting and bagging algorithms are based on decision tree learners. `Subspace` can use either discriminant analysis or k-nearest neighbor learners.

For details of the characteristics of individual ensemble members, see Characteristics of Classification Algorithms.
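As a brief sketch of the out-of-bag estimate mentioned above, the following trains a bagged classification ensemble and compares its out-of-bag error with its resubstitution error. The `ionosphere` data set and the cycle count of 50 are assumptions chosen for illustration.

```matlab
load ionosphere   % X: numeric predictors, Y: cell array of class labels

rng(1)  % for reproducibility
bag = fitcensemble(X,Y,'Method','Bag','NumLearningCycles',50);

% Out-of-bag error: each observation is scored only by trees that did not
% see it during training, so no separate cross-validation is needed
oobErr = oobLoss(bag);

% Resubstitution error, for comparison (typically optimistic)
resubErr = resubLoss(bag);
```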

### Set the Number of Ensemble Members

Choosing the size of an ensemble involves balancing speed and accuracy.

• Larger ensembles take longer to train and to generate predictions.

• Some ensemble algorithms can become overtrained (inaccurate) when too large.

To set an appropriate size, consider starting with several dozen to several hundred members in an ensemble, training the ensemble, and then checking the ensemble quality, as in Test Ensemble Quality. If it appears that you need more members, add them using the `resume` method of the classification or regression ensemble. Repeat until adding more members does not improve ensemble quality.
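One way to follow this train, check, and extend loop is sketched below. The holdout split, the `ionosphere` data set, and the step size of 50 cycles are assumptions for the example. Test Ensemble Quality describes other ways to assess an ensemble.

```matlab
load ionosphere
rng(1)  % for reproducibility

% Hold out 30% of the data to check ensemble quality
c = cvpartition(Y,'HoldOut',0.3);
Xtrain = X(training(c),:);  Ytrain = Y(training(c));
Xtest  = X(test(c),:);      Ytest  = Y(test(c));

% Start with 50 members and measure the test error
ens = fitcensemble(Xtrain,Ytrain,'Method','AdaBoostM1','NumLearningCycles',50);
lossBefore = loss(ens,Xtest,Ytest);

% Add 50 more learning cycles and measure again
ens = resume(ens,50);
lossAfter = loss(ens,Xtest,Ytest);

% Test error as a function of the number of trained members
cumLoss = loss(ens,Xtest,Ytest,'Mode','cumulative');
```

Plotting `cumLoss` against the member index shows where the test error flattens out, which suggests that additional members no longer help.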

Tip

For classification, the `LPBoost` and `TotalBoost` algorithms are self-terminating, meaning you do not have to investigate the appropriate ensemble size. Try setting `NumLearningCycles` to `500`. The algorithms usually terminate with fewer members.

### Prepare the Weak Learners

Currently the weak learner types are:

• `'Discriminant'` (recommended for `Subspace` ensemble)

• `'KNN'` (only for `Subspace` ensemble)

• `'Tree'` (for any ensemble except `Subspace`)

There are two ways to set the weak learner type in an ensemble.

• To create an ensemble with default weak learner options, specify the value of the `'Learners'` name-value pair argument as the character vector or string scalar of the weak learner name. For example:

```
ens = fitcensemble(X,Y,'Method','Subspace', ...
    'NumLearningCycles',50,'Learners','KNN');
% or
ens = fitrensemble(X,Y,'Method','Bag', ...
    'NumLearningCycles',50,'Learners','Tree');
```
• To create an ensemble with nondefault weak learner options, create a nondefault weak learner using the appropriate `template` method.

For example, if you have missing data, and want to use classification trees with surrogate splits for better accuracy:

```
templ = templateTree('Surrogate','all');
ens = fitcensemble(X,Y,'Method','AdaBoostM2', ...
    'NumLearningCycles',50,'Learners',templ);
```

To grow trees with leaves containing a number of observations that is at least 10% of the sample size:

```
templ = templateTree('MinLeafSize',size(X,1)/10);
ens = fitcensemble(X,Y,'Method','AdaBoostM2', ...
    'NumLearningCycles',50,'Learners',templ);
```

Alternatively, choose the maximal number of splits per tree:

```
templ = templateTree('MaxNumSplits',4);
ens = fitcensemble(X,Y,'Method','AdaBoostM2', ...
    'NumLearningCycles',50,'Learners',templ);
```

You can also use nondefault weak learners in `fitrensemble`.

While you can give `fitcensemble` and `fitrensemble` a cell array of learner templates, the most common usage is to give just one weak learner template.

For examples using a template, see Handle Imbalanced Data or Unequal Misclassification Costs in Classification Ensembles and Surrogate Splits.

Decision trees can handle `NaN` values in `X`. Such values are called “missing”. If you have some missing values in a row of `X`, a decision tree finds optimal splits using nonmissing values only. If an entire row consists of `NaN`, `fitcensemble` and `fitrensemble` ignore that row. If you have data with a large fraction of missing values in `X`, use surrogate decision splits. For examples of surrogate splits, see Handle Imbalanced Data or Unequal Misclassification Costs in Classification Ensembles and Surrogate Splits.

#### Common Settings for Tree Weak Learners

• The depth of a weak learner tree makes a difference for training time, memory usage, and predictive accuracy. You control the depth with these parameters:

• `MaxNumSplits` — The maximal number of branch node splits is `MaxNumSplits` per tree. Set large values of `MaxNumSplits` to get deep trees. The default for bagging is `size(X,1) - 1`. The default for boosting is `1`.

• `MinLeafSize` — Each leaf has at least `MinLeafSize` observations. Set small values of `MinLeafSize` to get deep trees. The default is `1` for classification and `5` for regression.

• `MinParentSize` — Each branch node in the tree has at least `MinParentSize` observations. Set small values of `MinParentSize` to get deep trees. The default is `2` for classification and `10` for regression.

If you supply both `MinParentSize` and `MinLeafSize`, the learner uses the setting that gives larger leaves (shallower trees):

`MinParent = max(MinParent,2*MinLeaf)`

If you additionally supply `MaxNumSplits`, then the software splits a tree until one of the three splitting criteria is satisfied.

• `Surrogate` — Grow decision trees with surrogate splits when `Surrogate` is `'on'`. Use surrogate splits when your data has missing values.

Note

Surrogate splits cause slower training and use more memory.

• `PredictorSelection` — `fitcensemble`, `fitrensemble`, and `TreeBagger` grow trees using the standard CART algorithm [11] by default. If the predictor variables are heterogeneous, or if some predictors have many levels and others have few, then standard CART tends to select predictors having many levels as split predictors. For split-predictor selection that is robust to the number of levels that the predictors have, consider specifying `'curvature'` or `'interaction-curvature'`. These specifications conduct chi-square tests of association between each predictor and the response or each pair of predictors and the response, respectively. The predictor that yields the minimal p-value is the split predictor for a particular node. For more details, see Choose Split Predictor Selection Technique.

Note

When boosting decision trees, selecting split predictors using the curvature or interaction tests is not recommended.
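To see how the depth settings above affect the weak learners, the following sketch trains two boosted ensembles that differ only in `MaxNumSplits` and counts the nodes in each trained tree. The data set, the split limits (1 and 20), and the variable names are assumptions for illustration.

```matlab
load ionosphere
rng(1)  % for reproducibility

% Two tree templates that differ only in the allowed number of splits
stump  = templateTree('MaxNumSplits',1);    % decision stumps
deeper = templateTree('MaxNumSplits',20);   % deeper trees

ensStump = fitcensemble(X,Y,'Method','AdaBoostM1', ...
    'NumLearningCycles',20,'Learners',stump);
ensDeeper = fitcensemble(X,Y,'Method','AdaBoostM1', ...
    'NumLearningCycles',20,'Learners',deeper);

% Count the nodes in each trained tree. A binary tree with at most k
% branch-node splits has at most 2*k + 1 nodes.
nodesStump  = cellfun(@(t) t.NumNodes, ensStump.Trained);
nodesDeeper = cellfun(@(t) t.NumNodes, ensDeeper.Trained);
```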

### Call `fitcensemble` or `fitrensemble`

The syntaxes for `fitcensemble` and `fitrensemble` are identical. For `fitrensemble`, the syntax is:

`ens = fitrensemble(X,Y,Name,Value)`

• `X` is the matrix of data. Each row contains one observation, and each column contains one predictor variable.

• `Y` is the vector of responses, with the same number of observations as the rows in `X`.

• `Name,Value` specify additional options using one or more name-value pair arguments. For example, you can specify the ensemble aggregation method with the `'Method'` argument, the number of ensemble learning cycles with the `'NumLearningCycles'` argument, and the type of weak learners with the `'Learners'` argument. For a complete list of name-value pair arguments, see the `fitrensemble` function page.

The result of `fitrensemble` and `fitcensemble` is an ensemble object, suitable for making predictions on new data. For a basic example of creating a regression ensemble, see Train Regression Ensemble. For a basic example of creating a classification ensemble, see Train Classification Ensemble.
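As a minimal sketch of the regression case, the following trains a boosted regression ensemble on the `carsmall` sample data set and predicts the response for one new observation. The choice of predictors, the cycle count, and the new observation's values are assumptions made for the example.

```matlab
load carsmall   % provides Weight, Horsepower, and MPG, among other variables

X = [Weight Horsepower];   % predictor matrix, one row per car
Y = MPG;                   % response; contains NaN entries, which are ignored

rng(1)  % for reproducibility
ens = fitrensemble(X,Y,'Method','LSBoost','NumLearningCycles',100);

% Predict fuel economy for a hypothetical car: 3000 lb, 130 hp
mpgHat = predict(ens,[3000 130]);
```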

#### Where to Set Name-Value Pairs

You can pass some name-value pair arguments to `fitcensemble` or `fitrensemble`, and others to the weak learner templates (`templateDiscriminant`, `templateKNN`, and `templateTree`). To decide whether an argument belongs to the ensemble or to the weak learner:

• Use template name-value pairs to control the characteristics of the weak learners.

• Use `fitcensemble` or `fitrensemble` name-value pair arguments to control the ensemble as a whole, either for algorithms or for structure.

For example, for an ensemble of boosted classification trees with each tree deeper than the default, set the `templateTree` name-value pair arguments `MinLeafSize` and `MinParentSize` to smaller values than the defaults, or set `MaxNumSplits` to a larger value than the default. The trees are then leafier (deeper).

To name the predictors in a classification ensemble (part of the structure of the ensemble), use the `PredictorNames` name-value pair in `fitcensemble`.
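The split between the two kinds of arguments can be sketched as follows: the tree depth goes into the template, while the method, the number of cycles, and the predictor names go to `fitcensemble`. The `fisheriris` data set, the predictor names, and the numeric settings are illustrative assumptions.

```matlab
load fisheriris
rng(1)  % for reproducibility

% Weak-learner setting: goes into the template
t = templateTree('MaxNumSplits',20);

% Ensemble-level settings: go to fitcensemble
names = {'SepalLength','SepalWidth','PetalLength','PetalWidth'};
ens = fitcensemble(meas,species, ...
    'Method','AdaBoostM2', ...
    'NumLearningCycles',50, ...
    'Learners',t, ...
    'PredictorNames',names);
```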