## Anomaly Detection with Isolation Forest

### Introduction to Isolation Forest

The isolation forest algorithm [1] detects anomalies by isolating them from normal points using an ensemble of isolation trees. Each isolation tree is trained on a subset of training observations, sampled without replacement. The algorithm grows an isolation tree by choosing a split variable and split position at random until every observation in the subset lands in a separate leaf node. Anomalies are few and different; therefore, an anomaly tends to land in a leaf node closer to the root node and has a shorter path length (the distance from the root node to the leaf node) than a normal point. The algorithm identifies anomalies using anomaly scores defined based on the average path lengths over all isolation trees.
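
To make the isolation idea concrete, here is a toy sketch (illustrative only, not the `iforest` implementation): it repeatedly applies random axis-aligned splits to a standard normal sample and counts how many splits it takes to separate a clearly anomalous point.

```matlab
% Toy sketch of the isolation idea (not the iforest implementation).
% Randomly split until the partition containing x holds at most one point,
% and count the number of splits (the path length h).
rng("default")
X = randn(256,2);      % subsample of normal points
x = [4 4];             % a clearly anomalous observation
h = 0;                 % path length
S = X;                 % points sharing x's current partition
while size(S,1) > 1
    j = randi(2);                                     % random split variable
    p = min(S(:,j)) + (max(S(:,j))-min(S(:,j)))*rand; % random split position
    if x(j) < p
        S = S(S(:,j) < p,:);
    else
        S = S(S(:,j) >= p,:);
    end
    h = h + 1;
end
h  % anomalies such as x tend to be isolated after only a few splits
```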

Use the `iforest` function, `IsolationForest` object, and `isanomaly` object function for outlier detection and novelty detection:

- Outlier detection (detecting anomalies in training data) — Detect anomalies in training data by using the `iforest` function. `iforest` builds an `IsolationForest` object and returns anomaly indicators and scores for the training data. For an example, see Detect Outliers and Plot Contours of Anomaly Scores.
- Novelty detection (detecting anomalies in new data with uncontaminated training data) — Create an `IsolationForest` object by passing uncontaminated training data (data with no outliers) to `iforest`, and detect anomalies in new data by passing the object and the new data to the object function `isanomaly`. For each observation of the new data, the function finds the average path length to reach a leaf node from the root node in the trained isolation forest, and returns an anomaly indicator and score. For an example, see Detect Novelties; a minimal sketch also follows this list.
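
The following is a minimal sketch of the novelty detection workflow. The data here is simulated for illustration, and `Xtrain` is assumed to be uncontaminated.

```matlab
% Hedged sketch of novelty detection; Xtrain is assumed uncontaminated.
rng("default")
Xtrain = randn(1000,2);           % clean training data (assumption)
f = iforest(Xtrain);              % train on uncontaminated data
Xnew = [0.1 -0.2; 5 5];           % a normal-looking point and a far-off point
[tf,scores] = isanomaly(f,Xnew)   % anomaly indicator and score per row of Xnew
```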

### Parameters for Isolation Forests

You can specify parameters for the isolation forest algorithm by using the name-value arguments of `iforest`:

- `NumObservationsPerLearner` (number of observations for each isolation tree) — Each isolation tree corresponds to a subset of training observations. For each tree, `iforest` samples `min(N,256)` observations from the training data without replacement, where `N` is the number of training observations. The isolation forest algorithm performs well with a small sample size because a small sample helps to detect dense anomalies and anomalies close to normal points. However, you need to experiment with the sample size if `N` is small. For an example, see Examine NumObservationsPerLearner for Small Data.
- `NumLearners` (number of isolation trees) — By default, the `iforest` function grows 100 isolation trees because the average path lengths usually converge well before that point [1]. You can override both defaults, as shown in the sketch after this list.
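
For instance, a hedged sketch of overriding both name-value arguments (the values below are chosen only for illustration, not as recommendations):

```matlab
% Illustrative parameter values, not recommendations.
rng("default")
X = randn(500,3);
f = iforest(X, ...
    NumLearners=200, ...               % more trees than the default 100
    NumObservationsPerLearner=64);     % smaller subsample per tree
f.NumLearners                          % stored on the IsolationForest object
```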

### Anomaly Scores

The isolation forest algorithm computes the anomaly score *s*(*x*) of an observation *x* by normalizing the path length *h*(*x*):

$$s(x)={2}^{-\frac{E[h(x)]}{c(n)}},$$

where *E*[*h*(*x*)] is the average path length over all isolation trees in the isolation
forest, and *c*(*n*) is the average path length of unsuccessful searches in a binary search
tree of *n* observations.

- The score approaches 1 as *E*[*h*(*x*)] approaches 0. Therefore, a score value close to 1 indicates an anomaly.
- The score approaches 0 as *E*[*h*(*x*)] approaches *n* − 1. Also, the score approaches 0.5 when *E*[*h*(*x*)] approaches *c*(*n*). Therefore, a score value smaller than 0.5 and close to 0 indicates a normal point.
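
As a worked check of the formula, the following sketch uses the approximation of *c*(*n*) given in [1], *c*(*n*) = 2*H*(*n* − 1) − 2(*n* − 1)/*n*, where *H*(*i*) ≈ ln(*i*) + 0.5772 (Euler's constant):

```matlab
% Worked check of the score formula; c(n) uses the approximation in [1]:
% c(n) = 2*H(n-1) - 2*(n-1)/n, with H(i) ~ log(i) + 0.5772 (Euler's constant).
n = 256;                                        % subsample size per tree
cn = 2*(log(n-1) + 0.5772156649) - 2*(n-1)/n;
Eh = [1 cn n-1];                                % short, average, and long path lengths
s = 2.^(-Eh/cn)                                 % scores near 1, 0.5, and 0
```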

### Anomaly Indicators

`iforest` and `isanomaly` identify observations with anomaly scores above the score threshold as anomalies. The functions return a logical vector that has the same length as the input data. A value of logical 1 (`true`) indicates that the corresponding row of the input data is an anomaly.

- `iforest` determines the threshold value (`ScoreThreshold` property value) to detect the specified fraction (`ContaminationFraction` name-value argument) of training observations as anomalies. By default, the function treats all training observations as normal observations.
- `isanomaly` provides the `ScoreThreshold` name-value argument, which you can use to specify the threshold. The default threshold is the value determined when the isolation forest is trained. The sketch after this list illustrates this interplay.

### Detect Outliers and Plot Contours of Anomaly Scores

This example uses generated sample data containing outliers. Train an isolation forest model and detect the outliers by using the `iforest` function. Then, compute anomaly scores for the points around the sample data by using the `isanomaly` function, and create a contour plot of the anomaly scores.

**Generate Sample Data**

Use a Gaussian copula to generate random data points from a bivariate distribution.

rng("default") rho = [1,0.05;0.05,1]; n = 1000; u = copularnd("Gaussian",rho,n);

Add noise to 5% of randomly selected observations to make them outliers.

```matlab
noise = randperm(n,0.05*n);
u(noise,1) = u(noise,1)*5;
```

**Train Isolation Forest and Detect Outliers**

Train an isolation forest model by using the `iforest` function. Specify the fraction of anomalies in the training observations as 0.05.

```matlab
[f,tf,scores] = iforest(u,ContaminationFraction=0.05);
```

`f` is an `IsolationForest` object. `iforest` also returns the anomaly indicators (`tf`) and anomaly scores (`scores`) for the training data.

Plot a histogram of the score values. Create a vertical line at the score threshold corresponding to the specified fraction.

```matlab
histogram(scores)
xline(f.ScoreThreshold,"r-",join(["Threshold" f.ScoreThreshold]))
```

Check the performance of `iforest` by plotting a receiver operating characteristic (ROC) curve and computing the area under the curve (AUC) value.

```matlab
true_tf = false(n,1);
true_tf(noise) = true;
[X,Y,T,AUC] = perfcurve(true_tf,scores,true);
AUC
```
```
AUC = 0.8384
```

```matlab
plot(X,Y)
xlabel("False positive rate")
ylabel("True positive rate")
title("ROC for Anomaly Detection")
```

**Plot Contours of Anomaly Scores**

Use the trained isolation forest model and the `isanomaly` function to compute anomaly scores for 2-D grid coordinates around the training observations.

```matlab
l1 = linspace(min(u(:,1),[],1),max(u(:,1),[],1));
l2 = linspace(min(u(:,2),[],1),max(u(:,2),[],1));
[X1,X2] = meshgrid(l1,l2);
[~,scores_grid] = isanomaly(f,[X1(:),X2(:)]);
scores_grid = reshape(scores_grid,size(X1,1),size(X2,2));
```

Create a scatter plot of the training observations and a contour plot of the anomaly scores. Flag the true outliers and the outliers detected by `iforest`.

```matlab
idx = setdiff(1:1000,noise);
scatter(u(idx,1),u(idx,2),[],[0.5 0.5 0.5],".")
hold on
scatter(u(noise,1),u(noise,2),"ro","filled")
scatter(u(tf,1),u(tf,2),60,"ko")
contour(X1,X2,scores_grid,"ShowText","on")
legend(["Normal Points" "Outliers" "Detected Outliers"],Location="best")
colorbar
hold off
```

### Examine `NumObservationsPerLearner` for Small Data

For each isolation tree, `iforest` samples `min(N,256)` observations from the training data without replacement, where `N` is the number of training observations. Keeping the sample size small helps to detect dense anomalies and anomalies close to normal points. However, you need to experiment with the sample size if `N` is small.

This example shows how to train isolation forest models for small data with various sample sizes, create plots of anomaly score values versus sample sizes, and visualize identified anomalies.

**Load Sample Data**

Load Fisher's iris data set.

```matlab
load fisheriris
```

The data contains four measurements (sepal length, sepal width, petal length, and petal width) from three species of iris flowers. The matrix `meas` contains all four measurements for 150 flowers.

**Train Isolation Forest with Various Sample Sizes**

Train isolation forest models with various sample sizes, and obtain anomaly scores for the training observations.

```matlab
s = NaN(150,150);
rng("default")
for i = 3:150
    [~,~,s(:,i)] = iforest(meas,NumObservationsPerLearner=i);
end
```

Divide the observations into three groups based on the average scores, and create plots of anomaly scores versus sample sizes.

```matlab
score_threshold1 = 0.5;
score_threshold2 = 0.55;
m = mean(s,2,"omitnan");
ind1 = find(m < score_threshold1);
ind2 = find(m <= score_threshold2 & m >= score_threshold1);
ind3 = find(m > score_threshold2);
figure
t = tiledlayout(3,1);
nexttile
plot(s(ind1,:)')
title(join(["Observations with average score < " score_threshold1]))
nexttile
plot(s(ind2,:)')
title(join(["Observations with average score in [" ...
    score_threshold1 " " score_threshold2 "]"]))
nexttile
plot(s(ind3,:)')
title(join(["Observations with average score > " score_threshold2]))
xlabel(t,"Number of Observations for Each Tree")
ylabel(t,"Anomaly Score")
```

For the observations whose average score values are less than 0.5, the anomaly score decreases as the sample size increases. For the observations whose average score values are greater than 0.55, the anomaly score increases with the sample size and then roughly converges once the sample size reaches 50.

Detect anomalies in training observations by using isolation forest models with the sample sizes 50 and 100. Specify the fraction of anomalies in the training observations as 0.05.

```matlab
[f1,tf1,scores1] = iforest(meas,NumObservationsPerLearner=50, ...
    ContaminationFraction=0.05);
[f2,tf2,scores2] = iforest(meas,NumObservationsPerLearner=100, ...
    ContaminationFraction=0.05);
```

Display the observation indices of the anomalies.

```matlab
find(tf1)
```
```
ans = 7×1

    14
    42
   110
   118
   119
   123
   132
```

```matlab
find(tf2)
```
```
ans = 7×1

    14
    15
    16
   110
   118
   119
   132
```

The two isolation forest models have five anomalies in common.

**Visualize Anomalies**

For the isolation forest model with the sample size 50, visually compare observation values between normal points and anomalies. Create a matrix of grouped histograms and grouped scatter plots for each combination of variables by using the `gplotmatrix` function.

```matlab
tf1 = categorical(tf1,[0 1],["Normal Points" "Anomalies"]);
predictorNames = ["Sepal Length" "Sepal Width" ...
    "Petal Length" "Petal Width"];
gplotmatrix(meas,[],tf1,"kr",".x",[],[],[],predictorNames)
```

For high-dimensional data, you can visualize data by using only the important features. You can also visualize data after reducing the dimension by using t-SNE (t-Distributed Stochastic Neighbor Embedding).

Visualize observation values using the two most important features selected by the `fsulaplacian` function.

```matlab
idx = fsulaplacian(meas)
```
```
idx = 1×4

     3     4     1     2
```

```matlab
gscatter(meas(:,idx(1)),meas(:,idx(2)),tf1,"kr",".x",[],"on", ...
    predictorNames(idx(1)),predictorNames(idx(2)))
```

Visualize observation values after reducing the dimension by using the `tsne` function.

```matlab
Y = tsne(meas);
gscatter(Y(:,1),Y(:,2),tf1,"kr",".x")
```

## References

[1] Liu, F. T., K. M. Ting, and Z.-H. Zhou. "Isolation Forest." *2008 Eighth IEEE International Conference on Data Mining*, Pisa, Italy, 2008, pp. 413–422.

## See Also

`iforest` | `IsolationForest` | `isanomaly`