Clean Outlier Data

Find, fill, or remove outliers in the Live Editor

Since R2019b

Description

The Clean Outlier Data task lets you interactively handle outliers in data. The task automatically generates MATLAB® code for your live script.

Using this task, you can:

  • Find, fill, or remove outliers from data in a workspace variable.

  • Customize the methods for finding and filling outliers.

  • Visualize the outlier data and cleaned data.

Clean Outlier Data task in the Live Editor

Open the Task

To add the Clean Outlier Data task to a live script in the MATLAB Editor:

  • On the Live Editor tab, select Task > Clean Outlier Data.

  • In a code block in the script, type a relevant keyword, such as outlier or clean. Select Clean Outlier Data from the suggested command completions.

Examples

Interactively remove outliers from a table using the Clean Outlier Data task in the Live Editor.

Create a table using patient height and weight data from a sample file.

load("patients.mat","Height","Weight")
T = table(Height,Weight);
head(T)
    Height    Weight
    ______    ______

      71       176  
      69       163  
      64       131  
      67       133  
      64       119  
      68       142  
      64       142  
      68       180  

Open the Clean Outlier Data task in the Live Editor. To clean the patient data, select T as the input data. Then, to operate on both the Height and Weight variables, select All supported variables.

The Clean Outlier Data task can fill or remove outlier data. To remove the table rows corresponding to patients with outlier height or weight measurements, use the Cleaning method field to select Remove outliers. Then, to define outliers as elements below the 10th percentile or above the 90th percentile, use the Detection method field to select Percentiles.

Then, to visualize the cleaned height and weight data, use the Variable to display field to select all variables.

Figure: Live task output with two plots. The Height plot is titled "Number of outliers cleaned: 8" and the Weight plot is titled "Number of outliers cleaned: 18". Each plot shows the input data, cleaned data, outliers, data removed by other variables, and outlier thresholds.
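For reference, the same cleaning step can be sketched programmatically. This sketch assumes that the Remove outliers and Percentiles options correspond to the rmoutliers function with the "percentiles" method and thresholds [10 90]. The code that the task actually generates may differ, and the names Tclean and rowRemoved are illustrative.

```matlab
% Load the sample data and build the table used in the example.
load("patients.mat","Height","Weight")
T = table(Height,Weight);

% Remove every row in which Height or Weight falls below the 10th
% percentile or above the 90th percentile of its variable.
[Tclean,rowRemoved] = rmoutliers(T,"percentiles",[10 90]);

% rowRemoved is a logical column vector marking the rows dropped from T.
fprintf("Removed %d of %d rows.\n",nnz(rowRemoved),height(T))
```

Because a row is removed when either variable contains an outlier, the number of removed rows can be smaller than the sum of the per-variable outlier counts.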

Parameters

This task operates on input data contained in a vector, table, or timetable. The data can be of type single or double.

For table or timetable input data, to clean all variables with type single or double, select All supported variables. To choose which single or double variables to clean, select Specified variables.
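As a rough programmatic counterpart to Specified variables, the outlier functions accept a DataVariables name-value argument. The sketch below assumes that you want to clean only Weight and leave Height untouched. The "clip" fill method is an arbitrary choice for illustration.

```matlab
load("patients.mat","Height","Weight")
T = table(Height,Weight);

% Clip outliers in Weight only; Height is passed through unchanged.
B = filloutliers(T,"clip","DataVariables","Weight");
```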

Specify the method for filling outliers as one of these options.

Fill Method | Description
Linear interpolation | Linear interpolation of neighboring, nonoutlier values
Constant value | Specified scalar value, which is 0 by default
Convert to missing | Convert to default definition of standard missing value
Center value | Center value determined by the detection method
Clip to threshold value | Lower threshold value for elements smaller than the lower threshold determined by the detection method; upper threshold value for elements larger than the upper threshold determined by the detection method
Previous value | Previous nonoutlier value
Next value | Next nonoutlier value
Nearest value | Nearest nonoutlier value
Spline interpolation | Piecewise cubic spline interpolation
Shape-preserving cubic interpolation (PCHIP) | Shape-preserving piecewise cubic spline interpolation
Modified Akima cubic interpolation | Modified Akima cubic Hermite interpolation
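To see how several of these fill methods behave, the following sketch applies the filloutliers function to a small vector. It assumes that the task options map to the filloutliers fill methods named in the comments. The vector and the output variable names are illustrative.

```matlab
A = [57 59 60 100 59 58 57 58 300 61 62 60 62 58 57];

% With no detection method given, filloutliers uses the median method,
% which flags 100 and 300 in this vector.
Blinear   = filloutliers(A,"linear");    % interpolate from neighbors
Bconstant = filloutliers(A,0);           % constant value 0
Bcenter   = filloutliers(A,"center");    % center value (the median here)
Bclip     = filloutliers(A,"clip");      % clip to the thresholds
Bprevious = filloutliers(A,"previous");  % previous nonoutlier value
Bpchip    = filloutliers(A,"pchip");     % shape-preserving cubic

disp(Blinear([4 9]))   % both outliers become 59.5
```

Linear interpolation replaces 100 with the midpoint of its neighbors 60 and 59, and 300 with the midpoint of 58 and 61, so both filled values are 59.5.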

Specify the detection method for finding outliers as one of these options.

Method | Description
Moving median | Outliers are defined as elements more than the specified threshold of local scaled median absolute deviations (MAD) from the local median over a specified window. The default threshold is 3.
Median | Outliers are defined as elements more than the specified threshold of scaled MAD from the median. The default threshold is 3. For input data A, the scaled MAD is defined as c*median(abs(A-median(A))), where c=-1/(sqrt(2)*erfcinv(3/2)).
Mean | Outliers are defined as elements more than the specified threshold of standard deviations from the mean. The default threshold is 3. This method is faster but less robust than Median.
Quartiles | Outliers are defined as elements more than the specified threshold of interquartile ranges above the upper quartile (75 percent) or below the lower quartile (25 percent). The default threshold is 1.5. This method is useful when the input data is not normally distributed.
Grubbs | Outliers are detected using Grubbs’ test, which removes one outlier per iteration based on hypothesis testing. This method assumes that the input data is normally distributed.
Generalized extreme studentized deviate (GESD) | Outliers are detected using the generalized extreme studentized deviate test for outliers. This iterative method is similar to Grubbs but can perform better when multiple outliers are masking each other.
Moving mean | Outliers are defined as elements more than the specified threshold of local standard deviations from the local mean over a specified window. The default threshold is 3.
Percentiles | Outliers are defined as elements outside of the percentile range specified by an upper and lower threshold. The default lower percentile threshold is 10, and the default upper percentile threshold is 90. Valid threshold values are in the interval [0, 100].
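The scaled MAD formula for the Median method can be checked by hand. The sketch below computes the thresholds from the formula and compares them with the isoutlier function, assuming that the task's Median method corresponds to isoutlier with the "median" method. The sample vector and variable names are illustrative.

```matlab
A = [57 59 60 100 59 58 57 58 300 61 62 60 62 58 57];

% Scaled MAD, following the formula given for the Median method.
c = -1/(sqrt(2)*erfcinv(3/2));            % approximately 1.4826
scaledMAD = c*median(abs(A - median(A)));

% Thresholds for the default threshold factor of 3.
lowerThresh = median(A) - 3*scaledMAD;
upperThresh = median(A) + 3*scaledMAD;
TFmanual = A < lowerThresh | A > upperThresh;

% isoutlier also returns the lower threshold, upper threshold, and center.
[TF,L,U,C] = isoutlier(A,"median");
isequal(TF,TFmanual)

% Other detection methods follow the same calling pattern.
TFmean  = isoutlier(A,"mean");
TFquart = isoutlier(A,"quartiles");
TFgesd  = isoutlier(A,"gesd");
TFpct   = isoutlier(A,"percentiles",[10 90]);
```

The constant c is positive even though the formula carries a leading minus sign, because erfcinv(3/2) is negative.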

Specify the window type and size when the method for detecting outliers is Moving median or Moving mean.

Window | Description
Centered | Specified window length centered about the current point
Asymmetric | Specified window containing the number of elements before the current point and the number of elements after the current point

Window sizes are relative to the X-axis variable units.
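The window options can be sketched with the isoutlier function. This sketch assumes that a centered window corresponds to a scalar window length, an asymmetric window to a two-element vector, and the X-axis variable to the SamplePoints name-value argument. The data and variable names are illustrative.

```matlab
% Sine wave with one corrupted sample, timestamped hourly.
x = -2*pi:0.1:2*pi;
A = sin(x);
A(47) = 0;
t = datetime(2017,1,1,0,0,0) + hours(0:length(x)-1);

% Centered window: 5 elements centered about the current point.
TFcentered = isoutlier(A,"movmedian",5);

% Asymmetric window: 2 elements before and 1 element after the current point.
TFasym = isoutlier(A,"movmedian",[2 1]);

% With sample points, the window is measured in the units of t (5 hours here).
TFtime = isoutlier(A,"movmedian",hours(5),"SamplePoints",t);
```

When sample points are datetime values, the window length must be a duration, which is why the last call passes hours(5) rather than a plain number.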

Version History

Introduced in R2019b
