removing outliers

Question

1 개 추천

Hi,

I have data which is by event for n number of companies (not time series data). Visually, I can see that there are outliers but I don't know which method to use to remove these outliers using matlab. Any help is appreciated

댓글 수: 0
이전 댓글 -2개 표시 이전 댓글 -2개 숨기기

댓글을 달려면 로그인하십시오.

이 질문에 답변하려면 로그인하십시오.

Follow Question

Answer 1

Richard Willey 2011년 4월 1일

MATLAB Online에서 열기

4 개 추천

Automatically detecting outliers is tricky stuff.

You normally need fairly precise information regarding your data as well as the model that you are fitting to your data.

Here's a relatively simple technique that will work for many types of linear models. The methodology is based on a statistics called "Cook's Distance" that you can extract from regstats.

Cook's Distance for a given data point measures the extent to which a regression model would change if this data point were excluded from the regression. Cook's Distance is sometimes used to suggest whether a given data point might be an outlier.

Here's a simple example illustrating how this works

% Create a vector of X values
X = 1:100;
X = X';
% Create a noise vector
noise = randn(100,1);
% Create a second noise value where sigma is much larger
noise2 = 10*randn(100,1);
% Substitute noise2 for noise1 at obs# (11, 31, 51, 71, 91)
% Many of these points will have an undue influence on the model 
noise(11:20:91) = noise2(11:20:91);
% Specify Y = F(X)
Y = 3*X + 2 + noise;
% Cook's Distance for a given data point measures the extent to 
% which a regression model would change if this data point 
% were excluded from the regression. Cook's Distance is 
% sometimes used to suggest whether a given data point might be an outlier.
% Use regstats to calculate Cook's Distance
stats = regstats(Y,X,'linear');
% if Cook's Distance > n/4 is a typical treshold that is used to suggest
% the presence of an outlier
potential_outlier = stats.cookd > 4/length(X);
% Display the index of potential outliers and graph the results
X(potential_outlier)
scatter(X,Y, 'b.')
hold on
scatter(X(potential_outlier),Y(potential_outlier), 'r.')

댓글 수: 2
없음 표시 없음 숨기기

Mark Shore 2011년 4월 1일

Looks interesting but unfortunately requires the Statistics Toolbox.

Anirudh Thatipelli 2018년 5월 24일

Thanks for referring to Cook's distance @Richard Wiley. It has been a great help for me in removing outliers.

댓글을 달려면 로그인하십시오.

Answer 2

Matt Fig 2011년 3월 26일

MATLAB Online에서 열기

0 개 추천

What form is the data? You might be able to use logical indexing. For example:

% x is some data with outliers 99 and -70.  We want only 0<x<10.
x = [2 3 2 3 1 4 2 3 4 99 2 3 2 -70];
x = x(x<10);  % Take those values less than 10
x = x(x>0);  % Take those values greater than zero.

.

You could also do this in one shot, as below.

% x is some data with outliers 99 and -70.  We want only 0<x<10.
x = [2 3 2 3 1 4 2 3 4 99 2 3 2 -70];
x = x(x<10 & x>0)

댓글 수: 1
이전 댓글 -1개 표시 이전 댓글 -1개 숨기기

joseph Frank 2011년 3월 27일

the data is in % stock returns. it will be difficult to set a subjective cut off point. I am wondering if there is another way t determine what is outlier and what is not

댓글을 달려면 로그인하십시오.

Answer 3

Walter Roberson 2011년 3월 27일

0 개 추천

"outlier" is mathematically a matter of interpretation.

What is the outlier in this data?

1 2 3 1 2 3

Answer: 2, because the underlying process is believed to create 2 only 1 time in 1000 compared to 1 or 3, so for 2 to show up twice is unusual for this data.

But if you only had the data, how would you know that?

Thus, in order for a program to determine what is an "outlier" or not, you need to encode a model about what is "typical" data and what is not.

댓글 수: 5
이전 댓글 3개 표시 이전 댓글 3개 숨기기

joseph Frank 2011년 4월 1일

I have used 3 standard deviations away from the mean to remove outliers and I still have some.I have no clue how to compute the 1st derivative. If you have any instructions I will follow them to compute the 1st derivative

Walter Roberson 2011년 4월 1일

Sometimes it is more effective to compute deviations with a "leave one out" method: if this point was not already part of the dataset, how many deviations away from the mean would it be of the (smaller) dataset?

Three standard deviations is 99.7%; possibly for your purposes, a looser test such as 2.5 standard deviations is warranted.

댓글을 달려면 로그인하십시오.

removing outliers

댓글 수: 0
이전 댓글 -2개 표시 이전 댓글 -2개 숨기기

답변 (3개)

댓글 수: 2
없음 표시 없음 숨기기

댓글 수: 1
이전 댓글 -1개 표시 이전 댓글 -1개 숨기기

댓글 수: 5
이전 댓글 3개 표시 이전 댓글 3개 숨기기

카테고리

태그

Community Treasure Hunt

removing outliers

댓글 수: 0 이전 댓글 -2개 표시 이전 댓글 -2개 숨기기

답변 (3개)

댓글 수: 2 없음 표시 없음 숨기기

댓글 수: 1 이전 댓글 -1개 표시 이전 댓글 -1개 숨기기

댓글 수: 5 이전 댓글 3개 표시 이전 댓글 3개 숨기기

카테고리

태그

참고 항목

Community Treasure Hunt

댓글 수: 0
이전 댓글 -2개 표시 이전 댓글 -2개 숨기기

댓글 수: 2
없음 표시 없음 숨기기

댓글 수: 1
이전 댓글 -1개 표시 이전 댓글 -1개 숨기기

댓글 수: 5
이전 댓글 3개 표시 이전 댓글 3개 숨기기