How can I remove outliers by using mahalanobis distance?

Question

Mooklada Chaisorn, September 1, 2020

Direct link to this question:

https://kr.mathworks.com/matlabcentral/answers/587417-how-can-i-remove-outliers-by-using-mahalanobis-distance

Commented: Mooklada Chaisorn, September 6, 2020

I have a normalized data table of 3568 rows and 24 columns. I calculate the Mahalanobis distance for each row using the code below. How can I use these distances to remove outliers? Is there a rule of thumb, such as removing rows whose distance lies above or below a certain percentage? Please advise, as I am trying to create several scenarios for my dataset.

For example,

scenario 0: clean missing data only, no outlier removal
scenario 1: remove outliers using the mean method
scenario 2: remove outliers using the Mahalanobis distance

Thank you for all your help

% DATA = 3568 x 24 table of normalized numeric variables
X = table2array(DATA);          % mahal needs a numeric matrix, not a table
n = size(X,1);                  % number of rows (observations)
d_mahal_DATA = zeros(n,1);      % leave-one-out distances, one per row
format short e
for i = 1:n
    ref = X([1:i-1, i+1:n],:);              % reference sample: every row except row i
    d_mahal_DATA(i) = mahal(X(i,:), ref);   % note: mahal returns the SQUARED distance
end
d_mahal_DATA % size 3568 x 1

Answer 1

Pratyush Roy, September 4, 2020

Direct link to this answer:

https://kr.mathworks.com/matlabcentral/answers/587417-how-can-i-remove-outliers-by-using-mahalanobis-distance#answer_489664

One can use p-values obtained from a chi-squared distribution to flag outliers based on the Mahalanobis distance.

The function mahal returns the squared Mahalanobis distance, which is approximately chi-squared distributed when the data are multivariate normal. The p-values for the distance array 'd_mahal_DATA' can be computed using the function chi2cdf, available in Statistics and Machine Learning Toolbox.

P_val = chi2cdf(d_mahal_DATA,m,'upper') % m, the number of columns (24 here), is the degrees of freedom of the chi-squared distribution.

Threshold this P_val array at a chosen significance level α.

If P_val(i) is less than α for some i, the ith row is considered an outlier. Varying α changes how strict the rejection is: a smaller α removes fewer rows.

A typical choice is α = 0.05. The degrees of freedom are not a tuning parameter; they equal the number of variables (for example, 2 for bivariate data).

See the following documentation page for further details:

https://www.mathworks.com/help/stats/chi2cdf.html
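
The thresholding step described in this answer can be sketched as follows. This is only an illustration under stated assumptions: the matrix X is synthetic (it is not the asker's DATA), the distances are computed against the full sample rather than leave-one-out, the upper-tail probability is used as the p-value, and the degrees of freedom are taken to be the number of columns. The variable names are made up for the example.

```matlab
% Synthetic stand-in for the normalized data (assumption: not the asker's DATA)
rng(0);
X = randn(500, 4);
X(1:5, :) = X(1:5, :) + 8;          % plant five obvious outliers

m  = size(X, 2);                    % degrees of freedom = number of variables
d2 = mahal(X, X);                   % squared Mahalanobis distance of each row

alpha  = 0.05;                      % significance level (a choice, not a rule)
pUpper = chi2cdf(d2, m, 'upper');   % P(chi-squared with m dof >= d2)
isOut  = pUpper < alpha;            % large distance -> small upper-tail p-value
cutoff = chi2inv(1 - alpha, m);     % equivalent threshold applied directly to d2

Xclean = X(~isOut, :);              % keep only the rows that were not flagged
fprintf('Flagged %d of %d rows (cutoff on d2 = %.3f)\n', ...
        nnz(isOut), size(X, 1), cutoff);
```

Comparing d2 against chi2inv(1 - alpha, m) gives the same flags as thresholding the p-values, and it makes the cutoff distance explicit. The chi-squared reference is only approximate when the data are not close to multivariate normal.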

Mooklada Chaisorn, September 6, 2020

Thank you so much! I'll try your suggestion. As a first attempt, I used isoutlier with the percentiles [0, 95]:

dmh = array2table(d_mahal_DATA);
lowPercent  = 0; 
highPercent = 95; 
[outlierInd,pLow,pHigh,~] = isoutlier(dmh.d_mahal_DATA,"percentiles",[lowPercent, highPercent]);  
T_mahal = DATA(~outlierInd,:); % size 3390 x 24 table

I'm not sure whether this removes too many rows from DATA.
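
One way to reason about that concern is to compare how many rows each rule flags. The sketch below is hedged: it draws synthetic squared distances from a chi-squared distribution with 24 degrees of freedom to mimic outlier-free data with 24 variables (an assumption, not the asker's distances), then compares the percentile rule with chi-squared cutoffs at a few significance levels chosen for illustration.

```matlab
rng(1);
m  = 24;                            % number of variables, as in the question
d2 = chi2rnd(m, 3568, 1);           % synthetic squared distances for outlier-free data (assumption)

% Percentile rule: flags the top ~5% of distances regardless of their size
outPct  = isoutlier(d2, "percentiles", [0 95]);
fracPct = mean(outPct);

% Chi-squared cutoffs: the flagged fraction depends on alpha and on the data
alphas   = [0.05 0.01 0.001];
fracChi2 = arrayfun(@(a) mean(d2 > chi2inv(1 - a, m)), alphas);

disp(table(alphas.', fracChi2.', 'VariableNames', {'alpha', 'fractionFlagged'}));
fprintf('Percentile rule flags %.1f%% of rows\n', 100 * fracPct);
```

The [0, 95] percentile rule removes about 5% of the rows by construction (3390 of 3568 rows kept in the post above), even when the data contain no real outliers. A chi-squared cutoff with a small α only removes rows whose distance is large in absolute terms.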
