How to check for and remove outliers when the data are not normally distributed

J1 on 18 Nov 2015
Commented: Star Strider on 19 Nov 2015
Many people say that standardizing with z-score or mapstd is a good way to detect outliers. But the z-score is only useful when the data are normally distributed, and I found that my data do not follow a normal distribution. What should I do? (1) Should I transform my data (Box-Cox or Johnson transformation) to a normal distribution and then use the z-score to detect outliers? (2) After transforming the data and removing the outliers, should I use the transformed data or the original data (with the outliers removed from both) as the input to the neural network? I found that if I feed the transformed data (Johnson transformation) into the neural network, it performs worse than with the original data. Why is that?
Can anybody help? Thanks a lot.

Accepted Answer

Star Strider on 18 Nov 2015
The z-score is frequently used because, according to the Central Limit Theorem, the means (or sums) of sufficiently many independent observations tend to be normally distributed regardless of the underlying distribution. (There is more to it than this simple statement, but that is the most basic explanation.) The individual observations themselves keep their original distribution, however large the sample.
If you know how your data are distributed, you can get the ‘critical values’ of that distribution at the 0.025 and 0.975 probabilities and use them as your decision criteria to reject outliers. Again, outlier detection and rejection is a topic that goes beyond this simple explanation, and I encourage you to explore it on your own. How you decide to implement it with your data is something you will have to experiment with.
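As a minimal sketch of the critical-value approach described above, assuming the Statistics Toolbox is available. The sample `x`, the choice of the 'Lognormal' family, and all variable names are hypothetical and only for illustration; substitute whatever distribution actually describes your data.

```matlab
% Sketch: reject points outside the 2.5% and 97.5% critical values of a
% fitted distribution. Requires the Statistics Toolbox.
rng(0);                                  % reproducible illustration
x = lognrnd(0, 0.5, 200, 1);             % hypothetical skewed sample (assumption)

pd   = fitdist(x, 'Lognormal');          % assumed distribution family; use your own
crit = icdf(pd, [0.025 0.975]);          % lower and upper critical values

isOutlier = x < crit(1) | x > crit(2);   % points beyond either critical value
x_clean   = x(~isOutlier);               % data with the flagged points removed
fprintf('Flagged %d of %d points\n', nnz(isOutlier), numel(x));
```

By construction, roughly 5% of perfectly ordinary points fall outside these limits, so this is better treated as a screening rule than as proof that a point is bad.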
  3 Comments
Greg Heath on 19 Nov 2015
As I mentioned in my answer, zscore is so useful for detecting outliers in non-normal distributions that I use it most of the time.
Again: for outlier detection I recommend using the combination of zscore and plots with all non-binary data.
Greg
Star Strider on 19 Nov 2015
My pleasure.
A data set with n>30 will approximate a normal distribution if it is otherwise t-distributed, but you would have to look at your data to see whether they approximate a normal distribution. If you have any doubts about their distribution, I would use one of the histogram functions and, if you have the Statistics Toolbox, the histfit function.
The most reliable way to determine if your data are normally distributed is to use the Statistics Toolbox Kolmogorov-Smirnov test, implemented in the kstest function. Another related test for the normal and other distributions is the Lilliefors test, implemented in the lillietest function.
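The checks mentioned above might be run roughly as follows. This is only a sketch assuming the Statistics Toolbox; the exponential sample `x` is hypothetical and chosen because it is clearly non-normal.

```matlab
% Sketch: visual and formal normality checks. Requires the Statistics Toolbox.
rng(0);                           % reproducible illustration
x = exprnd(2, 500, 1);            % hypothetical, clearly non-normal sample (assumption)

figure
histfit(x)                        % histogram with a fitted normal curve overlaid

% kstest compares against the STANDARD normal by default, so standardize first.
xs  = (x - mean(x)) / std(x);
hKS = kstest(xs);                 % 1 = reject normality at the 5% level

% lillietest allows for the mean and std being estimated from the data.
hLF = lillietest(x);              % 1 = reject normality at the 5% level
fprintf('kstest: %d, lillietest: %d\n', hKS, hLF);
```

Because the mean and standard deviation are estimated from the same sample, kstest used this way tends to be conservative, which is the case lillietest is designed for.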
If you don’t have the Statistics Toolbox, one simple test is to see if the median approximates the mean. It should for normally-distributed data (and for other symmetric distributions), but generally will not for skewed distributions. (I leave the interpretation of ‘approximates’ to you, in the context of your data. They should be virtually the same for normally-distributed data.) You can also use the randn function with the mean and std of your data, then use a histogram function to compare the simulated data with your own. The randn call would be (with ‘data’ being your data):
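data_mean = mean(data);   % sample mean of your data
data_std = std(data);     % sample standard deviation of your data
data_sim = data_mean + data_std*randn(size(data));   % normal sample with the same mean, std, and size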
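Putting the mean/median check and the simulated comparison together could look like the sketch below. It uses base MATLAB only (the histogram function needs R2014b or later); the skewed sample `data` and the variable `gap` are hypothetical stand-ins for your own data and notation.

```matlab
% Sketch: base-MATLAB normality checks, no toolbox needed.
rng(0);                               % reproducible illustration
data = -2*log(rand(500, 1));          % hypothetical right-skewed sample (assumption)

data_mean = mean(data);
data_std  = std(data);
data_sim  = data_mean + data_std*randn(size(data));   % normal sample, same mean and std

% For symmetric data the mean and median nearly coincide; a large gap suggests skew.
gap = (data_mean - median(data)) / data_std;
fprintf('(mean - median)/std = %.2f\n', gap);

figure
histogram(data);  hold on
histogram(data_sim);                  % overlay the simulated normal sample
legend('data', 'simulated normal');  hold off
```

If the two histograms differ visibly in shape, the data are unlikely to be well described by a normal distribution.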
If your data turn out to be normally-distributed, you can certainly use the z-score reliably to scale them or test them with respect to detecting outliers. In the limit (which is to say a huge number of observations), the CLT would certainly apply. However N=89 is not huge, so you will have to analyse your data and see how they are distributed.


More Answers (1)

Greg Heath on 18 Nov 2015
Regardless of the distribution, I find that a combination of zscore with plots of original and transformed data is sufficient for me to detect outliers. Whether points are deleted or replaced by a reduced value depends on how I interpret the plots.
If you have doubts you can always make multiple models based on original and modified data.
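One possible way to combine zscore with plots of the original and transformed data is sketched below. The sample `x`, the planted value of 40, the cutoff of 3, and the log transform are all assumptions made for illustration, not part of the answer above; zscore requires the Statistics Toolbox.

```matlab
% Sketch: flag candidate outliers with zscore, then inspect them in plots.
rng(1);                                  % reproducible illustration
x = [-2*log(rand(100, 1)); 40];          % hypothetical skewed sample plus one planted extreme value

z      = zscore(x);                      % standardized values (Statistics Toolbox)
thresh = 3;                              % assumed cutoff, adjust after looking at the plots
idx    = abs(z) > thresh;                % candidate outliers

figure
subplot(2,1,1)
plot(x, '.');  hold on
plot(find(idx), x(idx), 'ro');           % circle the flagged points
title('Original data');  hold off

subplot(2,1,2)
plot(log(x), '.');  hold on
plot(find(idx), log(x(idx)), 'ro');      % same points after a log transform
title('Log-transformed data');  hold off
```

The flags are only candidates; whether a circled point is deleted, replaced, or kept is still a judgment made from the plots.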
Hope this helps.
Thank you for formally accepting my answer
Greg
  2 Comments
J1 on 19 Nov 2015
If we find that there are outliers, should I look for more variables to predict my output? For example, I use weather data to predict product sales. If I find that an outlier is due to a promotion or some other cause, should I add these causes as new input variables to the neural network for prediction?
Greg Heath on 19 Nov 2015
Outliers are usually isolated points that are the result of bad measurements or bad transcriptions. Therefore they should be removed. However, if you plot the data, very often you can guess the approximate true value of the measurement. Then you have the option of replacing the outlier with the approximation.
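Replacing an isolated bad point with an approximation could be done, for example, by interpolating from its neighbours. This is a sketch in base MATLAB under stated assumptions: the sine series `y`, the planted bad value, and the cutoff of 3 are hypothetical, and linear interpolation is just one reasonable choice of approximation.

```matlab
% Sketch: replace a flagged point with a value interpolated from its neighbours.
t = linspace(0, 2*pi, 50)';
y = sin(t);                        % hypothetical smooth measurement series (assumption)
y(20) = 5;                         % planted bad measurement

z   = (y - mean(y)) / std(y);      % z-scores computed without the toolbox
idx = abs(z) > 3;                  % assumed cutoff of 3 standard deviations

y_fixed      = y;
good         = find(~idx);         % indices of the points that are kept
y_fixed(idx) = interp1(good, y(good), find(idx));   % linear interpolation over good points
```

Note that interp1 returns NaN for a flagged point at either end of the series, since there is no neighbour on one side to interpolate from.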
