Confusion regarding statistical tests for a given distribution
I am attempting to describe some data I have with a distribution, but I am seeing some behavior I do not understand.
I have a dataset that I would describe as "obviously normal"; see the attached probplot and histfit.
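For reference, plots like these can be produced roughly as follows (a sketch, assuming data is a numeric vector):
% hypothetical reconstruction of the diagnostic plots mentioned above
histfit(data)              % histogram with a fitted normal overlaid
figure
probplot('normal', data)   % normal probability plot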
Running some tests, I can mostly confirm it is normal as well:
>> lillietest(data)
ans =
0
>> pd = fitdist(data, 'normal'); kstest(data, 'cdf', pd)
ans =
logical
0
>> pd = fitdist(data, 'normal'); chi2gof(data, 'cdf', pd)
ans =
1
I am just having a hard time understanding why the data fails the chi-square goodness-of-fit test. I have more than 7000 data points, and I simply cannot see how it could not be Gaussian.
Attached is the data.
Answers (2)
John D'Errico
8 June 2021
Edited: John D'Errico, 8 June 2021
It looks FAIRLY normal. But you have a whole crapload of data. It needs to look more normal than that. When you have a lot of data, it had better be darned tootin normal. Said differently, what those tests did not tell you is how badly it fails:
[H,P] = chi2gof(data, 'cdf', pd)
H =
1
P =
0.032397
The default significance level for the test is 0.05, so your data was just over the line. By the way, if I look at that probplot, it has a few squiggles that were apparently just enough to push it over the line.
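If you want to require stronger evidence before rejecting, chi2gof also accepts an 'Alpha' name-value pair; a minimal sketch of rerunning the test at the 1% level:
% sketch: repeat the test at a 1% significance level instead of the default 5%
pd = fitdist(data, 'normal');
[H, P] = chi2gof(data, 'cdf', pd, 'Alpha', 0.01)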
dpb
8 June 2021
Edited: dpb, 8 June 2021
You need to examine what the function returns when run with its default parameters --
>> [h,p,stats]=chi2gof(data)
h =
1.00
p =
0.03
stats =
struct with fields:
chi2stat: 13.76
df: 6.00
edges: [-0.47 -0.28 -0.19 -0.10 -0.01 0.08 0.17 0.26 0.36 0.45]
O: [50.00 211.00 833.00 1856.00 2133.00 1461.00 554.00 89.00 13.00]
E: [35.88 225.21 855.97 1818.58 2162.62 1439.93 536.40 111.61 13.80]
>>
NB: there are only 6 DOF in the output statistic (9 bins, minus 1, minus the 2 estimated parameters) --
However, it does show to some degree what the normality plot shows: the LH tail is "heavy", with more observations at the lower extreme than expected. With the coarse binning, this is enough to reject the hypothesis at the default level of significance. (Correspondingly, it is a little light on the RH end.)
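One way to see which bins drive the rejection is to look at the per-bin contributions in the stats struct above; a rough sketch:
% per-bin Pearson residuals: positive means more observations than the fit predicts
resid = (stats.O - stats.E) ./ sqrt(stats.E);
[stats.edges(1:end-1).' stats.edges(2:end).' resid.']   % bin edges with their residuals
sum(resid.^2)                                           % reproduces stats.chi2stat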
>> [h,p,stats]=chi2gof(data,"NBins",40)
h =
0
p =
0.57
stats =
struct with fields:
chi2stat: 29.04
df: 31.00
edges: [1×35 double]
O: [1×34 double]
E: [1×34 double]
>>
Let's look at how well that guess worked; one should have at least 5 observations in each bin (*):
>> [min(stats.O), max(stats.O);min(stats.E), max(stats.E)]
ans =
5.00 550.00
5.18 559.11
>>
As the NIST handbook notes, one of the weaknesses of the chi-square test is that there is no optimal binning algorithm, so the results can be sensitive to the choice made.
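To see that sensitivity directly, one could sweep the bin count; a quick sketch:
% sketch: how the p-value moves with the number of bins requested
for nb = [10 20 30 40 50]
    [~, p] = chi2gof(data, 'NBins', nb);
    fprintf('NBins = %2d  ->  p = %.3f\n', nb, p)
end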
John D'Errico has a very valid point that lots of data means the test can reject more easily. Depending on how the data will be used, these deviations from normality are probably not of practical significance -- unless, of course, you're doing something like estimating from the tails, in which case a normal approximation will likely underestimate/overestimate the observed data frequency somewhat in the left/right tails, respectively.
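If the tails are what matter for your application, a direct check is to compare empirical tail quantiles against those of the fitted normal; roughly (assuming pd is the fitted distribution from above):
% sketch: observed vs fitted-normal quantiles in the tails
pd = fitdist(data, 'normal');
pr = [0.001 0.01 0.05 0.95 0.99 0.999];
[quantile(data, pr); icdf(pd, pr)]   % row 1: empirical, row 2: fitted normal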
(*) My memory refresher is always the NIST handbook; what it says about Chi-Square GOF is at https://www.itl.nist.gov/div898/handbook/eda/section3/eda35f.htm
I tend to rely upon the Shapiro-Wilk test, which I don't believe TMW has implemented; I have a homebrew version I coded 40 years ago...