Cannot use 'histogram' to compute entropy
I'd like to compute the entropy of various vectors. I was going to use something like:
X = randn(1,100);
h1 = histogram(X, 'Normalization', 'Probability');
probabilities = h1.Values;
entropy = -sum(probabilities .* log2(probabilities ))
The second command however gives the error:
Undefined function 'c:\Program Files\MATLAB\R2019b\toolbox\matlab\specgraph\histogram.m' for input arguments of type 'double'.
But surely that's exactly what the standard Matlab function 'histogram' expects?! Doing a
which histogram
indeed returns
C:\Program Files\MATLAB\R2019b\toolbox\matlab\specgraph\histogram.m
which is the newest file (by modified date) from several of that name that (sadly) exist in my Matlab folder. I believe this should be the standard Matlab function 'histogram'.
If on the other hand in the above example I use 'hist' instead of 'histogram', I get the scalar value for entropy that I expect. However, I know 'hist' is not recommended, not least because with it one cannot specify the normalization type.
So, my question is: is using 'hist' for computing probabilities ok, or should I try something else to be able to use 'histogram' instead?
Comments: 13
Walter Roberson
9 Sep 2021
Please show the output of
dbtype histogram 1:5
Question: does your code just happen to assign a value to a variable named histogram at some point?
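A minimal sketch of how both possibilities could be checked, assuming nothing beyond the built-in `which`, `exist` and `clear` commands; the variable name `isShadowedByVariable` is only illustrative:

```matlab
% List every 'histogram' MATLAB can see; the first entry is the one it calls.
which -all histogram

% A workspace variable named 'histogram' would shadow the function.
isShadowedByVariable = exist('histogram', 'var') == 1;
if isShadowedByVariable
    clear histogram   % remove the variable so the function is visible again
end
```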
z8080
9 Sep 2021
Bjorn Gustavsson
9 Sep 2021
The NaN you get in the second variant is most likely because one or more of your probabilities are zero.
z8080
9 Sep 2021
Walter Roberson
9 Sep 2021
Right, you have to filter out the items with count 0.
Bjorn Gustavsson
9 Sep 2021
Edited: Bjorn Gustavsson
9 Sep 2021
It is the zeros in the probabilities that lead to NaNs: you get terms of the form 0*log2(0) in the entropy calculation, and since log2(0) is -Inf, each such term evaluates to NaN. Note that there are no zeros in the probabilities vector in your second example.
Bjorn Gustavsson
9 Sep 2021
It should be simple enough to remove those zero-probability-bins:
probs = probabilities;
entropy = -sum(probs(probs(:)>0) .* log2(probs(probs(:)>0) ))
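For reference, a sketch of the same calculation that does not need a figure at all, assuming `histcounts` (the non-plotting counterpart of `histogram`) is acceptable. The result is stored in `H` rather than `entropy`, on the assumption that a function named `entropy` may exist on the path (the Image Processing Toolbox ships one); the fixed seed is there only to make the sketch repeatable:

```matlab
rng(0);                         % fixed seed, only so the sketch is repeatable
X = randn(1, 100);

% histcounts returns the bin values without drawing anything.
probabilities = histcounts(X, 'Normalization', 'probability');

% Keep only non-empty bins, so that no 0*log2(0) term appears.
nonzero = probabilities > 0;
H = -sum(probabilities(nonzero) .* log2(probabilities(nonzero)));
```

The number of bins is still chosen automatically here, so `H` depends on that choice.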
z8080
9 Sep 2021
You have a finite sample of a distribution, and you are not specifying bin edges or the number of bins.
Under those circumstances, histogram() is documented as using the data to create bins of uniform width that cover the range of the data and reveal the shape of the distribution. However, there is no documentation of the algorithm it uses to select the bin width (and hence the number of bins), and the relevant code is inside a .p file so we cannot look at it.
So you let histogram choose uniform bins in your finite distribution of data, using an unknown algorithm to select the bin widths, and some of the bins come up zero counts.
syms N positive
p = 1/10;
thresh = 1/100;
n = solve((1-p)^N == thresh)
vpa(n)
This calculates that for a bin with 10% probability, the chance that the bin is still empty after N samples is 0.9^N, and you would have to take more than 43 samples before that chance dropped below 1/100. So, with finite samples, probability happens.
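The same figure can be checked numerically without the Symbolic Math Toolbox. This is a sketch of the arithmetic only; the names `N_exact`, `N_min`, `pEmpty43` and `pEmpty44` are illustrative:

```matlab
p = 1/10;          % probability that one sample lands in the bin
thresh = 1/100;    % acceptable probability that the bin stays empty

% Solve (1-p)^N == thresh for N by taking logarithms.
N_exact = log(thresh) / log(1 - p);   % roughly 43.7
N_min = ceil(N_exact);                % smallest whole number of samples

% After N samples the bin is empty with probability (1-p)^N.
pEmpty43 = (1 - p)^43;   % still above thresh
pEmpty44 = (1 - p)^44;   % below thresh
```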
Walter Roberson
10 Sep 2021
Depending on your knowledge of the distribution, it might make sense to ask for the counts, take max(1,counts) to substitute a nominal hit for each empty bin, and then calculate the probabilities from that, as adjusted_counts ./ sum(adjusted_counts) .
The fewer samples you have, the more that distorts the probabilities; the more samples you have, the less likely you are to need it.
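A small sketch of that adjustment, using made-up counts purely for illustration (the `counts` vector below is not from the original data):

```matlab
counts = [0 3 12 30 35 15 5 0];            % made-up bin counts, two bins empty
adjusted_counts = max(1, counts);          % give each empty bin one nominal hit
probabilities = adjusted_counts ./ sum(adjusted_counts);

% No probability is zero any more, so no filtering is needed.
H = -sum(probabilities .* log2(probabilities));
```

With 100 real samples the two nominal hits shift the probabilities by about 2%, which illustrates the distortion mentioned above.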
But I do recommend figuring out the number of bins yourself somehow, or else you are going to continue to be at the mercy of its undocumented method of selecting the number of bins.
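One way this could look, as a sketch: pass explicit bin edges to `histcounts`. The range of -4 to 4 and the 16 bins are assumptions chosen for standard-normal data, not values from the thread; samples outside the edges are not counted, so the probabilities are normalised by the number of samples that were actually binned.

```matlab
rng(0);                          % fixed seed, only so the sketch is repeatable
X = randn(1, 100);

edges = linspace(-4, 4, 17);     % 16 uniform bins of width 0.5 (assumed range)
counts = histcounts(X, edges);   % samples outside the edges are dropped

probabilities = counts ./ sum(counts);
nonzero = probabilities > 0;     % skip empty bins to avoid 0*log2(0)
H = -sum(probabilities(nonzero) .* log2(probabilities(nonzero)));
```

Because the edges are fixed, entropies computed this way for different vectors are based on the same binning and can be compared with each other.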
Answers (0)
