fitcdiscr bug: Why does "ClassNames" now have to be provided in alphanumerical order otherwise accuracy is terrible?

Question

Leon, 16 May 2024

Direct link to this question:

https://kr.mathworks.com/matlabcentral/answers/2119701-fitcdiscr-bug-why-does-classnames-now-have-to-be-provided-in-alphanumerical-order-otherwise-accu

Edited: Leon, 28 May 2024

[Update: There is a known bug in kfoldLoss. See answer and workaround from Mathworks technical support in the answers section]

I couldn't work out why I was getting terrible results (as if completely random) with fitcdiscr() and I've found out that it is because I wasn't specifying the ClassNames argument in alphabetical order. Comparing MATLAB 2024a to 2022, this is new behaviour and presumably a bug. One of the reasons to specify ClassNames can be to change the order for the results summary, etc.

Example code that gives terrible accuracy:

load fisheriris
Mdl = fitcdiscr(meas, species, "ClassNames", flip(unique(species)), "KFold", 10);
validationAccuracy = 1 - kfoldLoss(Mdl)
validationAccuracy = 0.3200

Simply removing flip() from the code above gives the expected accuracy of 0.98; with flip() it gives 0.32, which is roughly chance level for three classes. A side effect of unique() is that it sorts its output into alphanumerical order. That sorting makes no difference for fisheriris, because the observations already appear in alphabetical class order.
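To make the ordering behaviour of unique() concrete, here is a minimal sketch on a small made-up label list (the labels below are hypothetical and deliberately not in alphabetical order, unlike fisheriris). It only illustrates the difference between the default sorted output and the "stable" option; it does not touch fitcdiscr or kfoldLoss.

```matlab
% Hypothetical labels, deliberately NOT in alphabetical order.
labels = {'virginica'; 'setosa'; 'versicolor'; 'setosa'};

% Default behaviour: duplicates removed and the result sorted.
sortedNames = unique(labels)             % setosa, versicolor, virginica

% "stable": duplicates removed, order of first appearance kept.
stableNames = unique(labels, "stable")   % virginica, setosa, versicolor
```

So a ClassNames list built with "stable" follows whatever order the observations happen to arrive in, which is why shuffling the rows changes the class order passed to fitcdiscr.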

Here is some code that will give terrible accuracies if the order is randomised and the "stable" parameter is used to keep the random order:

load fisheriris
r = randperm(size(meas, 1));  % Randomise the order of the observations (rows)
Mdl = fitcdiscr(meas(r,:), species(r), "ClassNames", unique(species(r), "stable"), "KFold", 10);
validationAccuracy = 1 - kfoldLoss(Mdl)
validationAccuracy = 0.3267

If run a few times, I get results like:

validationAccuracy = 0.3267
validationAccuracy = 0.0067
validationAccuracy = 0.0067
validationAccuracy = 0.3267
validationAccuracy = 0.9800

Update: I have now been able to test the code on another computer that still has MATLAB 2022 installed and it gives the correct accuracy with the above code, so this appears to be a bug in the latest version of MATLAB! I have reported it to Mathworks.


Answer 1

Athanasios Paraskevopoulos, 17 May 2024

Direct link to this answer:

https://kr.mathworks.com/matlabcentral/answers/2119701-fitcdiscr-bug-why-does-classnames-now-have-to-be-provided-in-alphanumerical-order-otherwise-accu#answer_1459166

Edited: Athanasios Paraskevopoulos, 17 May 2024

Code with Issue:

load fisheriris
Mdl = fitcdiscr(meas, species, "ClassNames", flip(unique(species, "stable")), "KFold", 10);
validationAccuracy = 1 - kfoldLoss(Mdl)
validationAccuracy = 0.3200

Correct Code:

load fisheriris
Mdl = fitcdiscr(meas, species, "ClassNames", unique(species, "stable"), "KFold", 10);
validationAccuracy = 1 - kfoldLoss(Mdl)
validationAccuracy = 0.9800

Your observation indicates a potential bug in the latest version of MATLAB that should be reported to MathWorks. Until the issue is resolved, always specify the ClassNames argument in alphabetical order to ensure correct behavior.

Comments: 1

Leon, 17 May 2024

Edited: Leon, 17 May 2024

You've just repeated exactly what I said. That's not an answer. I literally said that removing flip() fixed the problem and it is obvious that I only put it there to demonstrate the problem. I also specified that sorting the ClassNames alphanumerically corrects the accuracy, and that I believe this is a bug (which I have reported to Mathworks). Furthermore, the actual functionality of ClassNames to specify the order of the class names for results, etc., remains broken regardless.

Answer 2

Leon, 28 May 2024

Direct link to this answer:

https://kr.mathworks.com/matlabcentral/answers/2119701-fitcdiscr-bug-why-does-classnames-now-have-to-be-provided-in-alphanumerical-order-otherwise-accu#answer_1464521

Edited: Leon, 28 May 2024

I received the following answer and workaround (using kfoldPredict) from Mathworks technical support:

The bug here lies within the "kfoldLoss" function, rather than within "fitcdiscr". This is a bug that the development team is aware of and is investigating. In the meantime, you can compute the loss by comparing the predicted and true class labels. For example, the following code will always return 0.98 in R2024a (this would not be the case using "kfoldLoss"):

load fisheriris
species = categorical(species);  % species is a cell array of character vectors; convert it to categorical so that == can compare labels
Mdl = fitcdiscr(meas, species, "ClassNames", flip(unique(species)), "KFold", 10);
predictedLabels = kfoldPredict(Mdl);
correctPredictions = sum(predictedLabels == species);
valAcc = correctPredictions/numel(species)

In my code, only the model is passed to a function, so I no longer have access to the response variable directly, but this works:

predictedLabels = kfoldPredict(Mdl);
correctPredictions = sum(predictedLabels == categorical(Mdl.Y));
valAcc = correctPredictions / numel(Mdl.Y);
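The label-comparison idea behind this workaround can be tried without a model at all. The following is a minimal sketch on made-up labels: the two toy arrays are hypothetical stand-ins for kfoldPredict(Mdl) and Mdl.Y, and converting both sides with string() is an assumption made here so that the same lines work whether the labels are a cell array of character vectors, a categorical array, or a string array.

```matlab
% Hypothetical stand-ins for Mdl.Y and kfoldPredict(Mdl).
trueLabels      = {'setosa'; 'versicolor'; 'virginica'; 'setosa'};
predictedLabels = {'setosa'; 'virginica';  'virginica'; 'setosa'};

% Compare by label text. Because the comparison is element-wise on the
% labels themselves, it does not depend on the order given in ClassNames.
isCorrect = string(predictedLabels) == string(trueLabels);

% Fraction of matching labels (3 of the 4 toy labels agree).
valAcc = mean(isCorrect)
```

With a real cross-validated model, the same two lines would be applied to kfoldPredict(Mdl) and Mdl.Y in place of the toy arrays.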
