Why does sequentialfs always outperform cross-validation with selected features?

Why is the classification accuracy reported by sequentialfs with cross-validation always higher than the accuracy from a separate 10-fold cross-validation using the selected features? Any help would be gratefully received!
Thanks in advance.
Barry
See the code below: Acc_fs (77%) is always higher than Acc (67%). This holds across multiple tests; the accuracy obtained using sequentialfs always exceeds the cross-validated accuracy. Is this a bug in my implementation or an issue with sequentialfs.m?
%************** Perform feature selection ************
c = cvpartition(Labels,'k',num_folds);
opts = statset('display','iter');
fun = @(x_train,y_train,x_test,y_test)SVM_class_fun(x_train,y_train,x_test,y_test,kernel,rbf_sigma,boxconstraint);
[fs,history] = sequentialfs(fun,Data,Labels,'cv',c,'options',opts);
Acc_fs = 1 - history.Crit(end);
%******* Cross validated classification accuracy *******
Feature_select = find(fs==1); % Features selected
Vars_select = Variables(fs==1); % Variable names of features selected
indices = crossvalind('Kfold',Labels,num_folds);
Results = classperf(Labels, 'Positive', 1, 'Negative', 0); % Initialize
for i = 1:num_folds
test = (indices == i); train = ~test;
svmStruct = svmtrain(Data(train,Feature_select),Labels(train),'Kernel_Function','rbf','rbf_sigma',rbf_sigma,'boxconstraint',boxconstraint);
class = svmclassify(svmStruct,Data(test,Feature_select));
classperf(Results,class,test);
end
Acc = Results.CorrectRate; % Classification accuracy
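As a sanity check on the second estimate, the same 10-fold accuracy on the selected features can also be computed with crossval and a cvpartition (a sketch reusing the variables above, not the code I ran; it should agree with Acc up to the randomness of the partition):
c2 = cvpartition(Labels,'k',num_folds); % fresh stratified k-fold partition
fun2 = @(x_train,y_train,x_test,y_test)SVM_class_fun(x_train,y_train,x_test,y_test,kernel,rbf_sigma,boxconstraint);
mce = crossval(fun2,Data(:,Feature_select),Labels,'partition',c2); % misclassified count per fold
Acc_check = 1 - sum(mce)/sum(c2.TestSize); % overall correct rate, should be close to Acc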
The function SVM_class_fun returns the number of misclassified samples:
function MCE = SVM_class_fun(x_train,y_train,x_test,y_test,kernel,rbf_sigma,boxconstraint)
svmStruct = svmtrain(x_train,y_train,'Kernel_Function',kernel,'rbf_sigma',rbf_sigma,'boxconstraint',boxconstraint);
y_fit = svmclassify(svmStruct,x_test);
C = confusionmat(y_test,y_fit);
N = sum(sum(C));
MCE = N - sum(diag(C)); % Number of misclassified samples
end
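If I read the sequentialfs documentation correctly, the values returned by fun are summed over the test sets and divided by the total number of test observations, so returning the raw misclassification count (rather than a rate) is correct, and history.Crit(end) is the cross-validated misclassification rate of the final feature set. A sketch that reproduces it, reusing the partition c and the logical mask fs from above (any small difference would come from randomness inside svmtrain):
mce_fold = zeros(c.NumTestSets,1); % per-fold misclassification counts
for k = 1:c.NumTestSets
    tr = training(c,k); te = test(c,k); % logical indices for fold k
    mce_fold(k) = SVM_class_fun(Data(tr,fs),Labels(tr),Data(te,fs),Labels(te),kernel,rbf_sigma,boxconstraint);
end
crit_check = sum(mce_fold)/sum(c.TestSize); % should match history.Crit(end), i.e. 1 - Acc_fs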

Accepted Answer

Ilya on 24 January 2012
I don't know whether your code is correct, but accuracy estimates obtained by sequential selection are always biased high.
Consider, say, 10 random variables with identical distributions, and suppose you wish to find the one with the largest true mean. Generate a separate sample for each variable. Because the samples are of finite size, their estimated means will not be equal. You then choose the sample with the largest average and conclude that the respective variable has the largest true mean. But all you did was pick the variable whose estimated mean came out largest by chance, and since that estimate is the largest, it is likely above the true mean. If you then generate another sample for the chosen variable, the new estimate will tend to come out lower than the previous one, because the true mean is below the estimated mean.
This is exactly why you need to re-estimate the accuracy by another run of cross-validation after selection is done.
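Here is a tiny simulation of that argument (just an illustration with made-up numbers, not your data): ten variables with identical distributions and true mean 0, pick the one whose sample mean comes out largest, then draw a fresh sample for the winner and compare the two estimates.
rng(0); % for repeatability
nVars = 10; nObs = 30; nReps = 1000;
selectedEst = zeros(nReps,1); % estimate used to pick the "best" variable
freshEst = zeros(nReps,1);    % re-estimate from a new sample
for r = 1:nReps
    samples = randn(nObs,nVars);           % all true means are 0
    selectedEst(r) = max(mean(samples,1)); % the largest sample mean wins
    freshEst(r) = mean(randn(nObs,1));     % fresh sample for the chosen variable
end
mean(selectedEst) % biased high, well above the true mean of 0
mean(freshEst)    % close to the true mean of 0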
