Why does sequentialfs always outperform cross-validation with selected features?

Why does the classification accuracy obtained from sequentialfs with cross-validation always exceed the accuracy from a subsequent 10-fold cross-validation using the selected features? Any help would be gratefully received!
Thanks in advance.
Barry
See the code below: Acc_fs (77%) is always higher than Acc (67%). This finding holds across multiple tests - the accuracy reported by sequentialfs always beats the independently cross-validated accuracy. Is this a bug in my implementation or an issue with sequentialfs.m?
%************** Perform feature selection ************
c = cvpartition(Labels,'k',num_folds);   % stratified k-fold partition used by sequentialfs
opts = statset('display','iter');        % print progress at each selection step
fun = @(x_train,y_train,x_test,y_test) SVM_class_fun(x_train,y_train,x_test,y_test,kernel,rbf_sigma,boxconstraint);
[fs,history] = sequentialfs(fun,Data,Labels,'cv',c,'options',opts);
Acc_fs = 1 - history.Crit(end);          % history.Crit(end) is the CV misclassification rate of the final subset
%******* Cross validated classification accuracy *******
Feature_select = find(fs);               % column indices of the selected features
Vars_select = Variables(fs);             % variable names of the selected features
indices = crossvalind('Kfold',Labels,num_folds);
Results = classperf(Labels,'Positive',1,'Negative',0);   % initialize performance tracker
for i = 1:num_folds
    test = (indices == i); train = ~test;
    svmStruct = svmtrain(Data(train,Feature_select),Labels(train), ...
        'Kernel_Function','rbf','rbf_sigma',rbf_sigma,'boxconstraint',boxconstraint);
    class = svmclassify(svmStruct,Data(test,Feature_select));
    classperf(Results,class,test);       % accumulate this fold's results
end
Acc = Results.CorrectRate;               % overall classification accuracy
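As an aside, this second estimate can be written more compactly by reusing the same loss function with crossval. A minimal sketch, assuming Data, Labels, fun, fs and num_folds as defined above (c2 and Acc2 are my names):

c2 = cvpartition(Labels,'k',num_folds);                 % fresh partition, independent of the one used for selection
mce = crossval(fun,Data(:,fs),Labels,'partition',c2);   % misclassification counts, one per fold
Acc2 = 1 - sum(mce)/numel(Labels);                      % overall cross-validated accuracy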
The function SVM_class_fun returns the number of misclassified samples:
function MCE = SVM_class_fun(x_train,y_train,x_test,y_test,kernel,rbf_sigma,boxconstraint)
% Note: the kernel input is currently unused - the kernel function is hard-coded to 'rbf' below.
svmStruct = svmtrain(x_train,y_train,'Kernel_Function','rbf','rbf_sigma',rbf_sigma,'boxconstraint',boxconstraint);
y_fit = svmclassify(svmStruct,x_test);
C = confusionmat(y_test,y_fit);   % confusion matrix on the held-out fold
N = sum(C(:));                    % total number of test samples
MCE = N - sum(diag(C));           % number of misclassified samples
end

Accepted Answer

Ilya on 24 Jan 2012
I don't know whether your code is correct, but accuracy estimates obtained by sequential feature selection are always biased high.
Consider, say, 10 random variables, and suppose you wish to find the one with the largest true mean. Suppose these variables are in fact identically distributed. Generate a separate sample for each variable. Because the samples are of finite size, their estimated means will not be equal. You then choose the variable whose sample average is largest and conclude that it has the largest true mean. But all you did was pick the variable whose estimated mean came out largest by chance, and since that estimate is the largest of the ten, it is likely above the true mean. If you now generate a fresh sample for the chosen variable, the new estimate will tend to be lower than the previous one. Sequential feature selection does the same thing: at every step it keeps whichever feature subset gives the best cross-validated accuracy, so the reported accuracy inherits this optimistic bias.
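To make this concrete, here is a small simulation of the thought experiment (my own illustration, with made-up sample sizes): ten variables that all have true mean 0, where we repeatedly pick the one with the largest sample mean and then re-estimate that winner on fresh data.

% Selection bias demo: 10 identically distributed variables, true mean 0
rng(1);                                   % for reproducibility
nVars = 10; nSamp = 30; nTrials = 1000;
selEst = zeros(nTrials,1);                % mean of the selected variable's original sample
freshEst = zeros(nTrials,1);              % mean of a fresh sample for that same variable
for t = 1:nTrials
    X = randn(nSamp,nVars);               % one N(0,1) sample per variable
    selEst(t) = max(mean(X));             % pick the variable with the largest sample mean
    freshEst(t) = mean(randn(nSamp,1));   % all variables are N(0,1), so a fresh draw suffices
end
fprintf('Mean of selected estimates: %+.3f\n', mean(selEst));   % noticeably above 0
fprintf('Mean of fresh re-estimates: %+.3f\n', mean(freshEst)); % close to 0

The selected estimate averages well above the true mean of 0, while the fresh re-estimate does not - the same gap you see between Acc_fs and Acc.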
This is exactly why you need to re-estimate the accuracy by another run of cross-validation after selection is done.

