Leave-one-out cross-validation with svmtrain gives 'impossible' accuracy results

I am using svmtrain to perform leave-one-out cross-validation on some data I have access to, and I noticed that some of the generated SVM models achieve 0% accuracy on a binary classification problem with hundreds of examples.
Picking the wrong binary choice that many times in a row is essentially impossible by chance, so I figured something was wrong with my SVM implementation. To isolate the problem, I wrote a test program that generates a random feature matrix as training input and random binary values as training output. Even with this setup, some SVM models generated by svmtrain give 0% accuracy, although the output is completely random and uncorrelated with the input.
Can anyone explain what I am doing wrong? I have included the test program source below:
% clear workspace
clear;
clc;
pause on;
% seed the random number generator
rng('default');
% initialize variables
n_sets = 1000;    % number of random data sets to test
n_pairs = 20;     % observations per data set
n_features = 2;   % features per observation
% preallocate classification accuracy
accuracy = zeros(1, n_sets);
for i = 1:n_sets
    fprintf('\nSet #%i\n', i);
    % generate random feature matrix
    training_input = single(rand(n_pairs, n_features));
    % generate random binary class labels
    training_output = single(rand(n_pairs, 1) > 0.5);
    % initialize correct-prediction counter
    correct = 0;
    % perform leave-one-out cross-validation
    for j = 1:n_pairs
        % copy inputs for this SVM model
        model_training_input = training_input;
        model_training_output = training_output;
        % blind training to the jth observation
        % (svmtrain treats a NaN group label as missing and ignores that row)
        model_training_output(j) = NaN;
        % train the SVM on every row of the feature matrix except the jth
        svm_model = svmtrain(model_training_input, model_training_output, ...
            'autoscale', false);
        % test the model on the held-out jth row
        prediction = svmclassify(svm_model, training_input(j, :));
        % check whether the prediction was correct
        if prediction == training_output(j), correct = correct + 1; end
    end
    accuracy(i) = correct / n_pairs;
    fprintf('Accuracy = %s\n', num2str(accuracy(i)));
    if accuracy(i) == 0 || accuracy(i) == 1
        fprintf('WTF\n');
        pause;
    end
end

Accepted Answer

Ilya on 17 Apr 2012
You are missing my point about the majority class. Let me try again.
Suppose you generate 200 observations and assign labels at random. By chance you can get a situation in which exactly 100 observations are from one class (say A) and 100 are from the other class (say B). The probability of this is substantial:
>> binopdf(100,200,.5)
ans =
    0.0563
This is your training set.
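For readers without the Statistics Toolbox, the same value can be checked with a short Python sketch (the 20-observation figure for the original test program's n_pairs = 20 is my addition, not part of the answer):

```python
from math import comb

# Probability that 200 fair-coin labels split exactly 100/100
# (the same quantity as MATLAB's binopdf(100, 200, 0.5))
p200 = comb(200, 100) * 0.5**200
print(round(p200, 4))   # 0.0563

# For the 20-observation mock-up (n_pairs = 20), an exact 10/10
# split is even more likely:
p20 = comb(20, 10) * 0.5**20
print(round(p20, 4))    # 0.1762
```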
Now you remove one observation from the training set at a time. When you remove an observation of class A, you have 99 observations of class A and 100 observations of class B left. The SVM model cannot find a good decision boundary (because the classes are inseparable) and predicts everything into the majority class, that is, class B. The predicted label for the removed observation is incorrect.
Now you remove an observation of class B from the same set. Now you have 100 observations of class A and 99 of class B. The majority class is now A. Your SVM model predicts everything into A. Again the predicted label for the removed observation is incorrect.
Therefore the leave-one-out error for this training set (with two class sizes equal) is going to be 100%.
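This argument can be checked with a tiny simulation. The Python sketch below substitutes an explicit majority-vote predictor for the SVM's fall-back behavior (that substitution is my assumption, not part of the answer) and reproduces the 100% leave-one-out error on a perfectly balanced label set:

```python
# 10 observations of class 0 and 10 of class 1 -- a perfectly balanced set
labels = [0] * 10 + [1] * 10

errors = 0
for j in range(len(labels)):
    # leave observation j out
    train = labels[:j] + labels[j + 1:]
    # inseparable classes: predict the majority class of the remaining 19
    majority = 1 if sum(train) * 2 > len(train) else 0
    # the held-out observation always belongs to the (new) minority class
    errors += (majority != labels[j])

print(errors / len(labels))  # 1.0 -- every prediction is wrong
```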
If you use if(prediction==single(rand(1)>0.5)), you get the expected accuracy because you are not comparing with the label of the removed observation.
It's possible something else is going on in your data, but I do not see any evidence of anything else going on in your description.
  1 Comment
Luke on 17 Apr 2012
I see what you are saying now, thanks! Just before I logged on here, I confirmed that this was the problem. I implemented the same setup using libsvm instead of MATLAB's svmtrain and ran into the same issue. For the sets with 0% prediction accuracy, the training set contains equal numbers of positive and negative examples.


More Answers (1)

Ilya on 17 Apr 2012
I don't know what went wrong with your real data. In this mock-up exercise, you are trying to separate two classes that are essentially inseparable. SVM often fails to find any good decision boundary and classifies everything into the majority class. You can see, for instance, that all 20 observations are used as support vectors; this is an indication that SVM is not doing anything useful. When you generate 20 observations, 10 in each class, and remove one of them, the majority class is opposite to the class of the observation you have removed. That's why the incorrect class is predicted more often than the correct class.
Generally, leave-one-out CV is not a good choice. 10-fold CV is usually better.
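To see how often this balanced-split pathology strikes across many random data sets, here is a small simulation (a Python sketch; a majority-class predictor again stands in for the SVM fallback, which is an assumption on my part):

```python
import random

random.seed(0)
n_sets, n_pairs = 1000, 20

zero_acc_sets = 0
for _ in range(n_sets):
    labels = [random.random() > 0.5 for _ in range(n_pairs)]
    correct = 0
    for j in range(n_pairs):
        train = labels[:j] + labels[j + 1:]      # leave one out
        majority = sum(train) * 2 > len(train)   # predict the majority class
        correct += (majority == labels[j])
    if correct == 0:
        zero_acc_sets += 1

# Fraction of data sets with 0% LOO accuracy; it should be close to the
# probability of an exact 10/10 split, binopdf(10, 20, 0.5) ~ 0.176
print(zero_acc_sets / n_sets)
```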
  1 Comment
Luke on 17 Apr 2012
Ilya, thanks for the feedback. However, I intentionally made the classes inseparable so the SVM couldn't learn a pattern and predict the opposite result to get 0% accuracy. Also, although the class distribution is 50/50, each class assignment is independent of the others (i.e., if 9 are 0, there is still a 50% probability that the next class is 1), so the probability of the algorithm predicting the opposite of every class assignment is about 1 in 2^20, or roughly 1 in a million (yet it still happens). Keep in mind this also happens when I let n_pairs = 100 or n_pairs = 200, so something strange is going on. Moreover, if I change if(prediction==training_output(j)) to if(prediction==single(rand(1)>0.5)), I get the expected distribution of accuracies. To the SVM, each new element of training_output is as random as rand(1), so this implies that the SVM is somehow obtaining knowledge about training_input(j) when it is supposed to be blinded. I am wondering if there is some problem with svmtrain involving contiguous memory locations of the stored vector.

