Learning classification with most training samples in one category

My question is not MATLAB specific but more theoretical.
I'm currently using boosting to create a two-class classifier, and my weak learners are trees. While I have a fairly large number of training examples in both classes, most of them belong to a single class. My intuition is that this imbalance in the training set would bias the resulting classifier away from a "fair" one, towards one that favors the class with more examples.
Am I right? What are the accepted ways to cope with this issue?
Thanks in advance!

Accepted Answer

Ilya on 14 Jul 2011
The answer depends on how you define a "fair" classifier. If the ultimate goal of your analysis is to minimize the overall classification error and if the class proportions in the training set are representative of the real world, you get an optimal classifier from your imbalanced data. If the class proportions in the training set are not what you normally expect or if you want to assign different costs for misclassification of the majority and minority classes, you would need to adjust your learning method accordingly.
In general, there are 4 ways of dealing with skewed data:
1. Adjusting class prior probabilities to reflect realistic proportions.
2. Adjusting misclassification costs to represent realistic penalties.
3. Oversampling the minority class.
4. Undersampling the majority class.
For binary classification, strategies 1 and 2 are equivalent.
If you use fitensemble or TreeBagger, the easiest thing would be to set 'prior' to 'uniform' for an equal mix, or to whatever proportions you like.
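For example, a minimal sketch with fitensemble (the AdaBoostM1 choice, the number of learners, and the variable names X and Y are just for illustration):

% Train a boosted tree ensemble treating both classes as equally likely,
% regardless of their proportions in the training data.
ens = fitensemble(X, Y, 'AdaBoostM1', 200, 'Tree', 'prior', 'uniform');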
If you like oversampling or undersampling, nothing in official MATLAB is available out of the box. It wouldn't be too hard to code though.
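For instance, here is a minimal sketch of random undersampling of the majority class, assuming a numeric two-class label vector Y and a feature matrix X (illustrative names):

% Split indices by class and make idxA the minority class.
classes = unique(Y);
idxA = find(Y == classes(1));
idxB = find(Y == classes(2));
if numel(idxA) > numel(idxB)
    [idxA, idxB] = deal(idxB, idxA);
end
% Keep all minority examples plus a random, equally sized majority subset.
perm = randperm(numel(idxB));
idx = [idxA; idxB(perm(1:numel(idxA)))];
Xbal = X(idx,:);
Ybal = Y(idx);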
For undersampling the majority class, personally I had good experience with RUSBoost:
Seiffert, C., Khoshgoftaar, T., Van Hulse, J., and Napolitano, A. (2008) RUSBoost: Improving classification performance when training data is skewed, in International Conference on Pattern Recognition, pp. 1–4.
For oversampling the minority class, a popular method is SMOTE. You might want to look into its boosting extension.
Chawla, N., Bowyer, K., Hall, L., and Kegelmeyer, W. (2002) SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321–357.
Chawla, N., Lazarevic, A., Hall, L., and Bowyer, K. (2003) SMOTEBoost: Improving prediction of the minority class in boosting, in 7th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD 2003), Lecture Notes in Computer Science, vol. 2838, Springer-Verlag, pp. 107–119.
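If you want to experiment before committing to a full implementation, here is a rough sketch of the core SMOTE interpolation step, assuming the minority-class observations are in a matrix Xmin and knnsearch from the Statistics Toolbox is available (the choice of k and all names are illustrative):

% Generate one synthetic sample per minority observation by
% interpolating towards a randomly chosen minority-class neighbor.
k = 5;                                  % neighbors to consider
nbrs = knnsearch(Xmin, Xmin, 'K', k+1); % column 1 is each point itself
nbrs = nbrs(:, 2:end);
n = size(Xmin, 1);
pick = nbrs(sub2ind(size(nbrs), (1:n)', randi(k, n, 1)));
gap = rand(n, 1);                       % interpolation factor in [0,1]
Xsyn = Xmin + bsxfun(@times, gap, Xmin(pick,:) - Xmin);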
  3 Comments
Ilya on 2 Jul 2012
As I replied to your other post, the easiest thing to do would be to set 'prior' to 'uniform'.
An ensemble is usually more accurate than a single tree, whether you learn on balanced or imbalanced data. In either case, you have to optimize your classifier to get the best result. The issue is selecting the right tree size to have enough sensitivity to the minority class without over-training.
If you go with a single decision tree, you can set 'MinParent' to 1 to grow a deep tree and then find the optimal pruning level. If you want to use TreeBagger, you can use it with default parameters. Every tree in TreeBagger by default is grown to the deepest level, and the high variance is removed by averaging. If you go with one of the boosting algorithms available from fitensemble, you would need to optimize the tree size by playing with the 'MinLeaf' or 'MinParent' options. The default for boosting is growing stumps (trees with two leaves), and stumps may not have enough sensitivity to the minority class. In that case, I would start by setting 'MinLeaf' to one half of the number of observations in the minority class. It's impossible to say in advance whether bagging or boosting would work best for you.
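For the boosting case, a sketch of what that might look like (ClassificationTree.template is how tree options were passed to fitensemble in releases of that era; the boosting method and number of learners are illustrative):

% Set MinLeaf to roughly half the minority class size so the trees
% are deep enough to be sensitive to the minority class.
classes = unique(Y);
nMin = min([sum(Y == classes(1)), sum(Y == classes(2))]);
t = ClassificationTree.template('MinLeaf', round(nMin/2));
ens = fitensemble(X, Y, 'AdaBoostM1', 200, t, 'prior', 'uniform');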
Ilya on 3 Jul 2012
I should also mention that RUSBoost is one of the fitensemble options in R2012b. Here is how you can get the 12b pre-release: http://www.mathworks.com/support/solutions/en/data/1-5NTATZ/index.html?solution=1-5NTATZ
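Once you are on that release, the call is straightforward ('RUSBoost' is the method name listed for fitensemble in R2012b; the number of learners is illustrative):

% RUSBoost combines boosting with random undersampling of the majority class.
ens = fitensemble(X, Y, 'RUSBoost', 300, 'Tree');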


