I am currently trying to construct a classification tree for a variable Y using different explanatory variables X. I want to use CART and am therefore trying the function ClassificationTree.fit(X,Y) in MATLAB.
The thing is that my variable Y has two categories, 's' and 'n', where 'n' is very rare: only ~5% of the data belongs to this class, so the vast majority of the Ys are of class 's'.
When constructing the tree, I get about 8-10 levels, and the terminal nodes contain very few observations. Now, let the grown tree be denoted tree, so if I do the following: [~,~,~,bestLevel]=cvLoss(tree,'subtrees','all');
I get that bestLevel is the root (!), meaning every future prediction would be of just one class... Could it be that my predictor variables in X are bad, or am I doing something very wrong here?
I was also wondering: when constructing the initial tree, does the function ClassificationTree.fit() automatically prune the tree to an "optimal size" before returning it, or does it grow as big a tree as possible and leave it to the user to prune afterwards?
I described strategies for learning on imbalanced data in this post: http://www.mathworks.com/matlabcentral/answers/11549-leraning-classification-with-most-training-samples-in-one-category. The easiest thing to do is to set 'prior' to 'uniform'.
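As a rough illustration of the 'prior' suggestion, here is a minimal sketch on made-up data. The synthetic X and Y, the 1.5 shift, and the variable names are assumptions for the example only, not the asker's data. It assumes a release where ClassificationTree.fit is available; in newer releases fitctree should accept the same name-value pairs.

```matlab
% Synthetic, made-up data: about 5% of observations fall in the rare class 'n'.
rng(1);                                   % make the example reproducible
N = 2000;
isRare = rand(N,1) < 0.05;
X = randn(N,2) + 1.5*[isRare isRare];     % rare class is shifted in both predictors
Y = repmat({'s'},N,1);
Y(isRare) = {'n'};

% Default: class priors are the empirical class frequencies (roughly 0.05 / 0.95).
treeEmp = ClassificationTree.fit(X,Y);

% Uniform priors: both classes carry equal total weight when the tree is grown.
treeUni = ClassificationTree.fit(X,Y,'prior','uniform');

disp(treeEmp.Prior)                       % empirical priors, ordered as in ClassNames
disp(treeUni.Prior)                       % equal priors for the two classes

% Compare the cross-validated best pruning level under each prior.
[~,~,~,bestEmp] = cvLoss(treeEmp,'subtrees','all');
[~,~,~,bestUni] = cvLoss(treeUni,'subtrees','all');
fprintf('best level, empirical prior: %d\n', bestEmp);
fprintf('best level, uniform prior:   %d\n', bestUni);
```

With uniform priors, misclassifying a rare 'n' observation costs much more than under the empirical priors, so the cross-validated best level is less likely to collapse to the root. Whether it actually does depends on the data.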
The optimal pruning level can be equal to the largest pruning level (that is, the root alone) for some data. This is not necessarily an indication that something went wrong.
Take a look at 'MergeLeaves' and 'Prune' parameters in the doc for ClassificationTree.fit. The doc for 'Prune' says that ClassificationTree computes the optimal sequence of pruned subtrees. The tree is not pruned; just the optimal sequence is computed. The doc for 'MergeLeaves' says that ClassificationTree merges leaves that originate from the same parent node, and that give a sum of risk values greater or equal to the risk associated with the parent node. That is, ClassificationTree applies a minimal amount of pruning, just for the leaves. If the tree prunes by classification error (default), this amounts to merging leaves that share the most popular class per leaf.
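To make the behaviour of 'MergeLeaves' and 'Prune' concrete, here is a hedged sketch, again on made-up data (the synthetic X and Y and all variable names are assumptions). It grows a tree with leaf merging switched off, so nothing is pruned at fit time, and then prunes explicitly to the cross-validated best level.

```matlab
% Synthetic, made-up data with a rare class 'n' (about 5%).
rng(1);
N = 2000;
isRare = rand(N,1) < 0.05;
X = randn(N,2) + 1.5*[isRare isRare];
Y = repmat({'s'},N,1);
Y(isRare) = {'n'};

% 'MergeLeaves','off' : do not merge sibling leaves after growing.
% 'Prune','on'        : compute the optimal pruning sequence, but leave the tree unpruned.
fullTree = ClassificationTree.fit(X,Y,'MergeLeaves','off','Prune','on');

% Default settings for comparison ('MergeLeaves' is 'on' by default).
defTree = ClassificationTree.fit(X,Y);

% The returned trees are still full size; pruning is a separate, explicit step.
[~,~,~,bestLevel] = cvLoss(fullTree,'subtrees','all');
prunedTree = prune(fullTree,'level',bestLevel);

fprintf('nodes without leaf merging: %d\n', fullTree.NumNodes);
fprintf('nodes with default merging: %d\n', defTree.NumNodes);
fprintf('best level %d of %d, nodes after pruning: %d\n', ...
        bestLevel, max(fullTree.PruneList), prunedTree.NumNodes);
```

If bestLevel equals max(fullTree.PruneList), the pruned tree is the root alone, which is the situation described in the question.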