Does the function ClassificationTree.fit automatically prune?

Asked by Niklas Axelsson on 30 Jun 2012
Latest activity Commented on by Ilya on 1 Jul 2012

Dear All,

I am currently trying to construct a classification tree for a response variable Y from several explanatory variables X. I want to use CART, so I am trying the function ClassificationTree.fit(X,Y) in MATLAB.

My variable Y has two categories, 's' and 'n'. Class 'n' is rare: only about 5% of the observations belong to it, so the large majority of the Ys are of class 's'.

When I construct the tree, I get about 8-10 levels, and the terminal nodes contain very few observations. Let the grown tree be denoted tree. If I then run

[~,~,~,bestLevel] = cvLoss(tree,'subtrees','all');

I get that bestLevel is the root (!), meaning every future prediction would be of just one class. Could it be that my predictors in X are bad, or am I doing something very wrong here?

I was also wondering: when constructing the initial tree, does ClassificationTree.fit() automatically prune the tree to an "optimal size" before returning it, or does it grow as big a tree as possible and leave the pruning to the user?

1 Answer

Answer by Ilya on 30 Jun 2012
Accepted answer

I described strategies for learning on imbalanced data in this post: http://www.mathworks.com/matlabcentral/answers/11549-leraning-classification-with-most-training-samples-in-one-category

The easiest thing to do is to set 'prior' to 'uniform', so that the tree treats both classes as equally likely instead of using the observed class frequencies.
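As a rough illustration of the 'prior' option, here is a minimal sketch on made-up data. The synthetic X and Y, the seed, and the variable names are assumptions for illustration only, not the asker's data. It uses the ClassificationTree.fit syntax discussed in this thread; newer releases provide fitctree for the same purpose.

```matlab
% Hypothetical synthetic data: roughly 5% of observations are class 'n'.
rng(1);                               % fixed seed for repeatability
n = 2000;
isRare = rand(n,1) < 0.05;            % logical index of the rare class
X = randn(n,2);
X(isRare,:) = X(isRare,:) + 1.5;      % shift the rare class so it is partly separable
Y = repmat({'s'},n,1);
Y(isRare) = {'n'};

% Default: class priors are the empirical class frequencies.
treeEmp = ClassificationTree.fit(X,Y);

% Uniform priors: both classes carry equal total weight.
treeUni = ClassificationTree.fit(X,Y,'prior','uniform');

disp(treeEmp.ClassNames')             % order of the classes in Prior
disp(treeEmp.Prior)                   % empirical class frequencies
disp(treeUni.Prior)                   % equal priors
```

Comparing the cross-validated loss of the two trees would then show whether the uniform prior keeps more of the tree after pruning on your own data.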

For some data, the optimal pruning level can be equal to the largest pruning level, which leaves only the root. This is not necessarily an indication that something went wrong.

Take a look at the 'MergeLeaves' and 'Prune' parameters in the doc for ClassificationTree.fit. The doc for 'Prune' says that ClassificationTree computes the optimal sequence of pruned subtrees. The tree itself is not pruned; only the optimal sequence is computed.

The doc for 'MergeLeaves' says that ClassificationTree merges leaves that originate from the same parent node and whose risk values sum to a value greater than or equal to the risk associated with the parent node. That is, ClassificationTree applies a minimal amount of pruning, just for the leaves. If the tree prunes by classification error (the default), this amounts to merging sibling leaves that predict the same class.
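To make the distinction concrete, here is a minimal sketch that grows a tree, inspects the stored pruning sequence, and then prunes explicitly. It assumes the fisheriris sample data that ships with Statistics Toolbox; the variable names are illustrative.

```matlab
load fisheriris                       % sample data: meas (predictors), species (labels)
tree = ClassificationTree.fit(meas,species);

% The returned tree is not pruned; PruneList only records, for each node,
% the level at which that node would be pruned away.
maxLevel = max(tree.PruneList);       % pruning to this level leaves only the root

% Choose a pruning level by cross-validation over all subtrees.
[~,~,~,bestLevel] = cvLoss(tree,'subtrees','all');

% Pruning is a separate, explicit step.
pruned = prune(tree,'level',bestLevel);

fprintf('Best level: %d of %d\n', bestLevel, maxLevel);
fprintf('Nodes before: %d, after: %d\n', ...
    numel(tree.PruneList), numel(pruned.PruneList));
```

If bestLevel comes out equal to maxLevel, as in the question, cross-validation found that no subtree beats the root-only tree on the loss being used.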

2 Comments

Niklas Axelsson on 30 Jun 2012

Thanks for the answer! That makes sense. I am still not sure why the pruning algorithm decides to get rid of all the branches and leave just the root, which seems very weird.

At the moment I am just creating my tree using: tree = ClassificationTree.fit(X,Y,'PredictorNames',{'...'})

I do want to use CART. Do you think using 'fitensemble' or 'TreeBagger' instead would make a difference? What is the difference between them, and is either one better?

Ilya on 1 Jul 2012

Please refer to the doc for Statistics Toolbox for a description of fitensemble and TreeBagger. The User Guide section should answer your question.

Also, if you found my answer useful, please accept it.

Ilya
