How do I efficiently deal with categorical variables in the TREEFIT function in the Statistics Toolbox?

Sometimes the TREEFIT function gives an out-of-memory error when I use categorical variables with many categories. Is there a limit on the number of categories TREEFIT can handle?

Accepted Answer

MathWorks Support Team on 27 Jun 2009
The TREEFIT function does not impose a limit on the number of distinct values that a categorical variable can have. This function has to create arrays whose sizes depend on the number of categories, though, and MATLAB does have a limit on array sizes. The related solution listed below gives more information on this.
Here is a description of the issue with categorical variables. During decision tree fitting, each node is split based on ONE variable. Suppose X has 100 distinct values. If X is numeric, then there are 99 possible ways to split at that node based on the value of X:
x<=X1, x<=X2, ..., x<=X99
If X is categorical, there are 2^99 - 1 = 6.3383e+29 possible ways of splitting. This is the number of ways of assigning subsets of categories to the left and right nodes, with at least one category in each node. Searching over all of these splits is usually not what we want. Sometimes it may be, and there may be simplifications that spare the function from examining every split. Nevertheless, experience shows that people usually do not want to treat X as categorical if it has a large number of categories.
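The two counts above can be checked with a few lines of arithmetic. The sketch below is written in Python only to illustrate the counting. It is not part of TREEFIT, and the function names are made up for this example.

```python
def numeric_split_count(n_distinct):
    """Threshold splits (x <= v) available for an ordered variable
    with n_distinct distinct values."""
    return n_distinct - 1


def categorical_split_count(n_categories):
    """Ways to divide n_categories into a left and a right group with at
    least one category in each, ignoring which side is called 'left'."""
    return 2 ** (n_categories - 1) - 1


# Compare the two counts as the number of distinct values grows.
for n in (3, 5, 10, 100):
    print(n, numeric_split_count(n), categorical_split_count(n))
```

For 100 distinct values this gives 99 numeric splits against 2^99 - 1 categorical splits, matching the figures in the answer.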
Suppose we really do want X to be categorical. The function creates an array of size 2^k-by-k, where k is the number of potential split points (one less than the number of distinct X values). On a typical computer, you can create a single array of that size if k=20, but may run out of memory if k>20 (the exact limit varies from computer to computer). That does not necessarily mean k=20 will work, because other arrays are created as well and also use memory.
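To get a feel for how quickly a 2^k-by-k array grows, here is a rough size estimate, sketched in Python. It assumes 8 bytes per element (a double-precision value). The answer does not state the element type TREEFIT uses, so treat the numbers as an order-of-magnitude guide. The function name is hypothetical.

```python
def split_table_bytes(k, bytes_per_element=8):
    """Approximate size in bytes of a 2^k-by-k array.
    bytes_per_element=8 is an assumption (double precision)."""
    return (2 ** k) * k * bytes_per_element


# Each extra split point roughly doubles the memory needed.
for k in (10, 20, 25, 30):
    gib = split_table_bytes(k) / 2 ** 30
    print(f"k={k}: {gib:.3f} GiB")
```

Under that assumption, k=20 needs about 160 MiB for this one array, while k=25 already needs more than 6 GiB.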
If X is ordered, even if it is discrete, it is better not to define it as categorical. For example, if X takes the values:
1 = very low
2 = low
3 = medium
4 = high
5 = very high
then we probably do not want to create splits that lump "very low" and "high" together on one side, and the other three categories together on the other side. In cases like this, do not define X to be categorical; the splits will then respect the ordering of the categories.
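The difference can be made concrete by listing the candidate splits for the five levels above. The sketch below, in Python and independent of TREEFIT, enumerates the order-respecting splits and the unrestricted two-group splits. The helper names are invented for this illustration.

```python
from itertools import combinations

levels = ["very low", "low", "medium", "high", "very high"]


def ordered_splits(values):
    """Splits that cut the ordered list at one point (x <= v style)."""
    return [(tuple(values[:i]), tuple(values[i:])) for i in range(1, len(values))]


def categorical_splits(values):
    """All divisions into two non-empty groups. The first value is pinned
    to the left group so mirror-image splits are not counted twice."""
    first, rest = values[0], values[1:]
    splits = []
    for r in range(len(rest)):  # leave at least one value for the right group
        for extra in combinations(rest, r):
            left = (first,) + extra
            right = tuple(v for v in rest if v not in extra)
            splits.append((left, right))
    return splits


lumped = (("very low", "high"), ("low", "medium", "very high"))
print(len(ordered_splits(levels)), len(categorical_splits(levels)))
print(lumped in categorical_splits(levels), lumped in ordered_splits(levels))
```

Treating X as ordered leaves 4 candidate splits instead of 15, and the split that lumps "very low" with "high" is no longer among them.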
Another possible workaround for this issue is to redefine the problem as a series of binary classifications. For binary classification, TREEFIT will not try all possible splits of a categorical variable. Instead, the categories are ordered and then split like the values of a continuous variable.
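One common way to order categories for a two-class problem is to sort them by the proportion of observations in one of the classes and then consider only cut points along that ordering. The Python sketch below illustrates that idea. It is an assumption about the general technique, not a description of TREEFIT's internals, and the function name and data are hypothetical.

```python
from collections import defaultdict


def order_categories_by_positive_rate(categories, labels):
    """Sort categories by the fraction of observations with label 1.
    Ties are broken by category name so the result is deterministic."""
    counts = defaultdict(lambda: [0, 0])  # category -> [n_positive, n_total]
    for c, y in zip(categories, labels):
        counts[c][0] += int(y)
        counts[c][1] += 1
    return sorted(counts, key=lambda c: (counts[c][0] / counts[c][1], c))


# Made-up data: four categories and a 0/1 class label.
x = ["a", "b", "c", "a", "b", "c", "d", "d"]
y = [1, 0, 1, 1, 0, 0, 0, 1]

ordering = order_categories_by_positive_rate(x, y)
print(ordering)           # categories from lowest to highest class-1 rate
print(len(ordering) - 1)  # cut points to examine along this ordering
```

With k categories this leaves k - 1 cut points to examine rather than 2^(k-1) - 1 subsets.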
