Hi. I am running a regression tree ensemble and, despite specifying a categorical variable with the name-value pair 'CategoricalPredictors', the resulting 'Mdl' object looks as if it has not used the categorical variable.
Assuming that the default value of 'NumVariablesToSample' (one third of the predictors) might be the reason why the categorical feature is excluded, I set 'NumVariablesToSample' to 'all' to make sure that all variables are included in all trees.
T=1000; N=20;
X = randn(T,N);
b = zeros(N+1,1); b(2:6) = 2; b(7:11) = -3;
D = [repmat([1 2 3 4 5], [1 200])]';
dum = dummyvar(D);
theta = 1:5; theta = theta'/5^2;
Y = [ones(T,1) X]*b + dum*theta + randn(T,1)*0.1;
N = size([X D],2);
clear dum theta b
Ntrees = 500;
tr = templateTree('MinParentSize',250,'CategoricalPredictors',21,'NumVariablesToSample','all');
Mdl = fitrensemble([X D],Y,'Method','Bag','Learners',tr,'NumLearningCycles',Ntrees);
rsvm = false(N,Ntrees);
for i = 1:Ntrees
    idx = unique(Mdl.Trained{i}.CutPredictorIndex);
    idx(idx==0)=[];
    rsvm(idx,i) = 1;
end
mean(sum(rsvm))
>> ans =
4.6040
Question 1: Despite setting 'NumVariablesToSample' to 'all', when I extract the variables used in each tree (using Mdl.Trained{i}.CutPredictorIndex), on average only about 5 of the 21 features are included in each tree. I was expecting all 21 to be included in all trees. Why is this not the case?
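One related check that might help narrow this down: a tree cannot split on more distinct predictors than it has branch nodes, so comparing the two counts per tree may show whether tree size is the limiting factor. Below is a minimal sketch, assuming Mdl and Ntrees from the code above are still in the workspace; the names nSplits and nUsed are introduced here only for illustration.

```matlab
% Sketch: compare the number of splits per tree with the number of
% distinct predictors each tree splits on.
% Assumes Mdl and Ntrees from the code above are in the workspace.
nSplits = zeros(1,Ntrees);   % branch nodes (splits) in each tree
nUsed   = zeros(1,Ntrees);   % distinct predictors each tree splits on
for i = 1:Ntrees
    cpi = Mdl.Trained{i}.CutPredictorIndex;   % entries equal to 0 mark leaf nodes
    cpi = cpi(cpi > 0);                       % keep branch nodes only
    nSplits(i) = numel(cpi);
    nUsed(i)   = numel(unique(cpi));
end
[mean(nSplits) mean(nUsed)]
```

If the average number of splits is itself far below 21, then no setting of 'NumVariablesToSample' could make every tree use all 21 predictors.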
In the bagged ensemble above, I checked further and none of the individual trees picks the categorical variable (i.e. variable 21). When I fit a boosted ensemble instead, the algorithm still does not pick all the variables (it picks only about 6 variables per tree on average). However, the categorical variable (#21) is now included in a few of the individual trees.
tr = templateTree('MinParentSize',250,'CategoricalPredictors',21,'NumVariablesToSample','all');
Mdl = fitrensemble([X D],Y,'Method','LSBoost','Learners',tr,'NumLearningCycles',Ntrees);
rsvm = false(N,Ntrees);
for i = 1:Ntrees
    idx = unique(Mdl.Trained{i}.CutPredictorIndex);
    idx(idx==0)=[];
    rsvm(idx,i) = 1;
end
mean(sum(rsvm))
>> ans =
6.3160
find(rsvm(21,:))
>> ans =
Columns 1 through 18
18 31 39 50 60 67 81 151 179 181 195 204 269 298 317 319 337 394
Question 2: Although variable 21 is included in a number of trees, the property 'CategoricalPredictors' is always empty, both for the ensemble and for the individual trees. Can anyone explain why this is the case?
Mdl.CategoricalPredictors
>> ans = []
Mdl.Trained{18}.CategoricalPredictors
>> ans = []
Any insights are appreciated.