Does the fitrtree algorithm treat categorical and continuous variables differently?

Views: 10 (last 30 days)
ckara on 10 Feb 2021
Answered: Haris K. on 11 Feb 2021
When fitting regression trees (fitrtree) and regression tree ensembles (fitrensemble), categorical variables need to be specified via the name-value pair 'CategoricalPredictors'. Given that, is there any difference in how the algorithm treats the following two cases?
(Assume there is a variable D with 3 categories: 1, 2, 3.)
- Case 1: Flag variable D as categorical.
- Case 2: Break D into 2 dummy variables and include them along with the continuous X's (i.e. do NOT flag anything as categorical).

Answers (2)

Haris K. on 11 Feb 2021
Edited: Haris K. on 11 Feb 2021
You can see whether the two are equivalent using a simple example. The example below shows that Cases 1 & 2 are equivalent. NOTE however that when you have categorical variables mixed with continuous variables, the standard CART algorithm may cause problems in bagging (NOT in boosting; see note (*)) when the objective of the analysis is predictor importance. In such cases you need to switch to a different variable-splitting algorithm (e.g. change the 'PredictorSelection' name-value pair from 'allsplits' to 'interaction-curvature'); a sketch is given after the note at the end of this answer.
% Make some data
T = 100;
X = randn(T,1);                                       % continuous predictor
Y = randn(T,1);                                       % response
D = [2 1 1 3 3 1 3 2 1 1 repmat(3,1,60) ones(1,30)]'; % categorical variable (100x1)
dum = dummyvar(D);                                    % one dummy column per category
Here's case 1:
Mdl1 = fitrtree([X D],Y,'CategoricalPredictors',2);
resubLoss(Mdl1)
>> 0.4630
and case 2:
Mdl2 = fitrtree([X dum(:,1:2)],Y); % include only 2 of the total 3 dummies.
resubLoss(Mdl2)
>> 0.4630
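Beyond comparing resubstitution losses, you can also compare the fitted values of the two trees directly (a quick sanity check, using the variables defined above):
% Largest absolute difference between the two models' fitted values
max(abs(predict(Mdl1,[X D]) - predict(Mdl2,[X dum(:,1:2)])))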
Note (*):
Take a look at the 'When to specify' column of the predictor-selection techniques summary table for the standard CART algorithm. The table states that for boosted trees we can use standard CART. However, take this with a grain of salt: the fitrtree documentation, which is essentially also the basis of the boosting algorithm in fitrensemble, says you should avoid standard CART whenever you have 'predictors that have relatively fewer distinct values than other predictors'. So perhaps that caveat holds for both bagged and boosted trees. Possibly someone with more knowledge can add to the discussion.
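For completeness, here is a minimal sketch of the suggested fix for bagging (using the data above; the 'NumLearningCycles' value of 50 is an arbitrary choice, not from the original post):
% Bagged ensemble with an unbiased predictor-selection technique
t = templateTree('PredictorSelection','interaction-curvature');
ens = fitrensemble([X D],Y,'Method','Bag','Learners',t, ...
    'NumLearningCycles',50,'CategoricalPredictors',2);
imp = predictorImportance(ens) % predictor-importance estimates are less biased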
1 Comment
ckara on 11 Feb 2021
Thank you for your effort and the detailed response. I tried your example and indeed the two losses are the same when reproducibility is not a concern, but when setting rng('default'), the two in-sample MSEs are different.
rng('default')
% rest of the code
resubLoss(Mdl1)
>> 0.5242
resubLoss(Mdl2)
>> 0.5215
Why is this happening?



Haris K. on 11 Feb 2021
After your response, I tried to check with a more complicated example: assume 1000 observations, 20 continuous features, and 1 categorical variable with 5 categories.
% Make some data
T = 1000; N = 20;
X = randn(T,N);                               % continuous predictors
b = zeros(N+1,1); b(2:6) = 2; b(7:11) = -3;   % intercept and slopes
D = [2 1 1 3 4 5 3 2 1 4 5*ones(1,490) repmat([1 2 3 4 5],1,100)]'; % categorical variable (1000x1)
dum = dummyvar(D);
theta = (1:5)'/5^2;                           % category effects
Y = [ones(T,1) X]*b + dum*theta + randn(T,1)*0.1;
Check if the two cases you mentioned are equivalent:
%Case 1:
Mdl1 = fitrtree([X D],Y,'CategoricalPredictors',size(X,2)+1);
resubLoss(Mdl1)
>> 4.7955
%Case 2:
Mdl2 = fitrtree([X dum(:,1:end-1)],Y);
resubLoss(Mdl2)
>> 4.7465
So the answer eventually is no: the two are not the same. As to why this is so, someone else might contribute. Hope this helps.
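One possible explanation (an unverified guess): a predictor flagged as categorical can send any subset of its categories down one branch in a single split, whereas a 0/1 dummy column only permits one-category-versus-rest splits, so the two encodings expose different candidate splits to the tree. You can print both trees and compare their split rules:
% Print the split rules of both trees to see where they first diverge
view(Mdl1,'Mode','text') % splits on D appear as subsets of categories
view(Mdl2,'Mode','text') % splits on dummies are thresholds on 0/1 columns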
