Proper use of ClassificationTree.fit for categorical variables?

Question

the cyclist on 8 Nov 2013

https://kr.mathworks.com/matlabcentral/answers/105398-proper-use-of-classificationtree-fit-for-categorical-variables


The documentation for fitting classification trees states that X needs to be a floating point array, but also indicates that X can represent categorical variables (using the 'CategoricalPredictors' Name-Value argument).

Is the proper way to handle this to

(1) Take the categorical variables, e.g.

category1 = {'duck','duck','goose','squash','quartz'}';
category2 = {'animal','animal','animal','vegetable','mineral'}';

(2) Run those through grp2idx()

numcat1 = grp2idx(category1);
numcat2 = grp2idx(category2);

(3) Embed those in my X:

X = [numcat1 numcat2 otherTrulyNumericalVariables]

(4) Identify those as categorical

tree = ClassificationTree.fit(X,Y,'CategoricalPredictors',[1 2])

Seems like that's probably right, but I'd love an expert to vet that idea. The documentation doesn't have a categorical example.
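The four steps above can be sketched end to end as follows. This is a minimal illustration, not a vetted recipe: the numeric predictor and the response Y are made up, the category lists are repeated four times only so the tree has enough rows to split on, and fitctree is used as the newer name for ClassificationTree.fit (on older releases, substitute ClassificationTree.fit with the same arguments).

```matlab
% (1) Hypothetical toy data: the category lists come from the question,
%     the numeric predictor and the response Y are invented for illustration.
category1 = repmat({'duck';'duck';'goose';'squash';'quartz'}, 4, 1);
category2 = repmat({'animal';'animal';'animal';'vegetable';'mineral'}, 4, 1);
otherNumeric = (1:20)';
Y = repmat({'alive';'alive';'alive';'alive';'inert'}, 4, 1);

% (2) Map each category to an integer index and keep the level names,
%     which are needed later to encode new data the same way.
[numcat1, levels1] = grp2idx(category1);
[numcat2, levels2] = grp2idx(category2);

% (3) Assemble the floating-point predictor matrix.
X = [numcat1 numcat2 otherNumeric];

% (4) Declare columns 1 and 2 as categorical so the tree treats the
%     indices as unordered labels rather than magnitudes.
tree = fitctree(X, Y, 'CategoricalPredictors', [1 2]);

% Predict for the first training row (duck, animal).
label = predict(tree, X(1,:));
```

Keeping levels1 and levels2 around matters because the integer codes only have meaning relative to the level list they were built from.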


Answer 1

Ilya on 8 Nov 2013

https://kr.mathworks.com/matlabcentral/answers/105398-proper-use-of-classificationtree-fit-for-categorical-variables#answer_114600


Yes, this would be one way to accomplish this. You have to be careful when you convert new data to numeric for prediction. If the new data are missing a level (for example, 'goose' does not appear in the value set), grp2idx can return different indices for the same categorical values, because it numbers the levels it actually finds in its input. One way to avoid this pitfall is to use the nominal type and specify the level order explicitly, for example:

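% The third argument fixes the level order, so the integer codes do not
% depend on which values happen to appear in the data.
category1 = nominal({'duck','duck','goose','squash','quartz'},...
      [],{'goose','squash','quartz','duck'})
numcat1 = double(category1)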
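To make the pitfall concrete, here is a small sketch comparing the two encodings on a hypothetical new data set that lacks 'goose'. The variable names (train, new, levels) are invented for this example; the level order is the one used above.

```matlab
train = {'duck';'duck';'goose';'squash';'quartz'};
new   = {'duck';'squash';'quartz'};        % 'goose' does not appear

% grp2idx numbers levels in order of first appearance in each input,
% so the same category can get a different code in the two sets.
idxTrainG = grp2idx(train);   % 'squash' is coded 3 here
idxNewG   = grp2idx(new);     % 'squash' is coded 2 here

% With an explicit level list, the code is the position in that list,
% regardless of which values are present.
levels = {'goose','squash','quartz','duck'};
idxTrainN = double(nominal(train, [], levels));
idxNewN   = double(nominal(new,   [], levels));  % 'squash' is 2 in both
```

The same level list has to be reused for every data set passed to predict, so it is worth storing it alongside the fitted tree.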

Depending on how you get your data, you might find it easier to put all of it (numeric and categorical variables) into a table or, if you are not on R2013b yet, into a dataset object, and then extract the numeric and categorical variables from that object.
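A rough sketch of the table route, assuming R2013b or later. The column names (Name, Kind, Weight) and the numeric values are hypothetical; only the category lists come from this thread.

```matlab
% Hypothetical columns: two text variables and one numeric variable.
Name   = {'duck';'duck';'goose';'squash';'quartz'};
Kind   = {'animal';'animal';'animal';'vegetable';'mineral'};
Weight = [1.2; 1.1; 4.0; 0.9; 2.5];
T = table(Name, Kind, Weight);

% Convert the text columns to nominal with an explicit level order.
T.Name = nominal(T.Name, [], {'goose','squash','quartz','duck'});
T.Kind = nominal(T.Kind, [], {'animal','vegetable','mineral'});

% Extract a floating-point matrix: nominal columns become integer codes,
% numeric columns pass through unchanged.
X = [double(T.Name) double(T.Kind) T.Weight];
catCols = [1 2];   % pass as the 'CategoricalPredictors' value when fitting
```

Holding everything in one container keeps the text labels and the numeric columns aligned row by row until the moment the matrix is built.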

Comments: 1

the cyclist on 8 Nov 2013

Thanks! I didn't know about the nominal() command. I have typically solved the missing-level problem you describe by assigning the categories of the test data using the ismember() function against the unique categories of the training data (taken from the second output argument of grp2idx).
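The ismember() approach described in this comment might look something like the sketch below. The data and variable names are hypothetical, and mapping unseen categories to NaN is one possible choice, not something stated in the thread.

```matlab
train = {'duck';'duck';'goose';'squash';'quartz'};
new   = {'duck';'squash';'quartz';'emu'};   % no 'goose', plus an unseen 'emu'

% The second output of grp2idx lists the training levels in coded order.
[trainIdx, trainLevels] = grp2idx(train);

% Look up each new value in the training levels; the second output of
% ismember is the matching position, or 0 when there is no match.
[tf, newIdx] = ismember(new, trainLevels);

% Assumption: treat categories never seen in training as missing.
newIdx(~tf) = NaN;
```

Here 'squash' keeps the code 3 that it had in the training data, even though 'goose' is absent from the new set.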
