cvpartition grouped data for stratified hold-out classification?

Cuong Quang on 2 Sep 2018
Commented: Cuong Quang on 3 Sep 2018
Hi,
My ultimate objective is to build a machine learning classifier that takes student academic results and their school as input features and an alphabetic grade as the output.
I have tried using cvpartition to partition a 100 x N array into stratified 70% training and 30% hold-out testing for machine learning classification. The stratification is important because I want to maintain a similar class distribution for both the training and test sets as for the whole dataset.
cvpartition works when each row (sample) in my array is independent of all other rows (samples). For instance, suppose the data are 100 students, each picked at random from a different school. The N variables are their average academic performance across all the subjects they study and their final alphabetic grade (A+, A, A-, B+, B, B-, C, D or F). In this situation, cvpartition readily partitions my original data for hold-out testing, so that none of the test data is used in training.
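For reference, the independent-rows case described above can be sketched as follows. The data here are made-up stand-ins (three grade levels and four random features), not the real dataset; only cvpartition, training and test are actual Statistics and Machine Learning Toolbox functions.

```matlab
% Hypothetical stand-in data: 100 independent students, one grade each.
rng(0);                                    % reproducible partition
grades = categorical(randi(3, 100, 1), 1:3, {'A', 'B', 'C'});
X = randn(100, 4);                         % placeholder feature matrix

% Passing the class labels makes cvpartition stratify the hold-out split,
% so both sets keep class proportions similar to the full dataset.
c = cvpartition(grades, 'HoldOut', 0.3);

Xtrain = X(training(c), :);   ytrain = grades(training(c));
Xtest  = X(test(c), :);       ytest  = grades(test(c));
```

Because every row is its own school here, no school can appear on both sides of the split.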
However, I now have a 500 x (N+1) array in which the rows (samples) are grouped, the new variable being the student's school. In this new dataset, 5 students are picked at random from each of 100 different schools, and I want to train a learner that also uses the school to classify students. If I apply cvpartition to the rows directly, it will bias my results, because data from the same school (but for different students) can end up in both the training and the test set.
So, I want to create stratified 70% training and 30% hold-out test sets where none of the GROUPED data in testing was used for training, and I want to repeat this training/testing loop, say, 1000 times with random training and hold-out sets. In each loop, the schools that supply test data must not overlap with the schools that supply training data.
Is there a MATLAB command that allows us to easily cvpartition grouped data in the manner I require? For instance, does the command dividerand help in this case?
Thank you in advance.

Answers (1)

ahmed nebli on 2 Sep 2018
Edited: ahmed nebli on 2 Sep 2018
I think you should split your data into training and test sets manually, without using cvpartition, because, as you said, cvpartition can put data from the same school in both the training set and the test set.
To do so, create a function that splits your data, and save the training and test sets to a .mat file.
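A minimal sketch of such a manual split, assuming the school is stored as one ID per row. The helper name splitByGroup, the variable schoolId and the file name groupSplit.mat are all made up for illustration; this is not a built-in function. Note that it holds out whole schools but does not stratify by grade.

```matlab
% Script with a local function (needs R2016b or later).
% Hypothetical data: 100 schools, 5 students each.
schoolId = repelem((1:100)', 5);

[trainIdx, testIdx] = splitByGroup(schoolId, 0.3);
save('groupSplit.mat', 'trainIdx', 'testIdx');   % reuse the split later

function [trainIdx, testIdx] = splitByGroup(groupId, testFrac)
% Hold out a random fraction of GROUPS (not rows), so that no group
% contributes rows to both the training and the test set.
    [groups, ~, rowGroup] = unique(groupId);   % rowGroup: row -> group index
    nGroups = numel(groups);
    nTest   = round(testFrac * nGroups);       % number of held-out groups
    order   = randperm(nGroups);               % random ordering of groups
    isTestGroup = false(nGroups, 1);
    isTestGroup(order(1:nTest)) = true;
    testIdx  = isTestGroup(rowGroup);          % logical mask over rows
    trainIdx = ~testIdx;
end
```

The masks can then be applied to the feature array and the labels, for example X(trainIdx, :) and X(testIdx, :).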
  1 Comment
Cuong Quang on 3 Sep 2018
Hi Ahmed,
Yes as you say there are more manual methods available.
What I'm wondering is whether there is the equivalent of a single cvpartition command for grouped data?
Thanks, Cuong
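One way to get close to a single cvpartition call for the grouped case in the question is to partition the schools rather than the students, and then expand the school-level split back to rows. The sketch below is only an approximation under stated assumptions: each school is summarised by its most common grade (an assumed choice, so stratification is at the school level and only roughly preserves the student-level class distribution), and the data, variable names and grade codes are invented. cvpartition, test and repartition are real toolbox functions.

```matlab
% Hypothetical data: 100 schools x 5 students, grades coded 1..3.
rng(1);
nSchools  = 100;
perSchool = 5;
schoolId  = repelem((1:nSchools)', perSchool);      % 500-by-1 school IDs
gradeCode = randi(3, nSchools * perSchool, 1);      % stand-in for letter grades

% One label per school: its most common grade (assumed summary statistic).
[~, ~, schoolIdx] = unique(schoolId);               % row -> school index
schoolLabel = accumarray(schoolIdx, gradeCode, [], @mode);

% Stratified 70/30 hold-out over SCHOOLS, not over students.
c = cvpartition(schoolLabel, 'HoldOut', 0.3);

nReps = 10;                                         % the question asks for 1000
for r = 1:nReps
    testSchools = test(c);                          % logical, one entry per school
    testRows    = testSchools(schoolIdx);           % expand to the 500 rows
    trainRows   = ~testRows;

    % Sanity check: no school appears on both sides of the split.
    assert(isempty(intersect(schoolId(trainRows), schoolId(testRows))));

    % ... train on trainRows and evaluate on testRows here ...

    c = repartition(c);                             % new random split, same settings
end
```

If the grades within a school vary a lot, the per-school label is a coarse summary, so it is worth comparing the grade distribution of the training and test rows in each repetition.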
