manually subsetting data for training and testing purposes
I have a dataset containing locations, rows for each observation (month) and various climate data in the columns. Something that looks like this:
site_number month datacol1 datacol2 datacol3 etc...
1 Jan data1 data2 data3 ....
1 Feb data1 data2 data3 ...
....
1 Dec data1 data2 data3
...
2 Jan data1 data2 data3 ....
2 Feb data1 data2 data3 ....
.....
2 Dec data1 data2 data3
....
....
etc...
I want to create training and testing data from this dataset, and I want each dataset to be made up of whole blocks of rows grouped by site number (each site has 12 rows, one observation per month).
To put this into context, I have 28 sites altogether. I want to cross-validate using a testing dataset containing 3 sites (36 rows in total) and a training dataset containing the other 25 sites, with the rows of each site kept together.
Can anybody advise on how I can do this please?
Comments: 2
Chad Greene
7 Oct 2017
Hi Roisin,
I'm not sure what you mean by training and testing data, or what you mean by cross validating. These seem like very general terms, but it sounds like they mean something specific to you. (This likely reflects my own ignorance of the methods you are trying to implement.)
If you want to create some array x containing all the data1 values corresponding to site 2, you could do so with
x = datacol1(site_number==2);
Does that answer your question?
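The same logical indexing can be extended to hold out whole sites. The following is only a minimal sketch, assuming the data sits in a table T with a site_number column; the toy data, the name T, and the three chosen site numbers are hypothetical stand-ins, not values from the question.

```matlab
% Toy stand-in for the real data (hypothetical values): 28 sites x 12 months
site_number = repelem((1:28)', 12);
datacol1 = rand(numel(site_number), 1);
T = table(site_number, datacol1);

% Hypothetical choice of 3 sites to hold out for testing
testSites = [3 11 27];

% true for every row that belongs to one of the held-out sites
isTest = ismember(T.site_number, testSites);

testData = T(isTest, :);       % 3 sites x 12 months = 36 rows
trainingData = T(~isTest, :);  % remaining 25 sites = 300 rows
```

Because the split is made on site_number rather than on individual rows, all 12 months of a site always land on the same side.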
Answers (2)
Kian Azami
8 Oct 2017
To make training and test datasets, you can use the following commands:
% First, create a cross-validation partition of your data.
% y is a vector containing the category of each observation.
% 'HoldOut' is an optional property that splits the data into a training and a test set.
% p is the fraction of the data used to form the test set (0.3 is an example value).
p = 0.3;
c = cvpartition(y,'HoldOut',p);
% Now you can get the logical indices of your training and test sets
trainingIdx = training(c);
testIdx = test(c);
% Now you can extract your training and test data
trainingData = Your_Data(trainingIdx,:);
testData = Your_Data(testIdx,:);
% Then you can learn from your training data.
% This assumes Your_Data is a table whose response variable is named y.
Trained_Model = fitcknn(trainingData,'y');
% You can predict the test data with the Trained_Model you defined
Pre_Test = predict(Trained_Model,testData);
% Finally, you can calculate the error of your model on this test data.
% With table input, loss also needs the name of the response variable.
testErr = loss(Trained_Model,testData,'y')
Comments: 9
Kian Azami
8 Oct 2017
Roisin, I think you are mixing up two different things, and I do not quite follow what you want.
You want to divide your data into test data and training data. I mean, if you have 100 observations, you keep 30 for testing and 70 for training. But on the other hand, you want to keep all the observations together! Maybe you have something else in mind, but you have not explained it clearly.
Maybe you need to create another, larger category (another Excel column) and say that the first 10 sites are group1, the second 10 sites are group2, and so on. Then you do the learning process on those larger groups.
Sorry, but these were the things I had at hand to give you some ideas and suggestions ;)
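One way to read the grouping suggestion above is to partition the list of sites instead of the individual rows, and then map each fold back to rows. This is a sketch under assumptions, not a tested recipe for the original data: the table T, the toy values, and the choice of 9 folds are hypothetical. With 28 sites, 9 folds give roughly three sites per test set, since 28 is not divisible by 3.

```matlab
% Toy stand-in for the real data (hypothetical values): 28 sites x 12 months
site_number = repelem((1:28)', 12);
datacol1 = rand(numel(site_number), 1);
T = table(site_number, datacol1);

sites = unique(T.site_number);   % one entry per site

rng(0);                          % fixed seed so the split is reproducible
% Partition the SITES, not the rows
c = cvpartition(numel(sites), 'KFold', 9);

for k = 1:c.NumTestSets
    testSites = sites(test(c, k));                 % sites held out in fold k
    isTest = ismember(T.site_number, testSites);   % all rows of those sites

    trainingData = T(~isTest, :);
    testData = T(isTest, :);

    % Fit a model on trainingData and evaluate it on testData here
end
```

Each site appears in exactly one test fold, so the 12 monthly rows of a site are never split between training and testing.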
Drew
13 Apr 2024
In R2023b, "custom" partition functionality was added to cvpartition. It can be used to create a cvpartition that keeps the rows of each site together. See the "Version History" section of the cvpartition doc page.
R2023b: Create custom cross-validation partitions
The cvpartition function supports the creation of custom cross-validation partitions. Use the CustomPartition name-value argument to specify the test set observations. For example, cvpartition("CustomPartition",testSets) specifies to partition the data based on the test sets in testSets. The IsCustom property of the resulting cvpartition object is set to 1 (true).
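As a rough sketch of how this could apply to the site data in the question (assuming R2023b or later, and assuming a positive integer vector in testSets assigns each observation to a test fold, as described on the cvpartition doc page), every row of a site can be given the same fold number. The toy site_number vector and the three-sites-per-fold rule are hypothetical choices for illustration.

```matlab
% Toy stand-in for the site column (hypothetical): 28 sites x 12 months
site_number = repelem((1:28)', 12);

sites = unique(site_number);
rng(0);                                    % fixed seed for a reproducible shuffle
shuffled = sites(randperm(numel(sites)));  % random order of the sites

% Position of each row's site in the shuffled order (1..28)
[~, pos] = ismember(site_number, shuffled);

% Three consecutive sites share a fold: folds 1..9 hold 3 sites each,
% fold 10 holds the one leftover site
foldId = ceil(pos / 3);

% Requires R2023b or later
c = cvpartition("CustomPartition", foldId);

testIdx = test(c, 1);       % rows of the 3 sites in the first test set
trainIdx = training(c, 1);  % rows of all other sites
```

The resulting object can then be used like any other k-fold partition, for example by looping over 1:c.NumTestSets.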