Split dataset into three different size sets without overlapping

david jones on 3 Sep 2016
Answered: Frank B. on 8 May 2018
I am working on image processing in MATLAB. I need to split a large dataset into three non-overlapping subsets (25%, 25% and 50%). The dataset (say 1,000 images) has 10 classes of 100 images each. For each class, 25% of the images should go into the training set, another 25% into the validation set, and the remaining 50% into the test set. There must be no repetition: if an image from a class has been stored in one subset, it must not appear in either of the other two. How do I do that in MATLAB?
My code is as follows:
load ('data.mat')
% First pass: even rows go to the training set, odd rows to the remainder
for i = 1:size(data, 1)
    for j = 1:78
        if mod(i,2)==0
            trainingset(i/2,j) = data(i,j);
        else
            remainset((i-1)/2+1,j) = data(i,j);
        end
    end
end
% Second pass: split the remainder the same way into test and validation sets
for i = 1:size(remainset, 1)
    for j = 1:78
        if mod(i,2)==0
            testset(i/2,j) = remainset(i,j);
        else
            validationset((i-1)/2+1,j) = remainset(i,j);
        end
    end
end
Although this more or less works, I am looking for a better approach because some of the data is lost.
  Comments: 2
david jones on 3 Sep 2016
I need to split the data into three subsets, but 'datasample' only calculates the indices for one subset. If I call it again to get the indices of the other subsets, I am likely to get duplicate indices in different subsets. I could use randperm, but the same issue exists. Each of the three subsets must contain a fixed percentage of every class of data. A simple sampling method such as
1:250,1:250,1:500
does not work, because each subset would then contain members of only some of the classes. Example: subset 1 should have 25% of class 1, 25% of class 2, 25% of class 3, ..., 25% of class n. Subset 2 should have 25% of class 1, 25% of class 2, 25% of class 3, ..., 25% of class n. Subset 3 should have 50% of class 1, 50% of class 2, 50% of class 3, ..., 50% of class n.
The pairwise intersections of subset 1, subset 2 and subset 3 must be empty, and the union of the subsets must cover the whole dataset.
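
One way to avoid the duplicate-index problem described above is to draw a single random permutation and cut it into consecutive slices: every index appears in the permutation exactly once, so the slices cannot overlap. The sketch below is only an illustration, assuming 1000 samples and a 25/25/50 split; the names n, perm, idx1, idx2 and idx3 are not from the original post, and the split is not yet balanced per class.

```matlab
n = 1000;                          % assumed number of samples
perm = randperm(n);                % each index 1..n appears exactly once

n1 = round(0.25 * n);              % size of subset 1
n2 = round(0.25 * n);              % size of subset 2

idx1 = perm(1:n1);                 % indices for subset 1
idx2 = perm(n1+1:n1+n2);           % indices for subset 2
idx3 = perm(n1+n2+1:end);          % subset 3 takes whatever is left

% Check the two requirements: empty pairwise intersections, full coverage
noOverlap = isempty(intersect(idx1, idx2)) && ...
            isempty(intersect(idx1, idx3)) && ...
            isempty(intersect(idx2, idx3));
coversAll = isequal(sort([idx1 idx2 idx3]), 1:n);
```

Both noOverlap and coversAll should come out true regardless of the random draw, since the three slices partition the same permutation.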


Answers (1)

Frank B. on 8 May 2018
Here is a quick answer using datasample, for a single vector named data. Loop over your classes, or reuse the returned indices if they have to be shared.
load ('data.mat')
% Declaring data division ratio
% 25% for training, 25% for validation, 50% for test
dataset_div=[0.25 0.25 0.5];
% Number of samples in each set (rounded to whole numbers,
% the test set takes whatever remains)
nb_train=round((dataset_div(1)/sum(dataset_div))*length(data));
nb_valid=round((dataset_div(2)/sum(dataset_div))*length(data));
nb_test=length(data)-nb_train-nb_valid;
% Splitting data into 3 non-overlapping vectors
% Training data
[data_train,idx_sample]=datasample(data,nb_train,'Replace',false);
% Removing used values
idx_left=1:length(data);
idx_left(idx_sample)=[];
val_left=data(idx_left);
% Validation data
[data_valid,idx_sample]=datasample(val_left,nb_valid,'Replace',false);
% Removing used values (idx_sample is relative to val_left, not to data)
idx_left=1:length(val_left);
idx_left(idx_sample)=[];
val_left=val_left(idx_left);
% Test data: everything that remains, in random order
data_test=datasample(val_left,length(val_left),'Replace',false);
Cheers
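
The "loop over your classes" suggestion could be sketched as follows. This is a minimal illustration rather than the answer's own code: it swaps datasample for randperm, and it assumes a column vector labels holding one class label per sample, which the original post does not show. The synthetic labels (10 classes of 100 samples) and the names ratios, idx_train, idx_valid and idx_test are hypothetical.

```matlab
% Hypothetical setup: 10 classes with 100 samples each, as in the question
labels = kron((1:10)', ones(100, 1));   % 1000x1 column of class labels
ratios = [0.25 0.25 0.5];               % training, validation, test

classes = unique(labels);
idx_train = [];
idx_valid = [];
idx_test  = [];

for c = 1:numel(classes)
    idx_c = find(labels == classes(c));       % samples of this class
    idx_c = idx_c(randperm(numel(idx_c)));    % shuffle them once

    n_train = round(ratios(1) * numel(idx_c));
    n_valid = round(ratios(2) * numel(idx_c));

    % Consecutive slices of one shuffled list cannot overlap
    idx_train = [idx_train; idx_c(1:n_train)];
    idx_valid = [idx_valid; idx_c(n_train+1:n_train+n_valid)];
    idx_test  = [idx_test;  idx_c(n_train+n_valid+1:end)];   % remainder
end
```

The three index vectors can then be used to select rows, for example data(idx_train, :), assuming each row of data is one sample. Giving the test set the remainder of each class keeps the union complete even when the class size is not divisible by four.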
