Partition of data based on percentages (for cross-validation)
Hi all,
I have a matrix A made of x rows and y columns.
I would like to take 80% of my matrix A based on the number of rows, and do so 5 times, so as to equally partition my data set. So A1 would be the first 80% of the rows (and all columns), etc.
I had a look at this article https://uk.mathworks.com/help/nnet/ug/divide-data-for-optimal-neural-network-training.html but I am not sure any of these functions does what I want.
Please could anyone help me?
Thanks a lot
Answers (4)
Walter Roberson
1 July 2017
If you are using a neural network, then you would configure net.divideFcn to dividerand() (the default) and set net.divideParam to the percentages you want.
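For the toolbox route, the configuration looks roughly like this (a minimal sketch, assuming the Neural Network Toolbox; `feedforwardnet` and the 80/10/10 split are just illustrative choices, not something from this thread):

```matlab
net = feedforwardnet(10);          % example network with 10 hidden neurons
net.divideFcn = 'dividerand';      % random data division (the default)
net.divideParam.trainRatio = 0.80;
net.divideParam.valRatio   = 0.10;
net.divideParam.testRatio  = 0.10;
```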
Otherwise, use
nrow = size(A,1);
ntrain = floor(nrow * 80/100);
train_ind = randperm(nrow, ntrain);
train_rows = A(train_ind, :);
10 Comments
Lulu Dulac
2 July 2017
Walter Roberson
3 July 2017
dividerand was introduced in R2008a http://www.mathworks.com/help/releases/R2008a/toolbox/nnet/index.html?/help/releases/R2008a/toolbox/nnet/dividerand.html#2784670 . If you are using a release older than that, then you need to tell us which release you are using.
Walter Roberson
3 July 2017
Edited: Walter Roberson
12 April 2019
"how do I make sure I take 80% five times so that my data is evenly cover?"
num_trials = 5;
nrow = size(A,1);
ntrain = floor(nrow * 80/100);
for K = 1 : num_trials
train_ind(K,:) = randperm(nrow, ntrain);
train_rows(:,:,K) = A(train_ind(K,:), :);
end
"I will also have to compare two matrixes later (A and B) and have to take the same 80% of each five times"
for K = 1 : num_trials
B_rows(:,:,K) = B(train_ind(K,:), :);
end
"But then how do I make sure I take 80% five times so that my data is evenly cover?"
If you mean that each particular row should appear in exactly 4 of the 5 trials, then:
nrow = size(A,1);
temp = mod( (1 : nrow) - 1, 5 ) + 1;
drop_from = temp( randperm(nrow) );
train_rows_cell = cell(1, 5);
for K = 1 : 5
train_rows_cell{K} = A( drop_from ~= K, :);
end
We had to switch to a cell array instead of a 3D numeric array because we now have the possibility that not all of the matrices will be the same size. If the number of rows in the matrix is not a multiple of 5, then some of the matrices will end up a different size.
For example, suppose we have 21 rows, and for each row we randomly pick a matrix number, 1 to 5, for it to be dropped from. If we pick completely at random, it could happen (for example) that all of the drops come from matrix #5, leaving you with 21 rows in each of the first four matrices and no rows in the fifth. So clearly we need to distribute the matrix to be dropped from evenly.

With 21 rows, we can take a random permutation of repmat(1:5, 1, 4) to evenly distribute the drops for 20 of the 21 rows, but the 21st row cannot be evenly distributed: if you drop it from matrix 1, then matrix 1 ends up with 16 rows while the rest end up with 17, and likewise for each of the other matrices. Therefore, if you want to enforce that each row appears in exactly four of the five matrices, you are forced into matrices of different sizes unless the number of rows happens to be a multiple of 5.
This is not a problem if you go for statistical coverage instead of precisely even coverage.
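As a quick sanity check on the drop-from scheme above, one can count how many of the five subsets each row lands in (a sketch using random stand-in data; with the constraint enforced, every row should appear in exactly 4 of the 5 subsets):

```matlab
A = rand(21, 3);                      % stand-in data: 21 rows, not a multiple of 5
nrow = size(A, 1);
temp = mod( (1:nrow) - 1, 5 ) + 1;    % spread drop targets 1..5 as evenly as possible
drop_from = temp( randperm(nrow) );   % shuffle which row is dropped from which subset
appearances = zeros(nrow, 1);
for K = 1 : 5
    appearances = appearances + ( drop_from(:) ~= K );
end
all(appearances == 4)                 % displays 1 (true): each row is in exactly 4 subsets
```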
@Walter Roberson
What about the 20% that is to be used for testing? How can I store those rows with their indices for testing, using the code shown below (which you gave in this post)?
thank you
num_trials = 5;
nrow = size(A,1);
ntrain = floor(nrow * 80/100);
for K = 1 : num_trials
train_ind(K,:) = randperm(nrow, ntrain);
train_rows(:,:,K) = A(train_ind(K,:), :);
end
Walter Roberson
12 April 2019
num_trials = 5;
nrow = size(A,1);
ncol = size(A,2);
ntrain = floor(nrow * 80/100);
ntest = nrow - ntrain;
train_ind = zeros(num_trials, ntrain);
test_ind = zeros(num_trials, ntest);
train_rows = zeros(ntrain, ncol, num_trials);
test_rows = zeros(ntest, ncol, num_trials);
for K = 1 : num_trials
rp = randperm(nrow);
train_ind(K,:) = rp(1:ntrain);
test_ind(K,:) = rp(ntrain+1:end);
train_rows(:,:,K) = A(train_ind(K,:), :);
test_rows(:,:,K) = A(test_ind(K,:), :);
end
MA-Winlab
16 April 2019
Mr @Roberson
Would you please advise on this:
I am currently using an SVM and would like to do k-fold cross-validation. In this case, would crossval help me do this? Will crossval split the data into folds?
Thank you
Walter Roberson
16 April 2019
Yes, I think you should be able to use crossval() for that purpose. You pass crossval all of the data, and it will repeatedly split the data and call the function to compute the goodness for that particular split; then it will split again, call the function again, and so on.
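For an SVM specifically, a minimal sketch might look like the following (assuming the Statistics and Machine Learning Toolbox; `X` as the predictor matrix and `y` as the labels are hypothetical names, not from this thread):

```matlab
% 5-fold cross-validated misclassification rate for an SVM via crossval.
predfun = @(Xtrain, ytrain, Xtest) predict(fitcsvm(Xtrain, ytrain), Xtest);
mcr = crossval('mcr', X, y, 'Predfun', predfun, 'KFold', 5);
```

An alternative is to cross-validate the fitted model object directly, e.g. `kfoldLoss(crossval(fitcsvm(X, y)))`.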
I want to use the crossvalind function with Holdout. I would appreciate it if you could check that I am doing it right:
M = 0.5;                   % M = percentage left out for testing
cvFolds = crossvalind('Holdout', FeatureLabSHUFFLE, M);
cp = classperf(FeatureLabSHUFFLE);
testIdx = (cvFolds == 1);  % get indices of test instances
trainIdx = ~testIdx;
Walter Roberson
3 May 2019
For Holdout, M is not the percentage to hold out: it is the fraction, a value between 0 and 1.
If you want the train index and test logical vectors you can get those directly from crossvalind:
[trainIdx, testIdx] = crossvalind('Holdout', FeatureLabSHUFFLE, M);
Yes @Walter Roberson, thank you for the note. M should be within the range [0, 1].
One more question:
With Holdout, we do not have a loop like the one with k-fold, so how do we do cross-validation?
I mean, with Holdout, we do not repeat the partitioning, training, and testing k times as with k-fold.
I tried it in a loop of k = 5 and kept checking cp.CorrectRate and cp.LastCorrectRate. They were the same, i.e. cp.CorrectRate is not rolling.
Please correct me if I am mistaken.
I am saying this because I read this:
Using this method within a loop is similar to using K-fold cross-validation one time outside the loop, except that nondisjointed subsets are assigned to each evaluation.
This note is from here.
Greg Heath
2 July 2017
Edited: Greg Heath
25 July 2017
1. NOTE: Contrary to most statistical regression subroutines, MATLAB Neural Network subroutines operate on COLUMN VECTORS!
2. For N O-dimensional "O"utput target vectors corresponding to N I-dimensional "I"nput vectors:
[ I N ] = size(input)
[ O N ] = size(target)
3. Correspondingly, the data in the MATLAB NN database is stored columnwise. See the results of the keyboard commands
help nndatabase
doc nndatabase
4. The MATLAB NN Toolbox DEFAULT data division ratio is 0.7/0.15/0.15 with
Ntst = floor(0.15*N)
Nval = Ntst
Ntrn = N - Nval -Ntst
5. Instead of TRYING to evenly divide the data for m-fold cross-validation, it is far easier to just use Ntrials designs with RANDOM data division AND RANDOM initial weights.
6. I gave up the nitpicking index considerations of worrying about the number of times each data point was in each of the trn/val/tst subsets. If you have concerns, just increase Ntrials!
7. Somewhere in several of my NEWSGROUP and/or ANSWERS posts, I did use nitpicking XVAL index considerations. Good luck if you want to find some. I would first search using XVAL.
Hope this helps.
Thank you for formally accepting my answer.
Greg
Lulu Dulac
2 July 2017
4 Comments
Greg Heath
2 July 2017
What version of MATLAB are you using?
Are you using the NN toolbox?
Greg
Lulu Dulac
2 July 2017
Greg Heath
2 July 2017
Edited: Greg Heath
2 July 2017
This should help:
                        HITS
SEARCH               NEWSGROUP   ANSWERS
CROSSVAL GREG            12         14
CROSSVAL                 49        114
CROSSVALIND GREG          7         12
CROSSVALIND              46         77
CVPARTITION GREG         11         14
CVPARTITION              40        106
Greg
Walter Roberson
3 July 2017
"I don't have NN toolbox"
Then that was the operative limitation, not the fact that you are not using R2017a.
In my Answer I posted code for row-wise random division without any toolboxes. I did use a syntax of randperm that did not become available until R2011a.
ranjana roy chowdhury
14 July 2019
I have a dataset of 339 x 5825. I want to initialize 4% of the dataset values with 0, excluding the entries that have -1 in them. Please help me.
2 Comments
Greg Heath
14 July 2019
Start a new file and provide more details.
Greg
ranjana roy chowdhury
15 July 2019
The dataset is the WS-Dream dataset, 339 x 5825. The entries have values between 0 and 0.1; a few entries are -1. I want to make 96% of this dataset 0, excluding the entries having -1 in them.
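A sketch of one way to do what is described, under the assumption that the goal is to zero a random 96% of the entries that are not -1 (the matrix name `D` and the random stand-in data are hypothetical):

```matlab
D = rand(339, 5825);                  % stand-in for the WS-Dream matrix
eligible = find(D ~= -1);             % linear indices of entries allowed to change
nzero = round(0.96 * numel(eligible));
picked = eligible( randperm(numel(eligible), nzero) );
D(picked) = 0;                        % zero a random 96% of the eligible entries
```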