Which type of cross-validation should I use when my data has 5 scans per sample, so that the same sample does not end up in both the training and test sets?
My data (150 samples) has 5 NIR scans per sample. I cannot average the 5 scans because some of them were removed as invalid. I am using Support Vector Machines from the Classification Learner app (Machine Learning and Deep Learning) in MATLAB R2020b.
I used tgspcread to read my NIR files into MATLAB, normalised by standard deviation, and used only the valid scans for my classification. Am I right to say that the scans are independent of each other, or will the remaining scans (out of the 5 per sample) still count as the same sample even after normalisation?
My second question is: which type of cross-validation would be ideal to avoid having the same sample in both the training set and the test set, given that the Classification Learner app offers only k-fold and holdout validation?
Drew, 3 January 2023
In near-infrared spectroscopy, it is common to average the spectra obtained from scans of the same sample. Even if some scans were "taken out as they were not valid", the remaining scans could still be averaged. To ignore NaNs when calculating the mean, use the "omitnan" flag for the mean function https://www.mathworks.com/help/matlab/ref/mean.html#bt5b82t-1-nanflag
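As a rough illustration of the averaging step, here is a minimal sketch. It assumes the spectra are stored one scan per row in a matrix X, that invalid scans have been set to NaN, and that a vector sampleID records which sample each scan belongs to. The names X, sampleID and Xmean, and the toy data, are hypothetical and not taken from the question.

```matlab
% Hypothetical toy data: 3 samples x 5 scans each, 4 wavelengths per spectrum.
rng(0);
nSamples = 3; nScans = 5; nWavelengths = 4;
sampleID = repelem((1:nSamples)', nScans);   % sample label for every scan (15x1)
X = rand(nSamples*nScans, nWavelengths);     % one spectrum per row
X([2 9], :) = NaN;                           % two scans marked as invalid

% Average the remaining valid scans of each sample, ignoring NaN rows.
ids = unique(sampleID);
Xmean = zeros(numel(ids), nWavelengths);
for k = 1:numel(ids)
    rows = (sampleID == ids(k));             % all scans of sample ids(k)
    Xmean(k, :) = mean(X(rows, :), 1, 'omitnan');
end
```

Xmean then has one averaged spectrum per sample. If the invalid scans were deleted rather than set to NaN, the same loop still works, because each sample is averaged over whatever rows remain for it.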
If you decide to keep multiple scans from the same sample, and you want to ensure that scans from the same sample are either all in train, all in validation, or all in test when using Classification Learner, then one strategy is to use a more recent release. Starting in R2021a, Classification Learner can load a separate test set, which makes it possible to do some of the data partitioning outside of the app. For example, you could create separate train and test sets outside of the app, and then use those in the Classification Learner app.
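One possible way to build such train and test sets outside the app is to partition the unique sample IDs rather than the individual scans, and then map that split back to the scans. The sketch below assumes a vector sampleID with one entry per scan. The variable names, the 20% holdout fraction and the placeholder data are all hypothetical choices for illustration.

```matlab
% Hypothetical setup: 150 samples x 5 scans, 10 wavelengths, 2 classes.
rng(1);                                        % reproducible split
sampleID = repelem((1:150)', 5);               % sample label for every scan
X = rand(numel(sampleID), 10);                 % placeholder spectra
y = mod(sampleID, 2);                          % placeholder class labels

% Partition the SAMPLES (not the scans) into train and test.
ids = unique(sampleID);
c = cvpartition(numel(ids), 'HoldOut', 0.2);   % assumed 20% of samples held out
trainIDs = ids(training(c));
testIDs  = ids(test(c));

% Map the sample-level split back to scan rows.
isTrain = ismember(sampleID, trainIDs);
isTest  = ismember(sampleID, testIDs);
Xtrain = X(isTrain, :);  ytrain = y(isTrain);
Xtest  = X(isTest, :);   ytest  = y(isTest);
```

Every scan of a given sample then lands on one side of the split only. Two caveats: this split is not stratified by class, and as far as I know the app's own k-fold validation on the training data still splits by scan, so the held-out test set is the estimate that is clean at the sample level.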