MATLAB Answers

What is the difference between test data and training data?

Asked by syikin md radi on 12 May 2015




2 Answers

Answer by Thomas Koelen on 12 May 2015

In a dataset, the training set is used to build up a model, while the test (or validation) set is used to validate the model that was built. Data points in the training set are excluded from the test (validation) set. Usually a dataset is divided into a training set and a validation set (some people use 'test set' instead) in each iteration, or divided into a training set, a validation set and a test set in each iteration.
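
For a concrete picture, a minimal hold-out split in MATLAB might look like the sketch below. It assumes the Statistics and Machine Learning Toolbox and uses the bundled fisheriris example data purely as a stand-in dataset; the 30% hold-out fraction is arbitrary.

% Minimal sketch: split a dataset into training rows and test rows.
load fisheriris                                 % meas: 150x4 features, species: 150x1 labels
c = cvpartition(size(meas,1), 'HoldOut', 0.3);  % reserve 30% of the rows for testing

Xtrain = meas(training(c), :);   % rows used to build the model
ytrain = species(training(c));
Xtest  = meas(test(c), :);       % rows held back to validate the model
ytest  = species(test(c));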



Answer by Walter Roberson on 12 May 2015

To expand on this a small bit:
You run calculations on the training set to determine various coefficients.
You can then use the testing set to check how well the predictions do on a wider set of data, and that gives you information about false positives and false negatives.
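As a rough illustration of that loop (not Walter's own code, just a sketch assuming the Statistics and Machine Learning Toolbox, the bundled fisheriris data, and a k-nearest-neighbour classifier as a stand-in model):

% Fit on the training rows only, then check the fit on rows the model never saw.
load fisheriris
c = cvpartition(species, 'HoldOut', 0.3);                   % stratified 70/30 split

mdl  = fitcknn(meas(training(c),:), species(training(c)));  % coefficients come from the training set alone
yhat = predict(mdl, meas(test(c),:));                       % predictions on the held-out test set

C   = confusionmat(species(test(c)), yhat)    % off-diagonal entries count the misclassifications
acc = mean(strcmp(yhat, species(test(c))))    % overall accuracy on unseen data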
You can use those accuracy figures to go back and re-train. You do not need to use the same division of training and test data each time: there is a common technique called "leave one out" where you deliberately drop one item at a time from the training set and re-calculate, in case that one item was an outlier preventing a good overall result.
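One common way to implement that resampling in MATLAB is a leave-one-out partition; the sketch below is only illustrative (same toolbox and example data as above, with kNN standing in for whatever model is actually being trained):

% Refit n times, each time leaving out a single observation and predicting it.
load fisheriris
n = size(meas, 1);
c = cvpartition(n, 'LeaveOut');       % n folds, each holding out one row

correct = false(n, 1);
for i = 1:n
    mdl        = fitcknn(meas(training(c,i),:), species(training(c,i)));  % fit without row i
    yhat       = predict(mdl, meas(test(c,i),:));                         % predict the left-out row
    correct(i) = strcmp(yhat, species(test(c,i)));
end
looAccuracy = mean(correct)           % leave-one-out estimate of accuracy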
There is a nasty problem in doing classification called "overtraining": the calculations might fit the data you have on hand extremely well but be useless for anything else. Dividing into training and testing reduces this risk: if the algorithm has not seen a bunch of data in its calculations, then it is not going to adjust itself to be exactly right for that data and bad for everything else. Using all of your data to train with is therefore not a good idea.
After the program has gone back and forth on training sets and validation sets and has decided on the best coefficients (the stage where the data was allowed to affect the algorithm), it is time to run it on the remaining data and produce a report. The remaining data might or might not have known classifications. If the classifications are known, then when the programmer looks at the report the programmer might decide it is time to change the program, or might not. The report is the kind of thing that gets written up in a paper: we did this and that, and with only a limited subset of the data used to train and test, we did this well on real data. Or perhaps you send it to the people designing the equipment and experiments so they can see what needs to be improved on their end. Eventually you publish the paper or write a report or the like, and other people read it and want to use your program too. But they are not going to do that if you have not established evidence that it is not over-training on the particular data you gave it, and seeing how well it did on data that was not used to design the details of the algorithm is that evidence.

1 Comment

I've just read your answer; can I ask for advice/help or ask a question? I've come across the sentence: "quality of prediction was estimated to be good if the difference between the training and test dataset was <5 and acceptable if it was <10%". My question is, how did the person choose this difference to be good or acceptable, respectively? Is that the difference one always takes, or is there a rule? A reference to relate to? Advice would be much appreciated. Isabel



