This question is closed. Reopen it to edit or add an answer.

What percentage of my target data should be 1, and what percentage should be 0?

Views: 1 (last 30 days)
jack nn on 29 Jun 2015
Closed: MATLAB Answer Bot on 20 Aug 2021
Hi everybody. I am a beginner and I want to use an SVM for classification of my data. Suppose the training data are as below, where X1 and X2 are the inputs of my network (X1 and X2 are features that we extracted) and Y is the output. My question: if I have 15,700 samples for training my network, how many of them should have the label 1 and how many should have the label 0 (my network has 2 classes)? Should there be any particular proportion between the labels of my classes? What percentage of my target data should be 1, and what percentage should be 0? If 800 of my labels are 1 and 14,900 are 0, will my network work right? Thanks.

Answers (1)

Martin Brown on 29 Jun 2015
It partially depends on whether the data / distributions are separable or overlapping.
Assuming the data is separable (it probably isn't), the numbers don't matter too much as long as you have exemplars (support vectors) which lie close to the margin boundary and hence determine the decision boundary. Generally the more data you train with the better as you'll have a richer pot of potential support vectors and the relative numbers don't matter.
If the data is not separable, the numbers should typically reflect the prior class probabilities, i.e. how the examples are drawn from the real world. You give an example where about 5% are class 1 and 95% are class 0. If this reflects the fact that class 1 examples are much rarer in real life than class 0, then this is appropriate. However, if the classes overlap heavily (given your choice of features), the classifier may just learn to say class 0 all the time, as that would be right 95% of the time, but it would not be predictive in any sense. So if you have imbalanced class distributions, as you seem to suggest, make sure that the features have enough discriminatory power to predict the rare class in some cases.
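The "right 95% of the time but not predictive" trap can be made concrete with the class counts from the question. This is a sketch in plain Python (not MATLAB) using the hypothetical counts 800 / 14,900 from the question; the degenerate "always predict the majority class" classifier is the assumption being tested:

```python
# Illustrative sketch of the accuracy trap with the question's class counts.
# Hypothetical numbers: 800 samples with label 1, 14,900 with label 0.
n_pos, n_neg = 800, 14900
n_total = n_pos + n_neg

# A degenerate classifier that always predicts the majority class (0)
# is correct on every negative and wrong on every positive.
accuracy = n_neg / n_total                 # ~0.949: looks impressive
recall_pos = 0 / n_pos                     # 0.0: it never finds the rare class
recall_neg = n_neg / n_neg                 # 1.0: trivially perfect on the majority
balanced_accuracy = (recall_pos + recall_neg) / 2   # 0.5: no better than chance

print(f"accuracy = {accuracy:.3f}, balanced accuracy = {balanced_accuracy:.2f}")
```

Raw accuracy rewards the degenerate classifier; a per-class metric such as balanced accuracy exposes it, which is why discriminatory features for the rare class matter.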
Comments: 3
Martin Brown on 1 Jul 2015
I don't fully understand your comment/question, but if you remove data according to the proportions in which the classes occur in the data set (their prior class probabilities, assuming the data has been collected in an unbiased way), then you're simply subsampling the data.
If you're deleting rows not in proportion to their prior probabilities, you'd be producing a biased classifier (strictly speaking an SVM doesn't produce an "easy" probabilistic classifier, but it is similar in some senses). By removing data, you'd be assigning a higher weighting to one type of error. This may be correct in some cases (medical diagnosis, fraud detection), but you should be prepared to justify these weightings. Something like
has a decent description of this.
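The reweighting alternative to deleting rows can be sketched numerically. This is a plain-Python illustration using the hypothetical counts from the question; the "balanced" weighting formula w_c = n_samples / (n_classes * n_c) is an assumption (it mirrors a common convention, e.g. scikit-learn's `class_weight='balanced'`), not something prescribed in the thread:

```python
# Sketch: instead of deleting rows of the majority class, reweight errors
# per class so mistakes on the rare class cost proportionally more.
# The "balanced" heuristic sets w_c = n_samples / (n_classes * n_c).
# Counts below are the hypothetical numbers from the question.
counts = {0: 14900, 1: 800}
n_samples = sum(counts.values())
n_classes = len(counts)

weights = {c: n_samples / (n_classes * n_c) for c, n_c in counts.items()}
print(weights)  # the rare class 1 gets roughly 18.6x the weight of class 0
```

This keeps all 15,700 samples (so no support-vector candidates are thrown away) while making the misclassification costs explicit, which is exactly the weighting you would then have to justify.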
jack nn on 4 Jul 2015
Thanks. I should think about your comment some more; I will come back soon.

