Selecting important features from a very large pool

Question

Pedro Silva 2013년 5월 3일

0
링크

이 질문에 대한 바로 가기 링크

https://kr.mathworks.com/matlabcentral/answers/74441-selecting-important-features-from-a-very-large-pool

답변: fatima qureshi 2016년 1월 14일

Hello everyone.

I've been struggling with a pedestrian detection algorithm and I was wondering if MatLab had a toolbox that could help me somehow.

I extract a very large number of features from images and then I want to classify them as pedestrian or non-pedestrian. What I need is a way to determine which of the features are actually worth extracting for the sake of speed.

Imagine I have a text file with all the data. In the first column I'd enter the class (1 or 2) , and in the next columns come the features.

So the file could look like

1,1000,2000,1500,3000 ...
1,1200,1000,1600,3000 ...
2,3000,4000,120,10000 ...
2,4900,7300,3000,100 ...

and so on.

I have 30.000 features (columns) and around 10.000 samples (rows) and I would like to know if MatLab has any toolbox or function that could return the indexes of the features that are actually relevant for the classification.

Training the classifier isn't a problem, the main issue is to determine which of this features are worth extracting.

Thanks in advance.

댓글 수: 3
이전 댓글 1개 표시이전 댓글 1개 숨기기

Pedro Silva 2013년 5월 3일

There are in fact some algorithms specially made for handling very high dimensional data, I was hoping MatLab had some toolbox prepared to work with it

Greg Heath 2013년 5월 6일

Divide and conquer. You are never going to find the best combination anyway. So just find one that works well.

댓글을 달려면 로그인하십시오.

이 질문에 답변하려면 로그인하십시오.

Answer 1

Image Analyst 2013년 5월 3일

0
링크

이 답변에 대한 바로 가기 링크

https://kr.mathworks.com/matlabcentral/answers/74441-selecting-important-features-from-a-very-large-pool#answer_84258

Well principal components analysis is the first thing that comes to my mind - well after thinking how can you have 30,000 features. Does 30.000 mean thirty thousand, or thirty? I can't see how you could possibly have 30,000 features unless you're having whole chunks of the image (groups of pixels) be features. But that doesn't make sense. Maybe for neural networks maybe, but not for feature analysis. I'm sure you can boil it down to a few dozen at most. Surely you must have some idea which of the 30 thousand features is most meaningful and which are useless.

댓글 수: 3
이전 댓글 1개 표시이전 댓글 1개 숨기기

Pedro Silva 2013년 5월 3일

편집: Pedro Silva 2013년 5월 3일

Well thats not like that at all.

There are algorithms that harvest large pools (yes, tens of thousands) of features and then there are ways to analyse the data in order to check which of them are relevant.

A dozen features are not nearly enough to classify a image window for pedestrian or non-pedestrian.

If you are curious I suggest you look at Viola and Jones ( http://www.merl.com/papers/docs/TR2004-043.pdf) work on a face detection algorithms, where they selected ~6000 features out of ~180000.

There is also the Integral Channel Features (<http://www.loni.ucla.edu/~ztu/publication/dollarBMVC09ChnFtrs_0.pdf>) algorithm for pedestrian detection, where 30.000 features are extracted from 128x64 sized windows and 2000 are selected. This is the one I am trying to implement, but I am stuck at the feature selection part.

Edit: These features are no more than rectangular sums of pixels over random channels of the image, over rectangles with random sizes and positions. This means you can harvest as many features as you want, but at a certain point they will be irrelevant. Thus the feature selection method.

Image Analyst 2013년 5월 3일

I've never used that many. How did Viola and Jones do it? The paper you cited says they used adaboost. Can't you do it the same way? I haven't used adaboost so I can't help you anymore.

Pedro Silva 2013년 5월 3일

편집: Pedro Silva 2013년 5월 3일

Yes, they used adaboost to select the features, but I've gone around it for too long and can't find any way to do it, or anywhere explaining it. So I'm starting to check other solutions for selecting relevant features out of a large pool, and only then use the adaboost to train and test.

댓글을 달려면 로그인하십시오.

Answer 2

Ilya 2013년 5월 6일

0
링크

이 답변에 대한 바로 가기 링크

https://kr.mathworks.com/matlabcentral/answers/74441-selecting-important-features-from-a-very-large-pool#answer_84549

You may find these two posts useful:

http://www.mathworks.com/matlabcentral/answers/48633-how-to-find-the-important-features-from-many-features-of-a-mffc-speech-sample

http://www.mathworks.com/matlabcentral/answers/33808-select-machine-learning-features-in-matlab

I don't really know what Viola and Jones did, but you could simply train AdaBoost (by calling fitensemble from Statistics Tlbx) and then execute the predictorImportance method for the trained ensemble.

댓글 수: 8
이전 댓글 6개 표시이전 댓글 6개 숨기기

Ilya 2013년 5월 7일

I am not an expert on image processing. What you say sounds reasonable enough to give it a try. When it comes to feature selection, questions like "is it possible to select features using this method" do not sound quite right. Sure, it is possible. But is it optimal? It's hard for me to say because I don't know what your constraints are (for example, who says you must use boosted trees?).

If you grow 2000 decision stumps, you will have non-zero importance for at most 2000 features out of your 15k features. If just a few features have very large importance values and the rest has small importance values, you can perhaps comfortably conclude that only these few features are needed. If you see a smooth distribution, how do you know it's ok to cut at 2000 (or any other number)? If you need to convince a statistician that you do not lose accuracy by reducing the feature pool, you need to compare the accuracy of your model for the best N selected features and the accuracy of your model for all features (by performing a formal test, for instance). But perhaps people in your field are not concerned with such issues at all.

Ilya 2013년 5월 7일

Each binary stump selects one best feature, based on the split criterion such as the Gini index. Some features can be selected many times and some features can be selected never. If you have only one informative feature and the rest 14999 features are noise, that one feature may be selected for all 2000 stumps and the rest may not be selected for any stumps.

As I wrote earlier, if you grow 2000 trees by fitensemble at the default settings, you will get at most 2000 features with non-zero importance. If you expect at least 2000 useful features, this means you expect that each stump select a feature different from all other stumps. What might this expectation be based on?

I agree - you should invest some time into understanding what AdaBoost and fitensemble do.

Pedro Silva 2013년 5월 8일

I went to watch some lectures on the matter and I understand the topic a little better.

My adaboost implementation (which is in opencv/c++) already chooses the best features out of the large pool I extract. It takes in the 15K features and creates 2000 decision stumps based on the best combination of features it finds. and, as you said, the same feature can participate in the decision any number of times.

What I really wanted to was to achieve the same classifier but with less features in order to speed up the algorithm, so I tried to find ways to eliminate redundant or irrelevant features using matlab tools, but with no success.

I figured out what my problem is. Right now I am creating only one boosted classifier to make all the decisions. So I need to feed this classifier 15K features every time I want it to make a decision. This is far from the implementation I was looking for.

What I need is a large number of boosted classifiers (a cascade) that would increase in complexity. This way I could stop the evaluation when one of those boosted classifiers gives a negative response.

But this kind of implementation is way over my programming skills.

댓글을 달려면 로그인하십시오.

Answer 3

Anand 2013년 5월 8일

0
링크

이 답변에 대한 바로 가기 링크

https://kr.mathworks.com/matlabcentral/answers/74441-selecting-important-features-from-a-very-large-pool#answer_84772

If you have the latest release of the Computer Vision System Toolbox, there's a way to train a classifier using the Viola-Jones approach, i.e. Haar-based features with an ada-boost learning framework. You might want to look at this:

http://www.mathworks.com/help/vision/ug/train-a-cascade-object-detector.html

댓글 수: 2
없음 표시없음 숨기기

Pedro Silva 2013년 5월 8일

A cascade classifier is exactly what I need. The problem is that my features are custom, and that implementation only allows for 3 types of features.

Anand 2013년 5월 9일

That's too bad...

댓글을 달려면 로그인하십시오.

Answer 4

fatima qureshi 2016년 1월 14일

0
링크

이 답변에 대한 바로 가기 링크

https://kr.mathworks.com/matlabcentral/answers/74441-selecting-important-features-from-a-very-large-pool#answer_206162

how adaboost decides which one is relevant and which is irrelevant feature????

댓글 수: 0
이전 댓글 -2개 표시이전 댓글 -2개 숨기기

댓글을 달려면 로그인하십시오.

Selecting important features from a very large pool

댓글 수: 3
이전 댓글 1개 표시이전 댓글 1개 숨기기

답변 (4개)

댓글 수: 3
이전 댓글 1개 표시이전 댓글 1개 숨기기

댓글 수: 8
이전 댓글 6개 표시이전 댓글 6개 숨기기

댓글 수: 2
없음 표시없음 숨기기

댓글 수: 0
이전 댓글 -2개 표시이전 댓글 -2개 숨기기

참고 항목

카테고리

태그

제품

Community Treasure Hunt

Selecting important features from a very large pool

댓글 수: 3 이전 댓글 1개 표시이전 댓글 1개 숨기기

답변 (4개)

댓글 수: 3 이전 댓글 1개 표시이전 댓글 1개 숨기기

댓글 수: 8 이전 댓글 6개 표시이전 댓글 6개 숨기기

댓글 수: 2 없음 표시없음 숨기기

댓글 수: 0 이전 댓글 -2개 표시이전 댓글 -2개 숨기기

참고 항목

카테고리

태그

제품

Community Treasure Hunt

댓글 수: 3
이전 댓글 1개 표시이전 댓글 1개 숨기기

댓글 수: 3
이전 댓글 1개 표시이전 댓글 1개 숨기기

댓글 수: 8
이전 댓글 6개 표시이전 댓글 6개 숨기기

댓글 수: 2
없음 표시없음 숨기기

댓글 수: 0
이전 댓글 -2개 표시이전 댓글 -2개 숨기기