Does the pca function restrict the number of components to be kept?

8 views (last 30 days)
elid latf on 25 Jun 2018
Commented: elid latf on 26 Jun 2018
Hi there,
I'm using the pca function to reduce the number of variables in a huge dataset. It works well, but when I try to change the number of components to keep, I can't go beyond the number of observations.
To put it differently, say my dataset has dimensions 500x24300 and I want to reduce it to 500x16100. However, the function only works for at most 499 components (the number of observations minus one) and gives an error otherwise.
I'm using MATLAB R2016a.
Does anyone have an idea? Thanks for your help.
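For reference, here is a minimal sketch of what I am seeing, using a smaller random matrix in place of my real data (requires the Statistics and Machine Learning Toolbox; the 50x300 size is made up just for illustration):
% N observations in D dimensions, with N < D (mirrors the 500x24300 case)
N = 50; D = 300;
X = randn(N,D);
% With the default (economy-size) output, pca returns at most min(N-1,D) components
[coeff,score,latent] = pca(X);
size(coeff)                          % D-by-(N-1), i.e. 300x49
% Requesting up to min(N-1,D) components works
coeff49 = pca(X,'NumComponents',N-1);
% Requesting more than that is where I get the error, e.g.:
% coeff100 = pca(X,'NumComponents',100);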
2 Comments
John D'Errico on 26 Jun 2018
Please get used to using comments, instead of adding answers for every response you make.
elid latf on 26 Jun 2018
Yes, you're right. I hadn't realized that until after posting. Thanks.


Accepted Answer

Anton Semechko on 25 Jun 2018
The eigenvectors computed by PCA (and by its generalized version, probabilistic PCA) only span the subspace of the ambient space that contains the sample data, and can therefore be written as linear combinations of the sample data points. If N and D are the number of samples and the dimensionality of the data, respectively, then min(N-1,D) is the maximum number of principal components (PCs) you will be able to extract. The number of PCs will be even smaller if the data points are linearly dependent.
In principle, you can always find the complement of the PCA subspace (i.e., the set of eigenvectors orthogonal to the PCs), but this is very rarely done in practice, especially when dealing with high-dimensional spaces like yours (i.e., 24300 dimensions).
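As a rough sketch of that complement idea (using a small made-up example; doing this with 24300 dimensions would be wasteful):
% Made-up data: 20 observations in 100 dimensions
N = 20; D = 100;
X = randn(N,D);
coeff = pca(X);            % D-by-(N-1) matrix of orthonormal PC directions
comp  = null(coeff');      % orthonormal basis for the orthogonal complement
size(comp)                 % D-by-(D-(N-1)), i.e. 100x81
% [coeff comp] is a complete orthonormal basis of the D-dimensional ambient space,
% but the complement directions capture none of the variance in the sample data.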
5 Comments
Anton Semechko on 26 Jun 2018
That min(N-1,D) is the maximum number of PCs that can be extracted from an N-by-D data matrix is a "theoretical" limitation. It is simply not possible to extract more than min(N-1,D) PCs that contain ANY information whatsoever about your data.
Note that even though the maximum number of PCs may be much smaller than the dimensionality of the data, taken together they represent the original data with 100% accuracy. However, real data often contains noise, and the information carried by the "higher-order" PCs will be increasingly dominated by noise. When performing dimensionality reduction, which is what I assume you want to do, you:
1) Select the first K < min(N-1,D) PCs that retain as much of the underlying structure of the data as possible; the remaining min(N-1,D)-K PCs will be dominated by noise.
2) Project the observed data (after centering it on the mean) onto the K retained PCs to get the so-called "feature vectors" (or scores in the statistics literature). These K-dimensional feature vectors are low-dimensional representations of your data; see the sketch after the references below.
Various methods have been developed to determine the optimal value of K (e.g., Horn's rule, cross-validation), but none of them work 100% of the time, because real data rarely meets the underlying assumptions of the PCA model (see [1] and [2] for details).
[1] Roweis, 1998, EM algorithms for PCA and SPCA
[2] Tipping & Bishop, 1999, Probabilistic principal component analysis
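Here is a minimal sketch of steps 1) and 2) above (the 60x200 matrix and K=5 are arbitrary choices, just to show the mechanics):
% Made-up data: 60 observations in 200 dimensions
N = 60; D = 200;
X = randn(N,D);
[coeff,score] = pca(X);                    % coeff: D-by-(N-1), score: N-by-(N-1)
K  = 5;                                    % step 1: number of PCs to retain
mu = mean(X,1);
Xc = bsxfun(@minus,X,mu);                  % center the data on the mean
F  = Xc*coeff(:,1:K);                      % step 2: K-dimensional feature vectors (scores)
norm(F - score(:,1:K),'fro')               % ~0; matches the scores returned by pca
Xhat = bsxfun(@plus,F*coeff(:,1:K)',mu);   % approximate reconstruction from the K retained PCs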
elid latf on 26 Jun 2018
Again, thank you all so much for the time you've given to my question. It was so interesting.


More Answers (0)
