How to apply PCA correctly?

Question

Sepp, 12 December 2015

https://kr.mathworks.com/matlabcentral/answers/259957-how-to-apply-pca-correctly

Commented: the cyclist, 25 June 2024

Hello

I'm currently struggling with PCA and MATLAB. Let's say we have a data matrix X and a response y (classification task). X consists of 12 rows and 4 columns. The rows are the data points, the columns are the predictors (features).

Now, I can do PCA with the following command:

[coeff, score] = pca(X);

As I understood from the MATLAB documentation, coeff contains the loadings and score contains the principal components in its columns. That means the first column of score contains the first principal component (associated with the highest variance) and the first column of coeff contains the loadings for the first principal component.

Is this correct?

But if this is correct, why then is X * coeff not equal to score?


DrJ, 11 December 2019

Edited: DrJ, 11 December 2019

@Sepp

Your doubt can be clarified by this tutorial (even though it is in the context of another program), especially after the 5-minute mark: https://www.youtube.com/watch?v=eJ08Gdl5LH0

@the cyclist

A fabulous and generous explanation.


Answer 1 (accepted answer)

the cyclist, 12 December 2015

https://kr.mathworks.com/matlabcentral/answers/259957-how-to-apply-pca-correctly#answer_202987

Edited: the cyclist, 25 June 2024


==============================================================================

EDIT: I recommend looking at my answer to this other question for a more detailed discussion of topics mentioned here.

==============================================================================

Maybe this script will help.

rng 'default'
M = 7; % Number of observations
N = 5; % Number of variables observed
X = rand(M,N);
% De-mean
X = bsxfun(@minus,X,mean(X));
% Do the PCA
[coeff,score,latent] = pca(X);
% Calculate eigenvalues and eigenvectors of the covariance matrix
covarianceMatrix = cov(X);
[V,D] = eig(covarianceMatrix);
% "coeff" are the principal component vectors.
% These are the eigenvectors of the covariance matrix.
% Compare the columns of coeff and V.
% (Note that the columns are not necessarily in the same *order*,
%  and they might be slightly different from each other
%  due to floating-point error.)
coeff
V
% Multiply the original data by the principal component vectors
% to get the projections of the original data on the
% principal component vector space. This is also the output "score".
% Compare ...
dataInPrincipalComponentSpace = X*coeff
score
% The columns of X*coeff are orthogonal to each other. This is shown with ...
corrcoef(dataInPrincipalComponentSpace)
% The variances of these vectors are the eigenvalues of the covariance matrix, and are also the output "latent". Compare
% these three outputs
var(dataInPrincipalComponentSpace)'
latent
sort(diag(D),'descend')
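
The ordering caveat in the script's comments can be made concrete. The following is only a sketch, assuming real-valued data with distinct eigenvalues: it sorts the output of eig by descending eigenvalue and flips the sign of each eigenvector to match coeff, since an eigenvector is only defined up to sign. The names Xc, order, signs, maxCoeffDiff and maxLatentDiff are made up for the illustration.

```matlab
rng('default');
X  = rand(7,5);
Xc = bsxfun(@minus, X, mean(X));           % centre each column

[coeff, ~, latent] = pca(Xc);

% eig makes no promise about descending order, so sort explicitly
[V, D] = eig(cov(Xc));
[eigvals, order] = sort(diag(D), 'descend');
V = V(:, order);

% Each eigenvector is only defined up to sign; flip columns to match coeff
signs = sign(sum(V .* coeff, 1));          % +1 or -1 for each column
V = bsxfun(@times, V, signs);

maxCoeffDiff  = max(abs(V(:) - coeff(:)))  % expected to be near machine precision
maxLatentDiff = max(abs(eigvals - latent)) % expected to be near machine precision
```

After the sort and the sign flip, the two sets of vectors should agree to within floating-point error.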


evelyn, 21 June 2024

Moved: the cyclist, 25 June 2024

I appreciate the explanation.

Has anyone tried this when the data are complex-valued?

When I generate complex-valued data, the result is quite different between using the pca function directly and using the eigenvectors of the covariance matrix. Could anyone help me? The code is shown below.

M = 13; % Number of observations
N = 4; % Number of variables observed
PCA_component_Num = 4;
data_matrix = rand(M,N) + 1i * randn(M,N);
X = data_matrix;
% De-mean
centredX = bsxfun(@minus,X,mean(X))
centredX = 
   0.4454 + 0.8543i   0.4902 - 0.1472i   0.4357 - 1.3816i   0.2678 - 2.3164i
  -0.4331 + 0.5001i  -0.2441 + 2.0524i   0.4457 + 1.2104i  -0.4510 + 0.0638i
  -0.3329 + 0.1551i  -0.1774 - 1.1576i  -0.1812 - 1.3566i   0.1672 - 0.9104i
  -0.2118 + 0.1088i   0.1244 - 1.7458i   0.1688 - 0.8478i   0.2788 + 0.8375i
  -0.3907 - 2.8987i  -0.2099 - 0.6153i  -0.2118 + 0.8387i  -0.4950 + 0.0564i
   0.4255 + 0.4438i   0.5516 + 0.2240i  -0.1030 + 0.2717i   0.2007 + 0.4471i
   0.1752 + 0.9419i  -0.2258 + 0.7153i  -0.2437 + 0.6895i   0.2486 - 0.1282i
  -0.2496 + 0.0594i   0.1494 + 0.4663i  -0.3012 - 0.2444i   0.2748 + 0.0877i
  -0.1979 + 0.7959i   0.1506 - 0.6308i   0.1326 + 0.2832i  -0.3350 - 0.3773i
   0.0798 - 0.9431i  -0.3518 - 0.1190i  -0.1554 - 0.6269i   0.1827 + 0.5595i
   0.3772 - 0.3679i  -0.0558 + 0.2619i   0.3242 - 0.4218i  -0.1654 + 0.9524i
   0.2377 + 1.0162i   0.1443 - 0.2092i  -0.0725 + 1.0026i  -0.2486 + 1.1169i
   0.0752 - 0.6657i  -0.3458 + 0.9049i  -0.2382 + 0.5830i   0.0743 - 0.3890i
%% PCA using Matlab function directly
[coeff,score,latent] = pca(centredX);
% latent is eigenvalues of the covariance matrix of X.
% "coeff" are the principal component vectors.
dataInPrincipalComponentSpace = centredX * coeff;
% Multiply the original data by the principal component vectors
% to get the projections of the original data on the
% principal component vector space. This is also the output "score".
% Compare ...
dataInPrincipalComponentSpace
dataInPrincipalComponentSpace = 
  -0.0812 + 2.1154i  -0.4314 + 1.4340i  -0.8775 - 1.0683i   0.3017 - 0.0138i
  -1.5047 - 1.6008i   0.9383 + 0.5912i  -0.3305 - 0.4482i   0.3796 - 0.2641i
   0.9170 + 1.1712i  -1.2558 + 0.5356i  -0.3824 + 0.0542i   0.0834 - 0.0594i
   0.5283 + 0.9448i  -1.1058 - 0.7001i   0.4632 + 1.2408i   0.0014 + 0.0869i
   1.6497 - 1.1864i  -0.0012 - 2.0990i  -0.6828 - 0.7879i   0.1430 + 0.5765i
  -0.5577 + 0.1855i   0.5324 - 0.0464i   0.3590 + 0.2608i  -0.4757 + 0.0544i
  -0.7034 - 0.4397i   0.4606 + 0.9445i   0.4283 - 0.1126i  -0.1720 + 0.2747i
  -0.2168 - 0.1500i  -0.1541 + 0.1499i  -0.0096 - 0.1599i  -0.4818 - 0.4071i
  -0.1823 + 0.3246i  -0.1215 + 0.5045i  -0.3390 + 0.5827i   0.2223 + 0.7306i
   0.7351 - 0.1045i  -0.2962 - 0.7219i   0.3674 - 0.1313i  -0.0440 - 0.6950i
  -0.0031 - 0.0195i   0.3168 - 0.6788i   0.4030 + 0.2911i   0.2171 - 0.8432i
  -0.6700 - 0.3994i   0.6606 + 0.0546i   0.6598 + 1.2832i  -0.1326 + 0.5711i
   0.0892 - 0.8413i   0.4573 + 0.0319i  -0.0589 - 1.0047i  -0.0424 - 0.0116i
score
score = 
  -0.0812 + 2.1154i  -0.4314 + 1.4340i  -0.8775 - 1.0683i   0.3017 - 0.0138i
  -1.5047 - 1.6008i   0.9383 + 0.5912i  -0.3305 - 0.4482i   0.3796 - 0.2641i
   0.9170 + 1.1712i  -1.2558 + 0.5356i  -0.3824 + 0.0542i   0.0834 - 0.0594i
   0.5283 + 0.9448i  -1.1058 - 0.7001i   0.4632 + 1.2408i   0.0014 + 0.0869i
   1.6497 - 1.1864i  -0.0012 - 2.0990i  -0.6828 - 0.7879i   0.1430 + 0.5765i
  -0.5577 + 0.1855i   0.5324 - 0.0464i   0.3590 + 0.2608i  -0.4757 + 0.0544i
  -0.7034 - 0.4397i   0.4606 + 0.9445i   0.4283 - 0.1126i  -0.1720 + 0.2747i
  -0.2168 - 0.1500i  -0.1541 + 0.1499i  -0.0096 - 0.1599i  -0.4818 - 0.4071i
  -0.1823 + 0.3246i  -0.1215 + 0.5045i  -0.3390 + 0.5827i   0.2223 + 0.7306i
   0.7351 - 0.1045i  -0.2962 - 0.7219i   0.3674 - 0.1313i  -0.0440 - 0.6950i
  -0.0031 - 0.0195i   0.3168 - 0.6788i   0.4030 + 0.2911i   0.2171 - 0.8432i
  -0.6700 - 0.3994i   0.6606 + 0.0546i   0.6598 + 1.2832i  -0.1326 + 0.5711i
   0.0892 - 0.8413i   0.4573 + 0.0319i  -0.0589 - 1.0047i  -0.0424 - 0.0116i
% The columns of X*coeff are orthogonal to each other. This is shown with ...
corrcoef(dataInPrincipalComponentSpace)
ans = 
   1.0000 + 0.0000i   0.0000 - 0.0000i   0.0000 + 0.0000i  -0.0000 + 0.0000i
   0.0000 + 0.0000i   1.0000 + 0.0000i   0.0000 - 0.0000i   0.0000 - 0.0000i
   0.0000 - 0.0000i   0.0000 + 0.0000i   1.0000 + 0.0000i   0.0000 + 0.0000i
  -0.0000 - 0.0000i   0.0000 + 0.0000i   0.0000 - 0.0000i   1.0000 + 0.0000i
%% eigenvector
C = cov(centredX);	
[W, Lambda] = eig(C);   
% Compare the columns of coeff and W.
% (Note that the columns are not necessarily in the same *order*,
%  and they might be slightly different from each other.)
coeff
coeff = 
   0.2054 + 0.4621i   0.7054 + 0.0000i   0.4618 - 0.1538i   0.0985 - 0.0007i
  -0.3912 + 0.4334i   0.2040 - 0.3509i  -0.4306 + 0.0752i  -0.5420 + 0.0973i
  -0.5084 + 0.2284i  -0.0985 - 0.3354i  -0.0393 - 0.0839i   0.7473 - 0.0104i
  -0.2695 + 0.1417i  -0.4587 - 0.0712i   0.6008 - 0.4500i  -0.3362 + 0.1240i
W
W = 
  -0.0926 - 0.0335i   0.4618 + 0.1538i   0.6971 - 0.1082i   0.0333 - 0.5046i
   0.5422 + 0.0964i  -0.3897 - 0.1980i   0.1478 - 0.3780i   0.5480 - 0.2016i
  -0.7047 - 0.2490i   0.0188 - 0.0907i  -0.1487 - 0.3163i   0.5563 + 0.0344i
   0.3583 + 0.0000i   0.7506 + 0.0000i  -0.4642 + 0.0000i   0.3044 + 0.0000i
% The variances of these vectors are the eigenvalues of the covariance matrix, and are also the output "latent". Compare
% these three outputs
var(dataInPrincipalComponentSpace)'
ans = 4x1
    1.6616
    1.2483
    0.7982
    0.2983
latent
latent = 4x1
    1.6616
    1.2483
    0.7982
    0.2983
sort(diag(Lambda),'descend')
ans = 4x1
    1.6616
    1.2483
    0.7982
    0.2983
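ev = (diag(Lambda))';		% Extract the eigenvalues
ev = ev(:, end:-1:1);		% eig returned the eigenvalues in ascending order, so reverse them manually (same for W)
W = W(:, end:-1:1);
sum(W.*W, 1)    % Intended check that each eigenvector has unit norm; for complex W use sum(abs(W).^2, 1) instead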
ans = 
   0.4070 - 0.2162i   0.4907 - 0.1684i   0.8579 + 0.2929i   0.8552 + 0.4616i
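Wr = W(:, 1:4);    % Select the principal-component eigenvectors to keep (all 4 here)
% Compare Tr with dataInPrincipalComponentSpace
% (both are projections onto the principal components)
Tr = centredX * Wr  % Data points in the new coordinate space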
Tr = 
   1.0564 - 1.8346i  -0.2064 + 1.4831i  -0.0619 - 1.3811i  -0.2878 - 0.0915i
   0.5868 + 2.1172i   1.0178 + 0.4404i   0.0042 - 0.5569i  -0.4476 + 0.1164i
  -0.2665 - 1.4634i  -1.1589 + 0.7218i  -0.3386 - 0.1858i  -0.0988 + 0.0269i
  -0.0279 - 1.0821i  -1.2001 - 0.5223i  -0.3731 + 1.2708i   0.0288 - 0.0820i
  -2.0123 + 0.2823i  -0.3230 - 2.0740i  -0.0742 - 1.0399i   0.0654 - 0.5903i
   0.5800 + 0.0954i   0.5190 - 0.1275i   0.1310 + 0.4239i   0.4651 + 0.1137i
   0.4180 + 0.7165i   0.6000 + 0.8627i   0.4103 + 0.1667i   0.2564 - 0.1981i
   0.1221 + 0.2337i  -0.1293 + 0.1718i   0.0882 - 0.1337i   0.3111 + 0.5487i
   0.3125 - 0.2025i  -0.0427 + 0.5171i  -0.6206 + 0.2632i   0.0444 - 0.7624i
  -0.6993 - 0.2497i  -0.4034 - 0.6680i   0.3728 + 0.1151i  -0.1993 + 0.6672i
  -0.0064 + 0.0187i   0.2090 - 0.7193i   0.1481 + 0.4746i  -0.4956 + 0.7159i
   0.4071 + 0.6654i   0.6611 - 0.0474i  -0.2412 + 1.4226i   0.3221 - 0.4899i
  -0.4705 + 0.7031i   0.4568 - 0.0386i   0.5552 - 0.8395i   0.0357 + 0.0255i
%%  SVD
[U, S, V] = svd(centredX);
PCA_result = centredX* V(:,1:PCA_component_Num)
PCA_result = 
  -1.9001 - 0.9334i  -0.4314 + 1.4340i   0.4951 + 1.2908i   0.3018 - 0.0117i
   2.0740 - 0.7248i   0.9383 + 0.5912i   0.1720 + 0.5297i   0.3814 - 0.2615i
  -1.4427 + 0.3622i  -1.2558 + 0.5356i   0.3799 + 0.0694i   0.0838 - 0.0589i
  -1.0779 + 0.0990i  -1.1058 - 0.7001i  -0.0475 - 1.3236i   0.0008 + 0.0870i
   0.4141 + 1.9894i  -0.0012 - 2.0990i   0.3989 + 0.9632i   0.1390 + 0.5774i
   0.0570 - 0.5850i   0.5324 - 0.0464i  -0.2582 - 0.3608i  -0.4760 + 0.0511i
   0.6875 - 0.4642i   0.4606 + 0.9445i  -0.4419 - 0.0285i  -0.1739 + 0.2735i
   0.2252 - 0.1372i  -0.1541 + 0.1499i  -0.0414 + 0.1547i  -0.4790 - 0.4104i
  -0.2226 - 0.2985i  -0.1215 + 0.5045i   0.5057 - 0.4458i   0.2172 + 0.7321i
  -0.2031 + 0.7142i  -0.2962 - 0.7219i  -0.3900 + 0.0085i  -0.0392 - 0.6952i
   0.0190 + 0.0051i   0.3168 - 0.6788i  -0.2904 - 0.4036i   0.2229 - 0.8417i
   0.6372 - 0.4500i   0.6606 + 0.0546i  -0.2206 - 1.4259i  -0.1365 + 0.5701i
   0.7326 + 0.4233i   0.4573 + 0.0319i  -0.2615 + 0.9718i  -0.0423 - 0.0119i

'PCA_result', 'dataInPrincipalComponentSpace' and 'Tr' are different from each other. (They are the same with real-valued data.)

Besides, I see some people using 'centredX* V(:,1:PCA_component_Num)' while others use 'U(:,rk)*S(rk,:)*V'' to get the principal components. Which one is correct?
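
On that last question, the two expressions appear to compute different things rather than one being right and the other wrong. Multiplying the centred data by the leading right singular vectors gives the scores (coordinates in principal-component space), which equal U(:,1:k)*S(1:k,1:k). The triple product U*S*V' restricted to k components gives a rank-k reconstruction of the centred data in the original variables. A minimal sketch, assuming real-valued data and the economy-size SVD; k, scoresA, scoresB and Xk are names made up for the illustration.

```matlab
rng('default');
X  = rand(13,4);
Xc = bsxfun(@minus, X, mean(X));          % centre each column
k  = 2;                                   % number of components to keep

[U, S, V] = svd(Xc, 'econ');

% Scores: coordinates of each observation on the first k components
scoresA = Xc * V(:, 1:k);                 % 13-by-2
scoresB = U(:, 1:k) * S(1:k, 1:k);        % same values, computed from U and S

% Rank-k reconstruction: back in the space of the 4 original variables
Xk = U(:, 1:k) * S(1:k, 1:k) * V(:, 1:k)';    % 13-by-4

scoreDiff = max(abs(scoresA(:) - scoresB(:)))  % expected to be near machine precision
```

So the first form is the one to use for principal-component scores, and the second for a compressed or denoised version of the data itself.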

the cyclist, 25 June 2024

Short answer

I don't know.

Longer answer

I've never done PCA on complex numbers, and I've never even heard of someone doing PCA on complex numbers. (I didn't find anything on a quick web search, and ChatGPT seems to hallucinate on some of the questions I've asked.)

At first, I thought the issue with your version might be due to the fact that you did not use all the principal components in some of the computation. I also thought there might be severe numerical instability because of cancelation in the random input matrix. But I don't think either of those explanations is correct. (I've tried other input matrices, and get the same issues.)

I definitely don't understand why one would get different results from SVD, eig(cov matrix), and PCA.

I think this is worth crafting a smaller example with non-random input, and posting a brand-new question.
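
One possible explanation, offered as a sketch rather than a confirmed diagnosis: a unit-norm eigenvector of a Hermitian matrix is only determined up to a complex factor of modulus 1 (for real data this reduces to the familiar sign flip). Different routines are free to pick different phases, so the projected columns can look unrelated while having identical magnitudes. The posted outputs are consistent with this: the first entries of score, Tr and PCA_result all have a magnitude of about 2.117. The code below assumes the covariance is formed as Xc'*Xc/(M-1) with the conjugate transpose, and the names Xc, phase, W_aligned, magDiff and alignDiff are made up for the illustration.

```matlab
rng(0);                                        % fixed seed for a repeatable run
M = 13; N = 4;
X  = rand(M,N) + 1i*randn(M,N);
Xc = bsxfun(@minus, X, mean(X));               % centre each column

% Route 1: eigenvectors of the Hermitian covariance matrix
C = (Xc' * Xc) / (M - 1);                      % ' is the conjugate transpose
C = (C + C') / 2;                              % remove any rounding asymmetry
[W, Lambda] = eig(C);
[~, order] = sort(real(diag(Lambda)), 'descend');
W = W(:, order);

% Route 2: right singular vectors of the centred data
[~, ~, V] = svd(Xc, 'econ');

% The squared norm of a complex vector needs abs(), not W.*W
colNorms = sum(abs(W).^2, 1);

% Per-column phase factor relating the two sets of vectors
phase = sum(conj(V) .* W, 1);                  % v_k' * w_k, modulus close to 1
W_aligned = bsxfun(@rdivide, W, phase);        % rotate each column of W onto V

magDiff   = max(max(abs(abs(Xc*W) - abs(Xc*V))))   % magnitudes of the projections agree
alignDiff = max(max(abs(Xc*W_aligned - Xc*V)))     % full agreement once the phase is removed
```

If both differences come out near machine precision, the routes agree up to phase, and comparing abs() of the projections (or aligning the phases first) is the meaningful check.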


Answer 2

Yaser Khojah, 17 April 2019

https://kr.mathworks.com/matlabcentral/answers/259957-how-to-apply-pca-correctly#answer_371101

Dear the cyclist, thanks for showing this example. I have a question regarding the order of the columns of coeff, since they differ from those of V. Is there any way to see the order of these columns? In other words, which variable does each column correspond to?
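
For what it is worth, a column of coeff does not correspond to a single variable. Each column is one principal component, with the columns sorted by decreasing component variance (latent), and each row corresponds to one original variable, in the same order as the columns of X. A small sketch that labels the loadings accordingly; the variable names x1 to x5 and the labels PC1 to PC5 are hypothetical.

```matlab
rng('default');
X = rand(7,5);
varNames = {'x1','x2','x3','x4','x5'};           % hypothetical names for the columns of X
pcNames  = {'PC1','PC2','PC3','PC4','PC5'};      % one label per principal component

[coeff, ~, latent] = pca(X);

% Rows are original variables, columns are components
loadings = array2table(coeff, 'RowNames', varNames, 'VariableNames', pcNames)

% latent holds the component variances, largest first
isDescending = all(diff(latent) <= 0)
```

Reading down a column of the table shows how much each original variable contributes to that component.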


Yuan Luo, 8 November 2020

Why does X need to be de-meaned, since pca centers the data by default?

the cyclist, 26 December 2020

Sorry it took me a while to see this question.

If you do

[coeff,score] = pca(X);

it is true that pca() will internally de-mean the data. So, score is derived from de-meaned data.

But it does not mean that X itself [outside of pca()] has been de-meaned. So, if you are trying to re-create what happens inside pca(), you need to manually de-mean X first.
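
This also bears on the original question of why X * coeff differs from score. A minimal sketch, assuming the sixth output of pca (the estimated column means, called mu here) and a 12-by-4 random matrix standing in for the real data; rawProjection, centredProjection, rawDiff and centredDiff are names made up for the illustration.

```matlab
rng('default');
X = rand(12,4);                            % stand-in for the 12-by-4 data matrix

% The sixth output is the column mean that pca subtracts internally
[coeff, score, ~, ~, ~, mu] = pca(X);

rawProjection     = X * coeff;                        % X has not been centred
centredProjection = bsxfun(@minus, X, mu) * coeff;    % centre first, then project

rawDiff     = max(abs(rawProjection(:) - score(:)))      % clearly non-zero
centredDiff = max(abs(centredProjection(:) - score(:)))  % expected to be near machine precision
```

The uncentred projection is offset from score by mu * coeff in every row, which is why the two do not match until the mean is removed.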


Answer 3

Greg Heath, 13 December 2015

https://kr.mathworks.com/matlabcentral/answers/259957-how-to-apply-pca-correctly#answer_203067

http://www.mathworks.com/matlabcentral/answers/18439-pca-matrix-data-compression-help

Hope this helps.

Thank you for formally accepting my answer

Greg
