Is it suitable to include the dependent variable when conducting PCA?

Xiaoxiao Du on 5 Jun 2022
Commented: Xiaoxiao Du on 8 Jun 2022
The question arises from the following situation:
Suppose there are N stationary time-series variables (X1 to XN), each with 1,000,000 observations, and all of them are correlated to some degree (say, 0.5 to 0.95). I run PCA on the first 800,000 observations as in-sample data, transforming the original coordinate system into a new set of PC coordinates, then run a regression against the PC scores and use it to check whether any variable shows anomalies in the out-of-sample observations (800,001 to 1,000,000). Which methodology should I adopt? (Assume PC1 and PC2 explain 95% of the total variance.)
  1. Run PCA on X1 to XN over the 800,000 in-sample observations. For a given Xy, compute predict_Xy = regress(Xy, [PC1 PC2]), where Xy itself contributes to PC1 and PC2. If this is valid, how should I interpret the result?
  2. Run PCA on X1 to XN excluding Xy over the 800,000 in-sample observations. For a given Xy, compute predict_Xy = regress(Xy, [PC1 PC2]), where Xy does not contribute to PC1 or PC2. This is easier to interpret, but if an anomaly can occur in any variable, do I have to run a separate regression for each variable to detect anomalies?
I have done this analysis on a data set of 10,000,000 observations by 4 variables. The estimated error variance from Method 2 is 3 times that of Method 1 (which is understandable, since Xy itself enters the prediction function in Method 1). Method 1 looks much more appealing, but it is counterintuitive.
Or perhaps I should not use PCA in the first place? If so, which methodology should I pick?
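
To make the comparison between Method 1 and Method 2 concrete, here is a minimal sketch in Python with NumPy (not MATLAB, and not the asker's actual code or data). It assumes a made-up data set of 4 series driven by 2 latent factors plus noise. The loadings, noise level, sample size, and the helper names `pca_directions` and `residual_variance` are all illustrative assumptions. It fits both methods in-sample and compares the residual variance of the regression of Xy on the first two PC scores.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in data: 4 correlated series driven by 2 latent factors.
n_train = 4000
loadings = np.array([[1.0, 1.0, 1.0, 1.0],
                     [0.5, -0.5, 0.5, -0.5]])
factors = rng.standard_normal((n_train, 2))
X = factors @ loadings + 0.3 * rng.standard_normal((n_train, 4))
Xc = X - X.mean(axis=0)  # center with the in-sample mean


def pca_directions(centered, k):
    """Return the first k principal directions as columns, via SVD."""
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k].T


def residual_variance(scores, target):
    """Regress target on [1, scores] by OLS; return the residual variance."""
    design = np.column_stack([np.ones(len(scores)), scores])
    beta, *_ = np.linalg.lstsq(design, target, rcond=None)
    return np.var(target - design @ beta)


y_col = 0              # index of Xy (arbitrary choice for the illustration)
y = Xc[:, y_col]

# Method 1: PCs computed from all variables, including Xy.
scores_with = Xc @ pca_directions(Xc, 2)
var_method1 = residual_variance(scores_with, y)

# Method 2: PCs computed from the remaining variables only.
others = np.delete(Xc, y_col, axis=1)
scores_without = others @ pca_directions(others, 2)
var_method2 = residual_variance(scores_without, y)

print("Method 1 residual variance:", var_method1)
print("Method 2 residual variance:", var_method2)
```

On data built this way, Method 1 should report the smaller residual variance, because part of Xy's own noise is carried by PC1 and PC2. A smaller in-sample error therefore does not by itself show that Method 1 is better at detecting anomalies in Xy.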

Accepted Answer

the cyclist on 5 Jun 2022
I've never used PCA on time series data, and I had never heard of anyone doing so, but googling PCA time series does turn up some links.
Regardless, I think you definitely do not want to do the version where you include Xy in the PCA. I believe that when people commingle the explanatory and response variables, they use canonical correlation analysis. But that is also typically not for time series data.
I think it is possible that a vector autoregression model might be closer to the right model for your system. But, it is not clear to me how to identify the "anomalies" you are talking about.
Looks like something called "dynamic factor analysis" might also be useful, but I have no experience with it.
I hope that that is at least a little helpful.
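
As a rough illustration of the vector autoregression idea mentioned above, here is a minimal sketch in Python with NumPy (an assumption of this note, not something the answer provides). It simulates a two-variable VAR(1) with a made-up coefficient matrix `A_true`, then estimates the coefficients by ordinary least squares on the lagged values. Treating large one-step-ahead residuals as candidate anomalies is only one possible reading of the problem.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate a stationary VAR(1): x_t = A x_{t-1} + e_t  (A_true is a made-up example).
A_true = np.array([[0.5, 0.2],
                   [0.1, 0.4]])
n_obs = 5000
x = np.zeros((n_obs, 2))
for t in range(1, n_obs):
    x[t] = A_true @ x[t - 1] + 0.1 * rng.standard_normal(2)

# Stack the regression x_t = c + A x_{t-1} + e_t and solve it by least squares.
lagged = np.column_stack([np.ones(n_obs - 1), x[:-1]])
coef, *_ = np.linalg.lstsq(lagged, x[1:], rcond=None)
c_hat = coef[0]      # estimated intercept vector
A_hat = coef[1:].T   # estimated coefficient matrix

# One-step-ahead residuals; unusually large ones are candidate anomalies.
residuals = x[1:] - lagged @ coef

print(np.round(A_hat, 2))
```

With enough observations, `A_hat` should land close to `A_true`. For real data, the lag order and the rule for what counts as a large residual would both need to be chosen.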
  7 Comments
the cyclist on 7 Jun 2022
If I had to put a name on the thing you are trying to do, it would be "multivariate outlier detection". Googling that phrase does turn up some hits that seem quite related to what you are trying to do. This paper discusses a few techniques.
It seems that if you are primarily interested in "these data points are far away from the bulk of the data points, in N-dimensional space", then there is some good information for you there.
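
One common way to measure "far away from the bulk of the data points, in N-dimensional space" is the Mahalanobis distance. The sketch below, in Python with NumPy, is a hedged illustration only: the data are synthetic, the injected point is hypothetical, and the 99.9% in-sample quantile used as a threshold is an arbitrary assumption, not a recommendation taken from the paper mentioned above.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical correlated data: an in-sample block and an out-of-sample block.
cov_true = np.array([[1.0, 0.8],
                     [0.8, 1.0]])
train = rng.multivariate_normal([0.0, 0.0], cov_true, size=2000)
test = rng.multivariate_normal([0.0, 0.0], cov_true, size=200)

# Injected point: each coordinate alone is only 2 standard deviations out,
# but the pair contradicts the positive correlation.
test[10] = [2.0, -2.0]

# Estimate location and scatter from the in-sample block only.
mean = train.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(train, rowvar=False))


def mahalanobis_sq(points):
    """Squared Mahalanobis distance of each row from the in-sample mean."""
    diff = points - mean
    return np.einsum("ij,jk,ik->i", diff, cov_inv, diff)


# Assumed threshold: the 99.9% quantile of the in-sample distances.
threshold = np.quantile(mahalanobis_sq(train), 0.999)

d2 = mahalanobis_sq(test)
flagged = np.flatnonzero(d2 > threshold)
print("flagged out-of-sample rows:", flagged)
```

The injected row should be flagged even though neither of its coordinates is extreme on its own. That is the case a per-variable check would miss. The distance treats all variables symmetrically, so no single variable has to be singled out as the dependent one.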
Xiaoxiao Du on 8 Jun 2022
Thanks! That is a very clear summary, especially the 'N-dimensional space' framing.
I'll read papers in the area you mentioned. May I leave the question open so that I can post updates on my progress, and in case others have new thoughts?
Thanks again!
