PCA or LSA on a very large dataset

2 views (last 30 days)
Eugene Kogan on 30 Aug 2012
Answered: arushi on 30 Jul 2024
I am trying to run LSA or PCA on a very large dataset, 50k docs by 300k terms, to reduce the dimensionality of the words.
My system runs out of memory and grinds to a halt. I am using this code on the TFIDF matrix to compute the LSA, and it gets stuck on SVD:
% SVD decomposition of the tf-idf matrix
[U, S, V] = svd(tfidfmatrix);
% Keep only the top K singular values/vectors (rank-K approximation)
Sk = S(1:K, 1:K);
Uk = U(:, 1:K);
% Project into the reduced LSA space
output = inv(Sk) * Uk';
For PCA I am using PRINCOMP. In both cases, it seems I have too many terms for my system to handle.
Is there a more efficient way to do dimensionality reduction on this many terms? The end goal is to visualize the documents in 2D or 3D, where they are grouped by similarity to each other.

Answers (1)

arushi on 30 Jul 2024
Hi Eugene,
When dealing with very large datasets, such as your 50k-documents-by-300k-terms matrix, traditional methods like full Singular Value Decomposition (SVD) and Principal Component Analysis (PCA) can be computationally expensive and memory-intensive. Instead, you can use more efficient methods designed for large-scale data.

Efficient Methods for Dimensionality Reduction
1. Truncated SVD (also known as Latent Semantic Analysis - LSA):
  • Instead of computing the full SVD, compute only the top K singular values and vectors with svds in MATLAB, which accepts sparse matrices directly (see the sketch after this list).
2. Incremental PCA:
  • Incremental PCA is designed to handle large datasets by processing data in chunks.
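
As a rough illustration of option 1, here is a minimal sketch of the truncated-SVD route, assuming the tf-idf data is stored as a sparse documents-by-terms matrix named tfidfmatrix (as in your code) and that two dimensions are kept for plotting; the variable names and K = 2 are only placeholders:

% tfidfmatrix: sparse 50k-by-300k matrix (documents in rows, terms in columns)
K = 2;  % number of latent dimensions to keep (2 or 3 for visualization)
% Truncated SVD: computes only the K largest singular values and vectors
[Uk, Sk, Vk] = svds(tfidfmatrix, K);
% Document coordinates in the reduced LSA space (one row per document)
docCoords = Uk * Sk;
% 2-D scatter plot of the documents
scatter(docCoords(:, 1), docCoords(:, 2), 10, 'filled');
xlabel('LSA dimension 1');
ylabel('LSA dimension 2');

If your matrix is stored as terms-by-documents instead, the document coordinates would come from Vk*Sk rather than Uk*Sk.
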
Hope this helps.
