Find the K most orthogonal vectors in a set of vectors

조회 수: 1 (최근 30일)
Peter Cook
Peter Cook 2016년 5월 19일
댓글: Peter Cook 2016년 6월 8일
Hello All,
The context of this particular search is a step in tuning a spectral clustering routine a la Ng et al 2002. The purpose of this search is to give a initialization point for k-means clustering in higher dimensional space. In particular, well separated data ought to sit in K tight clusters on the surface of a hypersphere, so the purpose of this search is to find the locations of these cluster centroids with which to initialize k-means clustering of the same data.
I have a data matrix "Y" with O(10^4) rows and O(10^2-10^3) columns (the columns of this matrix are the [transformed and normalized a couple times] K largest eigenvectors of the affinity matrix).
Ng et al suggest "Briefly, we let the first cluster centroid be a randomly chosen row of Y, and then repeatedly choose as the next centroid the row of Y that is closest to being 90 degrees from all the centroids already picked." I translated this mathspeak to mean I need to take a bunch of dot products and look for values close to zero. This was quoted as computationally cheap (perhaps it is for clustering fewer points into say O(10^1) clusters or perhaps they meant it in the sense that it requires fewer iterations of k-means clustering once initialized), but my CPU is dragging ass at it.
So far I've tried 2 approaches: Approach #1 - Compute everything then search
% "cheap" initialization of k-means
dotProductY = zeros(length(Y)); %preallocate to make the parser turn green
% compute dot product of every row with every other row first
for k = 1:length(Y)
dotProductY(:,k) = sum(bsxfun(@times,Y(k,:),Y),2);
end
dotProductY(logical(eye(length(Y)))) = nan; %exclude dot product of row with self
centroidIdx = randi(length(Y)); %initialize on a random row of Y
dotProductY(centroidIdx,:) = nan; %dont pick the same row twice
dotProductY = abs(dotProductY); %use the absolute value because looking for closer to zero
for k = 2:K
[~,im] = min(sum(dotProductY(:,centroidIdx),2)); %find next best centroid
centroidIdx(k) = im; %reassign
dotProductY(centroidIdx,:) = nan; %dont pick the same row twice
end
Approach #2 - Simultaneous computation and search
%try a cheaper one?
centroidIdx = randi(length(Y)); %initialize on a random row of Y
for k = 1:K-1
dotProductY(:,k) = sum(bsxfun(@times,Y(centroidIdx(k),:),Y),2); %compute inner product
dotProductY(centroidIdx,:) = nan; %dont pick the same row twice
[~,im] = min(sum(abs(dotProductY),2)); %find next best centroid
centroidIdx(k+1) = im; %reassign
end
Neither of these approaches seems cheap to me. Anyone else take a stab at this before? Any suggestions?

채택된 답변

Matt J
Matt J 2016년 5월 19일
편집: Matt J 2016년 5월 19일
Your computation of dotProductY could be more efficient. The most vectorized may of computing it, I believe is
dotProductY=abs(Y*Y.');
I expect that would have been the main bottleneck.
  댓글 수: 2
Matt J
Matt J 2016년 5월 19일
편집: Matt J 2016년 5월 22일
This part
for k = 2:K
[~,im] = min(sum(dotProductY(:,centroidIdx),2)); %find next best centroid
centroidIdx(k) = im; %reassign
dotProductY(centroidIdx,:) = nan; %dont pick the same row twice
end
also looks like it could be incrementalized as follows
im=randi(size(Y,1));
centroidIdx(1)=im;
temp=dotProductY(:,im);
for k = 2:K
[~,im] = min(temp); %find next best centroid
centroidIdx(k) = im; %reassign
temp=temp+dotProductY(:,im); %update temp
temp(im) = inf; %dont pick the same row twice
end
Peter Cook
Peter Cook 2016년 6월 8일
Thanks for the help, I can't believe I had that boneheaded dotProductY computation in there. The algorithm runtime is still quite slow, but that, I am accepting, is to be expected for most clustering algorithms with this amount of data.

댓글을 달려면 로그인하십시오.

추가 답변 (0개)

카테고리

Help CenterFile Exchange에서 Statistics and Machine Learning Toolbox에 대해 자세히 알아보기

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by