Speed up for loop in this code for calculating mutual information (maybe using GPU computing)

Question

Ansh 2018년 1월 14일

0
링크

이 질문에 대한 바로 가기 링크

https://kr.mathworks.com/matlabcentral/answers/377090-speed-up-for-loop-in-this-code-for-calculating-mutual-information-maybe-using-gpu-computing

댓글: Ansh 2018년 1월 16일

So I want to use the code given HERE where we use the Kraskov estimation procedure to estimate the mutual information between two time series (for more information see Ref: Kraskov, Alexander, Harald Stögbauer, and Peter Grassberger. "Estimating mutual information." Physical review E 69.6 (2004): 066138).

Whilst the code seems to work fine for my uses, because of the length of time series and the amount of different time series (I am calculating mutual information between pairs of many different time series) I have, the code runs too slowly. I ran the code profiler in MATLAB and seems to be the case that the following section code in the function provided in the link is causing the slowdown:

% compute distance between each sample and its k-th nearest neighbour
dz = zeros(nObs, nObs);
dx = zeros(nObs, nObs);
dy = zeros(nObs, nObs);
for i = 1:nObs
  for j = 1:nObs
      dx(i,j) = sqrt(sum((X(i, :) - X(j, :)).^2));
      dy(i,j) = sqrt(sum((Y(i, :) - Y(j, :)).^2));
      dz(i,j) = max([dx(i, j), dy(i, j)]);
  end
end

Is there any way to speed this up? I was thinking maybe a GPU based total/partial solution might be feasible/offer a sufficient speed up. Any alternative suggestions would be very helpful (maybe parallelsing and using a parfor loop instead, although the speed up would be less and would make my future projects more complicated). I am currently using MATLAB 2016b.

댓글 수: 3
이전 댓글 1개 표시이전 댓글 1개 숨기기

Jan 2018년 1월 14일

How log are the X(i, :)? It matters if X is a [10 x 10'000] or [10'000 x 10] matrix.

Ansh 2018년 1월 14일

편집: Ansh 2018년 1월 14일

Hi Jan, nObs=size(X,1), where X is a 4364 by 1 vector.

Not sure if this matters, but potentially Y could be 4364 by 2 to (maybe) 4364 by 25/30 (the higher dimension of the columns is given as ideally I want the algorithm to work fast enough for the high dimension of Y, but 2 columns in Y is definitely a lower bound). Does this clarify things with regards to your answer given below?

댓글을 달려면 로그인하십시오.

이 질문에 답변하려면 로그인하십시오.

Answer 1

Jan 2018년 1월 14일

2
링크

이 답변에 대한 바로 가기 링크

https://kr.mathworks.com/matlabcentral/answers/377090-speed-up-for-loop-in-this-code-for-calculating-mutual-information-maybe-using-gpu-computing#answer_300083

편집: Jan 2018년 1월 15일

MATLAB Online에서 열기

For a 1000x1000 matrix, this is 6 times faster already:

% Version 1:
n  = size(X, 1);
X = X.';
Y = Y.';
dx = zeros(n, n);
dy = zeros(n, n);
for j = 1:n
   Xj = X(:, j);
   Yj = Y(:, j);
   for i = j+1:n
      dx(i,j) = sqrt(sum(bsxfun(@minus, X(:, i), Xj) .^ 2));
      dy(i,j) = sqrt(sum(bsxfun(@minus, Y(:, i), Yj) .^ 2));
      dx(j,i) = dx(i,j);
      dy(j,i) = dy(i,j);
   end
end
dz = max(dx, dy);

The original function took 29.5 sec (R2016b, Core2Duo, Win7/64), and the cleaned version 5.2 sec.

Here the data are processed columnwise, which is much faster because neighboring elements are accessed much faster in the memory. Then the comparison my max() is done outside the loop. And finally the resulting matrix is symmetric and you can omit the computation of X(:,i) and X(:,j) if you have the results for X(:,j) and X(:,i) already.

I tried to vectorized the inner loop:

% Version 2:
n  = size(X, 1);
X = X.';
Y = Y.';
dx = zeros(n, n);
dy = zeros(n, n);
for j = 1:n
   dx(j+1:n,j) = sqrt(sum(bsxfun(@minus, X(:, j+1:n), X(:, j)) .^ 2, 1));
   dy(j+1:n,j) = sqrt(sum(bsxfun(@minus, Y(:, j+1:n), Y(:, j)) .^ 2, 1));
   dx(j,j+1:n) = dx(j+1:n,j);
   dy(j,j+1:n) = dy(j+1:n,j);
end
dz = max(dx, dy);

But this takes 21 sec for 1000x1000 arrays. But for smaller 100x100 inputs it is faster: 1.2 sec instead of 2.2 sec (100 iterations).

Now you have an efficient function to start a parallelization or computation on the GPU. Maybe this is useful (I cannot test it):

% Version 3:
parfor v = 1:2
  if v == 1
    for j = 1:n
      dx(j+1:n, j) = sqrt(sum((X(:, j+1:n) - X(:, j)) .^ 2, 1));
      dx(j, j+1:n) = dx(j+1:n, j);
    end
  else
    for j = 1:n
      dy(j+1:n, j) = sqrt(sum((Y(:, j+1:n) - Y(:, j)) .^ 2, 1));
      dy(j, j+1:n) = dy(j+1:n, j);
    end
  end
end

But parfor for the inner loop will use more cores.

댓글 수: 7
이전 댓글 5개 표시이전 댓글 5개 숨기기

Jan 2018년 1월 15일

편집: Jan 2018년 1월 15일

MATLAB Online에서 열기

Whoops! I've typed a long addition after an "[EDITED]", but it has vanished. What a pity.

For the inputs X:[4364 x 1] and Y:[4364 x 30] this is the fastest method:

n  = size(X, 1);
% X = X.';
Y  = Y.';
dx = zeros(n, n);
dy = zeros(n, n);
for j = 1:n
   dx(j+1:n,j) = abs(X(j+1:n) - X(j));
   dx(j,j+1:n) = dx(j+1:n,j);
end
for j = 1:n
   dy(j+1:n,j) = sqrt(sum(bsxfun(@minus, Y(:, j+1:n), Y(:, j)) .^ 2, 1));
   dy(j,j+1:n) = dy(j+1:n,j);
end
dz = max(dx, dy);

For a vector "sqrt(sum(v .^ 2))" can be replaced by the cheaper "abs()". I'm not sure why it is faster to run it in 2 loops, maybe a better usage of the processor cache. It takes 3.3 sec instead of 66 sec of the original version.

These loops cannot be parallelized directly, as you have found out already. Even if it would work, the result could be slower than the serial processing, because writing to the rows of a matrix can cause many cache line collisions such that the memory access of the threads blocks each other.

Try this:

% Version 4:
n  = size(X, 1);
Y  = Y.';
dx = zeros(n, n);
dy = zeros(n, n);
parfor q = 1:2
  if q == 1
    for j = 1:n
      dx(j+1:n,j) = abs(X(j+1:n) - X(j));
      dx(j,j+1:n) = dx(j+1:n,j);
    end
  end
else
  for j = 1:n
     dy(j+1:n,j) = sqrt(sum(bsxfun(@minus, Y(:, j+1:n), Y(:, j)) .^ 2, 1));
     dy(j,j+1:n) = dy(j+1:n,j);
  end
end
dz = max(dx, dy);

The first part of the loop will be much faster than the 2nd part, so the speed up will by lower than a factor 2.

If you have a many core machine, try to parallelize the Version 1. But if this is still the bottleneck of your code, a C-Mex might be more promising. In your original question you have written "the amount of different time series" - maybe it is easier to use parfor to process the different set of time series simultaneously instead of parallelizing a small inner loop of a subfunction. The goal is to let all cores work.

Without PARFOR my suggestion has a speedup of factor 20 already. Then it is time to run the profiler again if this piece of code is still the bottleneck of the code. By the way: I am happy about the speed up, and you?

[EDITED] I've replaced

sqrt(sum((Y(:, j+1:n) - Y(:, j)) .^ 2, 1))

by

sqrt(sum(bsxfun(@minus, Y(:, j+1:n), Y(:, j)) .^ 2, 1))

and the code runs 10% faster under R2016b. Interesting!

Ansh 2018년 1월 15일

편집: Ansh 2018년 1월 15일

MATLAB Online에서 열기

I think using a PARFOR loop to deal with the different pairs time series simultaneously is going to be the best option as you suggested. For the current analysis, only one 400 by 400 array is needed, so I can create another version of the code for that where the parfor is used on the outside to utilise the cores to deal with the pairs of time series.

The reason why I was resistant to use it is because a future part of this project will require not just one 400 by 400 array where each element is the mutual information between a pair of time series (luckily it is symmetric and diagonal elements won't matter too much), but potentially about 100 so better to restrain from using parfor loops and optimise for single CPU. Then when I start my future project I can use the parfor loop on the outside to speed it up there.

On my machine, version 1 gives a very good speed up already and I am very happy with the results so far, but in view of the above I might need to try optimising the code for a single CPU further. I ran the profiler again and I don't think that part can be optimised anymore. There is however, another part of the original code that is now a bottleneck:

 nx1 = zeros(nObs, 1);
 ny1 = zeros(nObs, 1);
 nx2 = zeros(nObs, 1);
 ny2 = zeros(nObs, 1);
   for i = 1:nObs
    dxSample = dx(i, :);
    dxSample(i) = [];
    dySample = dy(i, :);
    dySample(i) = [];
    dzSample = dz(i, :);
    dzSample(i) = [];
    [EpsSample, NnSample] = sort(dzSample, 'ascend');
    Eps(i) = EpsSample(k);
    Nn(i) = NnSample(k);
    nx1(i) = sum(dxSample < Eps(i));
    ny1(i) = sum(dySample < Eps(i));
    nx2(i) = sum(dxSample <= Eps(i));
    ny2(i) = sum(dySample <= Eps(i));
 end

Is there a more efficient way to do this calculation?

Jan 2018년 1월 15일

@Ansh: I have edited the posted codes and replaced the auto-expansion by an explicit bsxfun. This let the code run 10% faster again.

Is the original problem solved? Then it would be kind to accept the answer. I've spent more than an hour to improve the code and it has been an interesting problem and I'd be glad to here a "thanks".

The code from your comment is a new problem. Please open a new thread and provide some input data. Explain there: Is Eps pre-allocated? What is the typical value of k? Sorting the complete array only to find the k.th smallest element is a waste of time.

Ansh 2018년 1월 16일

I will accept the answer and start a new question. Definitely incredibly thankful for all your help so far. Apologies if I didn't seem grateful before.

댓글을 달려면 로그인하십시오.

Speed up for loop in this code for calculating mutual information (maybe using GPU computing)

댓글 수: 3
이전 댓글 1개 표시이전 댓글 1개 숨기기

채택된 답변

댓글 수: 7
이전 댓글 5개 표시이전 댓글 5개 숨기기

추가 답변 (0개)

참고 항목

카테고리

태그

Community Treasure Hunt

Speed up for loop in this code for calculating mutual information (maybe using GPU computing)

댓글 수: 3 이전 댓글 1개 표시이전 댓글 1개 숨기기

채택된 답변

댓글 수: 7 이전 댓글 5개 표시이전 댓글 5개 숨기기

추가 답변 (0개)

참고 항목

카테고리

태그

Community Treasure Hunt

댓글 수: 3
이전 댓글 1개 표시이전 댓글 1개 숨기기

댓글 수: 7
이전 댓글 5개 표시이전 댓글 5개 숨기기