Why is this loop faster than a vectorised version? Could the vectorised version be made faster than the loop?

Question

Michael on 22 Nov 2023

https://kr.mathworks.com/matlabcentral/answers/2050857-why-is-this-loop-faster-than-a-vectorised-version-could-the-vectorised-version-be-made-faster-than

Commented: Alexander on 22 Nov 2023

I'm trying to improve the performance of some code that uses a loop. I've written a vectorised version that matches its functionality while avoiding costly transposes. However, I've found that the loop version invariably runs ~25% faster. Is there any way to further improve the vectorised version so that it outperforms the loop?

Of course, this is a tiny sub-function of a much larger, more complex program, but it is called tens of thousands of times in a single run, and is a bottleneck in the run time.

I do have the parallel computing toolbox, so could look into using parfor loops, but these don't always save time, and I was surprised that the vectorised version doesn't perform better!

% Input vectors
x1 = rand(1, 960);
x2 = rand(1, 960);
%% Looped version
tic;
Y1 = 1.75:0.25:39;
Y2 = (10 .^ (Y1 / 21.366) - 1 ) / 0.004368;
EoutLoop = zeros(length(x2), length(Y1)); 
for i=1:length(Y1)
    p1 = 4.*Y2(i)./24.673.*(0.004368.*Y2(i) + 1).*ones(1, length(x2));
    p2 = 4.*Y2(i)./24.673.*(0.004368.*Y2(i) + 1) - 0.35.*(4.*Y2(i)./24.673.*(0.004368.*Y2(i) + 1)./4.*1000./24.673.*(0.004368.*1000 + 1)).*(x2 - 51);
    p3  = p1.*(x1 >= Y2(i)) + p2 .* (x1 < Y2(i)); 
    g1 = (x1 - Y2(i))./Y2(i); 
    g1 = abs(g1);
    EiLoop = (1 + p3.*min(g1,4)).*exp(-1.*p3.*min(g1,4)).* 10.^(x2./10);
    EoutLoop((1:length(x2)), i) = EiLoop(1:length(x2));
end
if (size(EoutLoop,1) > 1) 
    EoutLoop = sum(EoutLoop);
end
EoutLoop = 10 .* log(EoutLoop) ./ log(10);
% end timer
toc;
%% Vectorised version
% transpose input vector for vectorised version
x2 = x2.';
x1 = x1.';
% start timer
tic;
Y1 = 1.75:0.25:39;
Y2 = (10 .^ (Y1 / 21.366) - 1 ) / 0.004368;
EoutVec = zeros(length(x2), length(Y1)); 
p1 = 4.*Y2./24.673.*(0.004368.*Y2 + 1).*ones(length(x2), length(Y2));
p2 = 4.*Y2./24.673.*(0.004368.*Y2 + 1) - 0.35.*(4.*Y2./24.673.*(0.004368.*Y2 + 1)./4.*1000./24.673.*(0.004368.*1000 + 1)).*repmat((x2 - 51), 1, length(Y2));
p3  = p1.*(x1 >= repmat(Y2, 1, size(x1, 2))) + p2 .* (x1 < repmat(Y2, 1, size(x1, 2))); 
g1 = ((x1 - repmat(Y2, 1, size(x1, 2)))./repmat(Y2, 1, size(x1, 2))); 
g1 = abs(g1);
EVec = (1 + p3.*min(g1,4)).*exp(-1.*p3.*min(g1,4)).*repmat(10.^(x2./10), 1, length(Y2));
EoutVec((1:length(x2)), :) = EVec(1:length(x2), :);
if (size(EoutVec,1) > 1) 
    EoutVec = sum(EoutVec);
end
EoutVec = 10.*log(EoutVec)./log(10);
% end timer
toc;

Comments: 3

Michael on 22 Nov 2023

Ok, that's helpful, thanks.

Alexander on 22 Nov 2023

I agree. On my old Win7 machine (R2021b) the result is

Loop: Elapsed time is 0.316945 seconds.

Vectorised: Elapsed time is 0.062135 seconds.

Answer 1

Dyuman Joshi on 22 Nov 2023

https://kr.mathworks.com/matlabcentral/answers/2050857-why-is-this-loop-faster-than-a-vectorised-version-could-the-vectorised-version-be-made-faster-than#answer_1358077

Edited: Dyuman Joshi on 22 Nov 2023

Ideally, use timeit rather than tic-toc to get a more accurate measure of run time. tic-toc is generally used for timing portions of code.

"Use the timeit function for a rigorous measurement of function execution time. Use tic and toc to estimate time for smaller portions of code that are not complete functions." Reference - Measure the Performance of Your Code

While using tic-toc to measure the time of the code, you can either

> Run the same code multiple times via a for loop and average the data - "Sometimes programs run too fast for tic and toc to provide useful data. If your code is faster than 1/10 second, consider measuring it running in a loop, and then average to find the time for a single run." (Reference - https://in.mathworks.com/help/matlab/ref/tic.html#bswc7ww-3)

or

> Take a large(r) dataset.
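As a rough illustration of the first option, here is a minimal sketch that wraps the whole repetition loop in a single tic/toc pair and divides by the number of runs. The `work` handle and `nRuns` are placeholder names chosen for this example, not part of the original code, and the workload is only a stand-in for the code being measured.

```matlab
% Hypothetical workload standing in for the code under test
work = @() sum(exp(-rand(1, 960)));

nRuns = 1000;        % number of repetitions (assumed value)
work();              % one warm-up call before timing starts

tStart = tic;        % start a single timer for all runs
for k = 1:nRuns
    work();
end
avgTime = toc(tStart) / nRuns;   % average time for one run

fprintf('Average time per run: %g seconds\n', avgTime);
```

Timing the loop as a whole keeps the timer calls out of each individual measurement, at the cost of including the small loop overhead in the average.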

I have chosen the latter option below -

% Input vectors
%% Large(r) dataset
x1 = rand(1, 100000);
x2 = rand(1, 100000);
%% Looped version
Y1 = 1.75:0.25:39;
Y2 = (10 .^ (Y1 / 21.366) - 1 ) / 0.004368;
EoutLoop = zeros(length(x2), length(Y1)); 
tic;
for i=1:length(Y1)
    p1 = 4.*Y2(i)./24.673.*(0.004368.*Y2(i) + 1).*ones(1, length(x2));
    p2 = 4.*Y2(i)./24.673.*(0.004368.*Y2(i) + 1) - 0.35.*(4.*Y2(i)./24.673.*(0.004368.*Y2(i) + 1)./4.*1000./24.673.*(0.004368.*1000 + 1)).*(x2 - 51);
    p3  = p1.*(x1 >= Y2(i)) + p2 .* (x1 < Y2(i)); 
    g1 = (x1 - Y2(i))./Y2(i); 
    g1 = abs(g1);
    EiLoop = (1 + p3.*min(g1,4)).*exp(-1.*p3.*min(g1,4)).* 10.^(x2./10);
    EoutLoop((1:length(x2)), i) = EiLoop(1:length(x2));
end
if (size(EoutLoop,1) > 1) 
    EoutLoop = sum(EoutLoop);
end
EoutLoop = 10 .* log(EoutLoop) ./ log(10);
% end timer
toc;
Elapsed time is 1.298153 seconds.

%% Vectorised version
% transpose input vector for vectorised version
x2 = x2.';
x1 = x1.';
% start timer
Y1 = 1.75:0.25:39;
Y2 = (10 .^ (Y1 / 21.366) - 1 ) / 0.004368;
EoutVec = zeros(length(x2), length(Y1)); 
tic;
p1 = 4.*Y2./24.673.*(0.004368.*Y2 + 1).*ones(length(x2), length(Y2));
p2 = 4.*Y2./24.673.*(0.004368.*Y2 + 1) - 0.35.*(4.*Y2./24.673.*(0.004368.*Y2 + 1)./4.*1000./24.673.*(0.004368.*1000 + 1)).*repmat((x2 - 51), 1, length(Y2));
p3  = p1.*(x1 >= repmat(Y2, 1, size(x1, 2))) + p2 .* (x1 < repmat(Y2, 1, size(x1, 2))); 
g1 = ((x1 - repmat(Y2, 1, size(x1, 2)))./repmat(Y2, 1, size(x1, 2))); 
g1 = abs(g1);
EVec = (1 + p3.*min(g1,4)).*exp(-1.*p3.*min(g1,4)).*repmat(10.^(x2./10), 1, length(Y2));
EoutVec((1:length(x2)), :) = EVec(1:length(x2), :);
if (size(EoutVec,1) > 1) 
    EoutVec = sum(EoutVec);
end
EoutVec = 10.*log(EoutVec)./log(10);
% end timer
toc;
Elapsed time is 0.621410 seconds.

You can see that the time taken by the vectorized approach is less than half of the time taken by the for loop approach.

Comments: 4

Dyuman Joshi on 22 Nov 2023

Edited: Dyuman Joshi on 22 Nov 2023

Using timeit -

FYI - timeit() calls the specified function multiple times and returns the median of the time measurements.
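To make the idea of a median over repeated measurements concrete, here is a rough hand-written sketch. It is not how timeit is implemented internally, only an illustration of the statistic it reports; `work`, `nRuns` and `t` are names assumed for this example.

```matlab
% Hypothetical workload standing in for the function being timed
work = @() sum(sqrt(rand(1, 1e5)));

nRuns = 21;              % number of separate measurements (assumed value)
t = zeros(1, nRuns);     % one timing per run

work();                  % warm-up call, not recorded
for k = 1:nRuns
    tStart = tic;
    work();
    t(k) = toc(tStart);  % record the time of this run only
end

medianTime = median(t);  % less sensitive to occasional slow runs than the mean
fprintf('Median time per run: %g seconds\n', medianTime);
```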

% Input vectors
x1 = rand(1, 10000);
x2 = rand(1, 10000);
F1 = @(a, b) forLoop(a, b);
F2 = @(a, b) vectorized(a, b);
f1 = @() F1(x1, x2);
f2 = @() F2(x1, x2);
%Check whether the outputs are equal or not
isequal(f1(), f2())
ans = logical
   1
fprintf('Time taken by the for loop method is %f seconds', timeit(f1))
Time taken by the for loop method is 0.064619 seconds
fprintf('Time taken by the vectorized method is %f seconds', timeit(f2))
Time taken by the vectorized method is 0.021920 seconds

As you can see from the results here, the vectorized method is roughly 3x faster than the for loop method.

%% Function definitions
function EoutLoop = forLoop(x1, x2)
    Y1 = 1.75:0.25:39;
    Y2 = (10 .^ (Y1 / 21.366) - 1 ) / 0.004368;
    EoutLoop = zeros(length(x2), length(Y1)); 
    for i=1:length(Y1)
        p1 = 4.*Y2(i)./24.673.*(0.004368.*Y2(i) + 1).*ones(1, length(x2));
        p2 = 4.*Y2(i)./24.673.*(0.004368.*Y2(i) + 1) - 0.35.*(4.*Y2(i)./24.673.*(0.004368.*Y2(i) + 1)./4.*1000./24.673.*(0.004368.*1000 + 1)).*(x2 - 51);
        p3  = p1.*(x1 >= Y2(i)) + p2 .* (x1 < Y2(i)); 
        g1 = (x1 - Y2(i))./Y2(i); 
        g1 = abs(g1);
        EiLoop = (1 + p3.*min(g1,4)).*exp(-1.*p3.*min(g1,4)).* 10.^(x2./10);
        EoutLoop((1:length(x2)), i) = EiLoop(1:length(x2));
    end
    if (size(EoutLoop,1) > 1) 
        EoutLoop = sum(EoutLoop);
    end
    EoutLoop = 10 .* log(EoutLoop) ./ log(10);
end
%Note that you don't need to preallocate for a vectorized approach
function EoutVec = vectorized(x1, x2)
    % transpose input vector for vectorised version
    x2 = x2.';
    x1 = x1.';
    
    Y1 = 1.75:0.25:39;
    Y2 = (10 .^ (Y1 / 21.366) - 1 ) / 0.004368;
    p1 = 4.*Y2./24.673.*(0.004368.*Y2 + 1).*ones(length(x2), length(Y2));
    p2 = 4.*Y2./24.673.*(0.004368.*Y2 + 1) - 0.35.*(4.*Y2./24.673.*(0.004368.*Y2 + 1)./4.*1000./24.673.*(0.004368.*1000 + 1)).*repmat((x2 - 51), 1, length(Y2));
    p3  = p1.*(x1 >= repmat(Y2, 1, size(x1, 2))) + p2 .* (x1 < repmat(Y2, 1, size(x1, 2))); 
    g1 = ((x1 - repmat(Y2, 1, size(x1, 2)))./repmat(Y2, 1, size(x1, 2))); 
    g1 = abs(g1);
    EVec = (1 + p3.*min(g1,4)).*exp(-1.*p3.*min(g1,4)).*repmat(10.^(x2./10), 1, length(Y2));
    EoutVec((1:length(x2)), :) = EVec(1:length(x2), :);
    if (size(EoutVec,1) > 1) 
        EoutVec = sum(EoutVec);
    end
    EoutVec = 10.*log(EoutVec)./log(10);
end
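On the question of whether the vectorised version could be made leaner: the repmat and ones calls above are not strictly required in releases that support implicit expansion (R2016b and later), and the repeated sub-expressions can be computed once. The following is only a sketch under that assumption, written as a script with the same random inputs as the question. The names `a`, `b`, `x1c`, `x2c`, `pg` and `EoutLean` are introduced here for illustration, and `log10` replaces `log(x)./log(10)`, so results may differ from the original in the last few bits. Whether it is actually faster on a given machine would need to be checked with timeit.

```matlab
% Example inputs, same shape as in the question
x1 = rand(1, 960);
x2 = rand(1, 960);

% Column vectors (N-by-1) so they expand against the 1-by-M row Y2
x1c = x1(:);
x2c = x2(:);

Y1 = 1.75:0.25:39;
Y2 = (10 .^ (Y1 / 21.366) - 1) / 0.004368;          % 1-by-M

% Terms that depend only on Y2, computed once as 1-by-M rows
a = 4 .* Y2 ./ 24.673 .* (0.004368 .* Y2 + 1);       % p1 for each column
b = 0.35 .* (a ./ 4 .* 1000 ./ 24.673 .* (0.004368 .* 1000 + 1));

% Implicit expansion builds the N-by-M arrays without repmat or ones
p2 = a - b .* (x2c - 51);
p3 = a .* (x1c >= Y2) + p2 .* (x1c < Y2);

g  = min(abs((x1c - Y2) ./ Y2), 4);                  % clipped |g1|
pg = p3 .* g;                                        % reused in both factors
E  = (1 + pg) .* exp(-pg) .* 10 .^ (x2c ./ 10);

% sum(E, 1) always sums over rows, so no size check is needed
EoutLean = 10 .* log10(sum(E, 1));
```

Computing `pg` once avoids evaluating `p3.*min(g1,4)` twice, and passing the dimension to `sum` removes the `size(...) > 1` branch.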

Michael on 22 Nov 2023

Thanks, I re-wrote the test code as you suggested, and averaged over 10000 runs the vectorised version took about 30% of the time taken by the loop (the input vectors are fixed length for the application).

% Input vectors
x1 = rand(1, 960);
x2 = rand(1, 960);
x1T = x1.';
x2T = x2.';
totalRuns = 10000;
%% Looped version
loopTime = 0;
for runs = 1:totalRuns
    % start timer
    tic
    Y1 = 1.75:0.25:39;
    Y2 = (10 .^ (Y1 / 21.366) - 1 ) / 0.004368;
    EoutLoop = zeros(length(x2), length(Y1)); 
    
    for i=1:length(Y1)
    
        p1 = 4.*Y2(i)./24.673.*(0.004368.*Y2(i) + 1).*ones(1, length(x2));
        p2 = 4.*Y2(i)./24.673.*(0.004368.*Y2(i) + 1) - 0.35.*(4.*Y2(i)./24.673.*(0.004368.*Y2(i) + 1)./4.*1000./24.673.*(0.004368.*1000 + 1)).*(x2 - 51);
        p3  = p1.*(x1 >= Y2(i)) + p2 .* (x1 < Y2(i)); 
        g1 = (x1 - Y2(i))./Y2(i); 
    
        g1 = abs(g1);
        EiLoop = (1 + p3.*min(g1,4)).*exp(-1.*p3.*min(g1,4)).* 10.^(x2./10);
    
        EoutLoop((1:length(x2)), i) = EiLoop(1:length(x2));
    end
    
    if (size(EoutLoop,1) > 1) 
        EoutLoop = sum(EoutLoop);
    end
    
    EoutLoop = 10 .* log(EoutLoop) ./ log(10);
    
    % append time
    loopTime = loopTime + toc;
end
loopTime = loopTime/totalRuns;
disp(num2str(loopTime))
%% Vectorised version
vecTime = 0;
for runs = 1:totalRuns
    % start timer
    tic
    Y1 = 1.75:0.25:39;
    Y2 = (10 .^ (Y1 / 21.366) - 1 ) / 0.004368;
    EoutVec = zeros(length(x2T), length(Y1)); 
    
    p1 = 4.*Y2./24.673.*(0.004368.*Y2 + 1).*ones(length(x2T), length(Y2));
    p2 = 4.*Y2./24.673.*(0.004368.*Y2 + 1) - 0.35.*(4.*Y2./24.673.*(0.004368.*Y2 + 1)./4.*1000./24.673.*(0.004368.*1000 + 1)).*repmat((x2T - 51), 1, length(Y2));
    p3  = p1.*(x1T >= repmat(Y2, 1, size(x1T, 2))) + p2 .* (x1T < repmat(Y2, 1, size(x1T, 2))); 
    g1 = ((x1T - repmat(Y2, 1, size(x1T, 2)))./repmat(Y2, 1, size(x1T, 2))); 
    
    g1 = abs(g1);
    EVec = (1 + p3.*min(g1,4)).*exp(-1.*p3.*min(g1,4)).*repmat(10.^(x2T./10), 1, length(Y2));
    
    EoutVec((1:length(x2T)), :) = EVec(1:length(x2T), :);
    
    
    if (size(EoutVec,1) > 1) 
        EoutVec = sum(EoutVec);
    end
    
    EoutVec = 10.*log(EoutVec)./log(10);
    vecTime = vecTime + toc;
end
vecTime = vecTime/totalRuns;
disp(num2str(vecTime))

Thanks for the info. I did use Profiler on the full programme, which is how I identified the bottleneck for further testing and optimisation!

Dyuman Joshi on 22 Nov 2023

You are welcome!

It's good to know that you are using the Profiler. It is an extremely helpful tool!
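For readers who want to find a bottleneck from a script rather than through the Profiler window, the following sketch collects profiling data programmatically and lists the entries with the largest total time. The workload is a placeholder, not the program discussed in this thread, and the number of entries that appear in the table depends on what the profiled code calls.

```matlab
% Start collecting profiling data
profile on

% Hypothetical workload standing in for the full program
for k = 1:200
    y = cumsum(sort(rand(1, 1e4)));
end

% Stop profiling and retrieve the collected statistics
profile off
info = profile('info');

% Sort the function table by total time, largest first
[~, order] = sort([info.FunctionTable.TotalTime], 'descend');
top = info.FunctionTable(order(1:min(5, numel(order))));

% Print up to five of the most expensive entries
for k = 1:numel(top)
    fprintf('%-40s %8.4f s  (%d calls)\n', ...
        top(k).FunctionName, top(k).TotalTime, top(k).NumCalls);
end
```

Calling `profile viewer` after `profile off` opens the same data in the interactive Profiler.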
