Double summation with vectorized loops
Hi. I want to vectorize this double for loop because it is the bottleneck in my code. Since MATLAB uses one-based indexing, I have to create an additional term for M = 0.
R, r, lambda, phi, mu_p are constants
Slm(L,M), Clm(L,M) are 70x70 matrices
Plm(L,M) is a 70x71 matrix
Cl(L), Pl(L) are 70x1 vectors
% function dU_r
s1 = 0;
for L = 2:70
    s1 = s1 + ((R/r)^L)*(L+1)*Pl(L)*Cl(L);
    for M = 1:L
        s1 = s1 + ((R/r)^L)*(L+1)*Plm(L,M)*(Clm(L,M)*cos(M*lambda) + ...
            Slm(L,M)*sin(M*lambda));
    end
end
dU_r = -(mu_p/(r^2))*s1;
% function dU_phi
s2 = 0;
for L = 2:70
    s2 = s2 + ((R/r)^L)*Plm(L,1)*Cl(L);
    for M = 1:L
        s2 = s2 + ((R/r)^L)*(Plm(L,M+1) - M*tan(phi)*Plm(L,M))*...
            (Clm(L,M)*cos(M*lambda) + Slm(L,M)*sin(M*lambda));
    end
end
dU_phi = (mu_p/r)*s2;
% function dU_lambda
s3 = 0;
for L = 2:70
    for M = 1:L
        s3 = s3 + ((R/r)^L)*M*Plm(L,M)*(Slm(L,M)*cos(M*lambda) - ...
            Clm(L,M)*sin(M*lambda));
    end
end
dU_lambda = (mu_p/r)*s3;
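For reference, the inner M-loop of dU_r can be vectorized one degree L at a time. This is an illustrative sketch using the same variable names as above, not tested against the poster's data:

```
% Sketch: vectorize the inner M-loop of dU_r (one L at a time).
s1 = 0;
for L = 2:70
    M = 1:L;                      % row vector of orders for this degree
    inner = sum(Plm(L,M) .* (Clm(L,M).*cos(M*lambda) + ...
                             Slm(L,M).*sin(M*lambda)));
    s1 = s1 + ((R/r)^L)*(L+1)*(Pl(L)*Cl(L) + inner);
end
dU_r = -(mu_p/(r^2))*s1;
```

The same pattern applies to the dU_phi and dU_lambda sums.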
3 Comments
Jan
31 May 2011
Plm(:, 71) is not used anywhere - is this correct?
Jan
31 May 2011
@Julian: Now you have expanded your code. Following my example you can calculate the three sums simultaneously.
The general rule is: move all repeated calculations out of the loop. E.g. for dU_phi
you calculate TAN(phi) 2484 times, although it is constant.
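As a sketch of that rule applied to dU_phi (same variables as in the question; untested), hoist tan(phi) and (R/r)^L out of the loops they do not depend on:

```
% Hoist loop-invariant quantities out of the loops.
tphi = tan(phi);                  % computed once instead of 2484 times
s2 = 0;
for L = 2:70
    RrL = (R/r)^L;                % constant within the inner loop
    s2 = s2 + RrL*Plm(L,1)*Cl(L);
    for M = 1:L
        s2 = s2 + RrL*(Plm(L,M+1) - M*tphi*Plm(L,M)) * ...
             (Clm(L,M)*cos(M*lambda) + Slm(L,M)*sin(M*lambda));
    end
end
dU_phi = (mu_p/r)*s2;
```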
Accepted Answer
More Answers (2)
Matt Fig
31 May 2011
Here is a vectorized version, but I must say that I would probably have just gone with Jan's FOR loop. It seems you need to see a vectorization, even if it is not the most efficient solution. Here you go:
V = ((R/r).^(2:70));
s = repmat(1:70,69,1);
s = sum(V(1:69).*(3:71).*sum(tril(Plm(2:70,1:70).*...
(Clm(2:70,1:70).*cos(s*lambda)+Slm(2:70,1:70).*...
sin(s*lambda)),1),2).'+V.*(3:71).*Pl(2:70).'.*Cl(2:70).');
11 Comments
Jan
31 May 2011
Timings (Matlab 2009a, single core), 100 repetitions:
Orig: 0.136 sec
Matt: 0.162 sec
Jan1: 0.635 sec (inner loop vectorized)
Jan2: 0.022 sec (simplified FOR loops, DOUBLE counters)
Jan3: 0.013 sec (simplified FOR loops, UINT8 counters)
Matt Fig
31 May 2011
@Julian: I get similar timing results to Jan, which is why I said I would have just gone with the optimized FOR loops...
You are not the first to fall for the myth of the infallible vectorization paradigm ;-) In fact, I went ahead and wrote the above vectorization to use as an example to help convince future victims!
Jan
1 June 2011
@Matt: Therefore I've voted for Julian's question: it is really a nice example.
I'm deeply surprised about the 40% speed gain if UINT8 or INT8 indices are used. And even less expected is the reduction to 25% when using UINT32, while INT32 leads to 40% also. I think integer indices might have big potential in a lot of functions. I'll start some tests, e.g. with the lame INTERP1.
Bjorn Gustavsson
1 June 2011
@Jan, just to get this clear and straight in my head:
Loop with UINT8, INT8 and INT32 -> 40% speed gain,
Loop with UINT32 -> 75% speed gain?
This means having to go through a lot of code to find where it's worth replacing flints with UINT32.
Jan
1 June 2011
@Bjorn: Nope. I meant "reduction *by* 25%", not "to 25%". Sorry.
"for i=1:j" with DOUBLEs: 100% runtime
"for i=uint8(1):uint8(j)": 60%
"for i=int16(1):int16(j)": 60%
"for i=int32(1):int32(j)": 60%
"for i=uint32(1):uint32(j)": 75% (slower!)
I'd assume that UINT32 indices are optimal on a 32-bit system. But there could be an implicit conversion to a DOUBLE, which is performed in the processor for INT32, but in software for UINT32.
I'm not sure if these measurements have any general meaning. Further investigations are needed.
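A minimal harness for reproducing the counter-type comparison might look like this (a sketch only; the numbers depend on the Matlab version, machine, and JIT state):

```
% Compare DOUBLE vs. INT32 loop counters on an empty loop body.
j = 1e5; reps = 100;
tic
for k = 1:reps
    for i = 1:j; end                 % DOUBLE counter
end
t_double = toc;
tic
for k = 1:reps
    for i = int32(1):int32(j); end   % INT32 counter
end
t_int32 = toc;
fprintf('double: %.3f s, int32: %.3f s\n', t_double, t_int32);
```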
Matt Fig
1 June 2011
@Jan, I thought we had discussed using the int family as indices before? Perhaps in an old NG thread... can't remember. But anyway, yes, it is true that using the int family as indices can often speed things up. The word of caution, which I know you know, but I mention for anyone reading this thread, is to watch out for overflow. For example:
A = zeros(100000,2);        % Pre-allocate
for ii = 1:int8(size(A,1))  % Will run faster than expected!!
    % Do something with A
end
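To make the trap concrete: Matlab integers saturate rather than wrap, so the cast silently shrinks the loop bound. A small illustration, assuming the snippet above:

```
% int8 saturates at intmax('int8') = 127, so
int8(size(A,1))        % yields 127, not 100000
% and the loop body above therefore executes only 127 times.
```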
Jan
1 June 2011
@Matt: Yes, we have discussed this before. But when I look into the source of Matlab's toolbox functions, e.g. MAT2CELL, it does not look like we have impressed the TMW developers very much - at least not until 2009.
I like the way you make Matlab run "faster than expected". Consistently avoiding work is real efficiency. :-)
The OP gets different timings because his Pl and Plm are sparse. But for the full Pl, and for a Plm that is more than 50% full, sparsity wastes time by impeding the JIT.
Bjorn Gustavsson
2 June 2011
@Jan, you got my hopes up with that 25%! (On the other hand maybe it was just as well that it was "by 25%" - saves me a whole lot of dreary rewriting. But with the speedup we're getting closer to the practices of "normal" programming - having to make sure that we use the optimal type for each variable...)