Matrix algebra very slow on GPU
I've been testing some of the MATLAB matrix routines on a Tesla K20 GPU. So far I've found that chol, lu, \, svd, and eig all run significantly slower on the GPU than on the CPU, even without including the time to transfer the data to the GPU. Is this a common experience? If not, what might I be doing wrong?
Comments: 7
Jill Reese
11 Nov 2013
What version of MATLAB are you using?
Matt J
11 Nov 2013
Are you doing the operations in double precision or single precision?
Bonnie
11 Nov 2013
Bonnie
11 Nov 2013
Matt J
11 Nov 2013
Are they slower than the CPU in single precision as well?
Bonnie
11 Nov 2013
Bonnie
11 Nov 2013
Answers (2)
Sean de Wolski
11 Nov 2013
Edited: Sean de Wolski
11 Nov 2013
How are you doing the timing?
If upgrading is an option, in R2013b, we released gputimeit which will give better measurements of GPU timing and of course a whole year's worth of other improvements:
And, as Jill asked: what exactly are you running?
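As a rough illustration of the timing approach suggested above, here is a minimal sketch assuming R2013b or later and a supported GPU. The matrix size and the choice of lu are arbitrary assumptions, not the original poster's test.

```matlab
% Sketch: time a factorization on the CPU and on the GPU.
% n = 2000 and the use of lu are assumptions for illustration only.
n  = 2000;
A  = rand(n);        % CPU copy
Ag = gpuArray(A);    % transfer before timing, so the copy is excluded

tCPU = timeit(@() lu(A));       % timeit calls the function several times
tGPU = gputimeit(@() lu(Ag));   % gputimeit also waits for the GPU to finish

fprintf('CPU: %.4f s, GPU: %.4f s\n', tCPU, tGPU);
```

On releases without gputimeit, a common pattern is to call wait(gpuDevice) just before toc, so that the measurement includes the time the GPU actually takes to finish the work.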
Comments: 18
Matt J
11 Nov 2013
The matrices you showed are not of compatible sizes. You should be getting errors.
Bonnie
11 Nov 2013
Matt J
11 Nov 2013
Can we see an example without the typo?
Edric Ellis
12 Nov 2013
Bonnie, have you seen this benchmark http://www.mathworks.co.uk/help/distcomp/examples/benchmarking-a-b-on-the-gpu.html ? This shows a series of backslash timings on a K20, and shows a decent speedup over the CPU. Note that this times the case of "matrix \ vector" rather than "vector \ matrix" that you have stated in your comment.
Sean de Wolski
12 Nov 2013
Hi Bonnie, can you put that code in a function file and run the function a couple of times? Timing at the command window is not very accurate.
I can confirm what Bonnie is seeing, on a GTX 580 with R2012b:
t1 =
0.2764
t2 =
0.1444
I did this in a function file running several times as Sean suggested.
However, when I run the equivalent operations below, the GPU does much better relative to the CPU. Also, both implementations perform better than their mldivide counterparts. So I'm tempted to think that gpuArray.mldivide simply wasn't optimized very well for matrices of these particular relative dimensions, perhaps because an mtimes-based implementation was felt to be the more likely choice for this case.
tic,
y = (L.'*X)/norm(L)^2;
t3 = toc
tic
y1 = (L1.'*X1)/norm(L1)^2;
t4 = toc
t3 =
0.0027
t4 =
0.0093
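To see why the two formulations agree, here is a small CPU-only sketch. It assumes L is a single real column and X holds many right-hand sides; the sizes are made up, since the original data was not posted.

```matlab
% Sketch: for a single-column L, backslash reduces to a projection.
% n and m are assumed sizes, not the original poster's.
rng(0);
n = 1000; m = 50;
L = rand(n, 1);    % system "matrix" with one column
X = rand(n, m);    % many right-hand sides

yBackslash = L \ X;                    % least-squares solve, 1-by-m
yProject   = (L.' * X) / norm(L)^2;    % same result using mtimes

relErr = norm(yBackslash - yProject) / norm(yBackslash);
```

The least-squares solution of L*y = X is (L.'*L) \ (L.'*X), and for a real column vector L.'*L is the scalar norm(L)^2, which is where the second expression comes from.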
Bonnie
13 Nov 2013
Matt J
13 Nov 2013
I think you mean the link that Sean posted, not Jill. As he mentioned, you would need to upgrade to R2013b.
Matt J
13 Nov 2013
It would help us if you used the code formatting button to format your code separately from your text.
Aside from that, maybe give us some detail about your CPU?
Bonnie
13 Nov 2013
So 12 cores? I think you have to normalize your t2 by the number of cores in some way to account for the advantage that a multi-core CPU gives you. There are machines with dozens of cores that a single GPU could never beat. Assuming your average CPU is dual core, that would mean a handicap factor of 6, bringing your speed-up ratio to around 2.8.
Pretty decent, I guess, compared to dual-core benchmarks. I doubt all gpuArray operations are expected to be faster than their CPU counterparts. You just don't want them to be slower than some average CPU.
Bonnie
13 Nov 2013
Joss Knight
27 Apr 2016
It might be worth answering this question for posterity.
It seems the questioner was testing at least the linear solves with a very unusual system: many right-hand sides but only one column in the system matrix. Since this is not a typical circumstance, MLDIVIDE is not optimised for it. To get an accurate answer it has to account for possible poor conditioning by using a QR factorisation, and this is less parallelisable than other approaches to solving these equations, one of which is given in the comments to Sean's answer. Another is to solve the normal equations:
% Solve A*X = B for X
R = chol(A'*A);
X = R\(R'\(A'*B));
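A quick way to check the normal-equations solve against backslash is sketched below on the CPU. The test matrices are assumptions, chosen to be tall and reasonably well conditioned.

```matlab
% Sketch: compare the normal-equations solve with backslash.
% Sizes and random data are assumptions for illustration.
rng(0);
A = rand(2000, 20);    % tall test matrix with full column rank
B = rand(2000, 5);     % several right-hand sides

R   = chol(A' * A);            % A'*A = R'*R with R upper triangular
Xne = R \ (R' \ (A' * B));     % two triangular solves
Xqr = A \ B;                   % QR-based least squares

relErr = norm(Xne - Xqr, 'fro') / norm(Xqr, 'fro');
```

Forming A'*A squares the condition number of the problem, so this shortcut trades accuracy for speed and is best kept for systems known to be well conditioned.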
For SVD and EIG it is possible that the same situation applies; perhaps the questioner was carrying out the SVD on a tall, skinny matrix. However, it is true that these functions do not parallelise well. I found that a 2000x2000 random matrix could be factored faster on my K20 than on the CPU, but the performance tails off for larger matrices, presumably due to resource contention on the device. It does make a difference whether you ask for all three factors or just the singular values (or, in the case of EIG, whether you ask for eigenvectors or just the eigenvalues).
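The effect of the number of requested outputs can be measured with the second argument of gputimeit, which sets how many outputs the timed function returns. This is a sketch assuming R2013b or later and a supported GPU; the 2000x2000 size follows the text, and the random data is an assumption.

```matlab
% Sketch: time svd on the GPU with one output versus three.
A = gpuArray(rand(2000));    % random test matrix, created on the CPU and copied

tValues  = gputimeit(@() svd(A), 1);    % singular values only
tFactors = gputimeit(@() svd(A), 3);    % U, S and V

fprintf('values only: %.3f s, all factors: %.3f s\n', tValues, tFactors);
```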
For LU on a general matrix and CHOL on a symmetric matrix I found my K20 was much faster than the CPU, so it would be necessary to see exactly what the questioner was doing when they were timing these functions.