How to maximize MATLAB's GPU utility?

I've surveyed my GPU's performance against itself and the CPU for varying matrix sizes, and found the opposite of what most GPU literature suggests: the GPU's computing advantage diminishes with array size. Code, results, and specs are shown below. Noteworthy observations:
(1) GPU utility remains sub-10%, according to Task Manager
(2) RAM usage is ~50% and CPU usage ~20% for large (K > 9000) arrays
(3) A considerable drop in speed ratio is observed for around K > 8000
(4) Splitting the Xga matrix into four for K > 8000 (K = 9000) increases vectorized speed two-fold
(5) My GPU ranks far higher among GPUs than my CPU (#24 vs. #174); it thus seems an on-par CPU would outperform the GPU for larger arrays
(6) Last pic's GPU vs. CPU benchmark supports (5); GPU isn't as vastly superior as expected
What's the culprit: is my code, MATLAB, or the hardware configuration under-utilizing the GPU? How can I find out and resolve it?
m-files: testrun.zip (testrun compares performance for a single K; testrun0 for multiple)
%% CODE: centroid indexing in K-means algorithm
% size(X) = [16000, 3]
% size(c) = [K, 3]
% Xsg = single(X); csg = single(c);
% Xga = gpuArray(Xsg); cga = gpuArray(csg);
% Speed ratio = t2/t1, if t2 > t1 - else, t1/t2
%% TIMING
f1 = fasterFunction(...); % e.g. vectorized(Xga, cga, K, m)
f2 = slowerFunction(...); % e.g. forVectorized(X, c, m)
t1 = gputimeit(f1) % OR timeit(f1) for non-GPU arrays
t2 = timeit(f2) % OR gputimeit(f2) for GPU arrays
%% FUNCTIONS
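function out = vectorized(X, c, K, m)
    % Distances from all m points to all K centroids at once (m-by-1-by-K),
    % reshaped to m-by-K; out(i) is the index of the nearest centroid
    [~, out] = min(reshape(permute(sum((X-permute(c,[3 2 1])).^2,2), ...
        [1 2 3]),m,K),[],2);
end
function out = forVectorized(X, c, m)
    % One point at a time against all centroids
    out = zeros(m,1);
    for j=1:m
        [~,out(j)] = min(sum(((X(j,:))'-c').^2));
    end
end
function out = forFor(X,c,K,m)
    % One point and one centroid at a time
    out = zeros(m,1); idxtemp = zeros(K,1);
    for i=1:m
        for j=1:K
            idxtemp(j) = sum((X(i,:)-c(j,:)).^2,2);
        end
        [~, out(i)] = min(idxtemp);
    end
end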
%% PLOTS
% GPU vectorized = vectorized(Xga, cga, K, m) for varying K, timed w/ gputimeit
% CPU vectorized = vectorized(Xsg, csg, K, m) for varying K, timed w/ timeit
% for-loop = forFor(Xsg, csg, K, m) for varying K, timed w/ timeit
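
For scale: with m = 16000 and K = 9000 in single precision, the m-by-3-by-K intermediate that vectorized() builds holds 16000*3*9000 elements, about 1.73 GB, while an m-by-K distance matrix is about 0.58 GB. One way to avoid the larger intermediate is to expand the squared distance as |x|^2 - 2*x*c' + |c|^2, so most of the work becomes one matrix product. This is a different technique from the code above, not a correction of it, and whether it helps on this hardware is untested. A minimal sketch, assuming small random stand-ins for X and c (idxExpanded and idxRef are illustrative names):

```matlab
% Hypothetical small inputs standing in for X (m-by-3) and c (K-by-3)
rng(0);
m = 200; K = 7;
X = rand(m, 3);
c = rand(K, 3);

% Squared distances via a matrix product: m-by-K, no m-by-3-by-K array
D = sum(X.^2, 2) - 2*(X*c.') + sum(c.^2, 2).';
[~, idxExpanded] = min(D, [], 2);

% Reference: the broadcasting form used by vectorized() above
[~, idxRef] = min(reshape(sum((X - permute(c, [3 2 1])).^2, 2), m, K), [], 2);

% Fraction of points assigned to the same centroid by both forms
agreement = mean(idxExpanded == idxRef);
```

To try it on the GPU, X and c would be replaced by Xga and cga; the expanded form can differ from the direct form by rounding, more so in single precision, so near-ties may resolve differently.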

5 Comments

Jan on 20 Mar 2019
Edited: Jan on 20 Mar 2019
It is hard to follow your descriptions. "GPU utility remains sub-10%", "My GPU ranks far higher among GPUs than my CPU (#24 vs. #174)", "Last pic's GPU vs. CPU benchmark supports (5)" - this might be clear to you, but it requires a lot of educated guessing from the readers. "f1 = fasterFunction(...)"? Please post running code. It is not clear which code creates which diagram. Most of all, I do not understand the actual question: "maximize MATLAB's GPU utility?"
What do you do? Which problem do you want to solve? What is your question?
idxtemp(j) = sum((X(i,:)-c(j,:)).^2,2);
The row-wise processing wastes time compared to column-wise processing on the CPU, because MATLAB stores arrays in column-major order. Transpose the inputs to avoid this.
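
The transposition suggested above could look like the following. It is only a sketch under assumptions: Xt and ct are hypothetical names for the transposed inputs, the data are small random stand-ins, and the row-wise loop is repeated only to check that both layouts give the same indices.

```matlab
% Hypothetical small inputs standing in for X (m-by-3) and c (K-by-3)
rng(0);
m = 300; K = 5;
X = rand(m, 3);
c = rand(K, 3);
Xt = X.';                 % 3-by-m: each point is now a contiguous column
ct = c.';                 % 3-by-K: each centroid is a contiguous column

% Column-wise variant of forFor
idxCols = zeros(m, 1); d = zeros(K, 1);
for i = 1:m
    xi = Xt(:, i);
    for j = 1:K
        d(j) = sum((xi - ct(:, j)).^2);
    end
    [~, idxCols(i)] = min(d);
end

% Row-wise original, for comparison
idxRows = zeros(m, 1);
for i = 1:m
    for j = 1:K
        d(j) = sum((X(i,:) - c(j,:)).^2, 2);
    end
    [~, idxRows(i)] = min(d);
end
```

With only three coordinates per point, the layout change may make little measurable difference.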
John Muradeli on 20 Mar 2019
@Jan -- Added examples to fasterFunction(...), changed function names a bit; as to the rest - the question is clear enough; those able to respond should understand.
Jan on 20 Mar 2019
@John: Thanks for clarifying the question a little bit.
"those able to respond should understand" - yes, of course, we agree here: they should. I've mentioned that, at least for me, "maximize MATLAB's GPU utility" is too vague to be answered. Why not decrease the number of readers who do not understand the question?
You've spent some time to produce the nice diagrams. If you post the complete code instead of letting the readers guess it based on some rough comments, the members of the forum can run it on their machines and maybe confirm your observations.
"the GPU's computing advantage diminishes with array size" - doesn't the last diagram "Single precision matrix-matrix multiply" tell the opposite?
John Muradeli on 20 Mar 2019
@Jan -- Unsure how columns/rows affect CPU computing, but - transposed per your suggestion, and interchanged (i,:) with (:,i) (same w/ j) - results: https://puu.sh/D2Lex/ea9c4d6189.png -- not a significant difference for the range of K's tested
John Muradeli on 20 Mar 2019
Edited: John Muradeli on 20 Mar 2019
@Jan: Very well, I'll clarify below; as for the complete code - there's a tradeoff between conciseness and thoroughness - too much of the latter tends to throw off readers the fastest. This said, would an m-file suffice? The code isn't brief.
"Maximize GPU Utility" - see (1), (2); that is, it seems that the majority of GPU resources aren't being utilized - and that there may be a way to utilize them. For example, dividing the workload evenly across the entire GPU - rather than have a few cores take all and most lie idle. I tried one method (see (4)); but strangely, for K <= 8000, computing time increases. Hence, I may be doing it wrong.
@"Doesn't the last diagram tell the opposite?" it's not so much GPU vs CPU as GPU vs GPU: performance slightly decreases after the peak (circled) - but not as much as in the plots above. I couldn't test for 1e9 due to an 'Out of Memory' error.
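
The thread does not show how the split in (4) was implemented. One plausible reading, sketched here under that assumption, is to process X in row blocks so that each intermediate is numel(rows)-by-3-by-K rather than m-by-3-by-K. The chunk count, the small random inputs, and the names idxChunked and idxFull are all illustrative.

```matlab
% Hypothetical small inputs standing in for X (m-by-3) and c (K-by-3)
rng(0);
m = 1000; K = 50;
X = rand(m, 3);
c = rand(K, 3);

nChunks = 4;                                  % assumed, as in observation (4)
edges = round(linspace(0, m, nChunks + 1));   % block boundaries over rows of X
idxChunked = zeros(m, 1);

for k = 1:nChunks
    rows = edges(k)+1 : edges(k+1);           % rows handled in this block
    % Same computation as vectorized(), restricted to one block of points
    d = reshape(sum((X(rows,:) - permute(c, [3 2 1])).^2, 2), numel(rows), K);
    [~, idxChunked(rows)] = min(d, [], 2);
end

% Unchunked reference
[~, idxFull] = min(reshape(sum((X - permute(c, [3 2 1])).^2, 2), m, K), [], 2);
```

Smaller blocks reduce peak memory but add per-call overhead, which might be one reason splitting helps for large K and hurts for K <= 8000; that explanation is a guess, not something measured here.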

Answers (0)

Release: R2018b

Asked: 20 Mar 2019

Edited: 20 Mar 2019
