Why is my GPU code faster with the profiler on in RTX GPUs?

Views: 5 (last 30 days)
Néstor on 29 November 2022
Commented: Joss Knight on 2 December 2022
I need to process large multidimensional arrays with a series of 1D convolutions, and I found it is faster to implement the convolution by hand in a for loop than to use conv, due to the very small kernel size. However, my code runs significantly faster when the profiler is on, on certain GPUs: it is consistently 1.5x to 2x faster on an Nvidia RTX 3080 or RTX 2070, while on an Nvidia A4500 or A5000 there is no significant difference. This matters because a single dataset can take hours to process.
This behavior is consistent across multiple computers, all running Linux (Ubuntu 22.04), and tested with R2021a and R2022a and with Nvidia driver versions 515 and 520. My question is: how can I make sure I get the "fast" performance without having to embed profile on and profile off around the relevant parts of my code? I have actually done this, and it directly improves performance when processing an entire dataset, but it is hacky and interferes with the expected use of the profiler in the rest of the code.
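For concreteness, the hack described above amounts to something like the following sketch, where processOneDataset is a hypothetical stand-in for the actual convolution code:

```matlab
% Hacky workaround: toggle the profiler around the hot section,
% since the code runs faster with profiling enabled on these GPUs.
% processOneDataset is a hypothetical stand-in for the real processing code.
profile('on')
result = processOneDataset(largeArray, convKernel);
profile('off')   % restore normal behavior for the rest of the code
```

The obvious downside is exactly the one noted above: any legitimate profiling session elsewhere in the code is disturbed by these embedded toggles.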
The MWE is below. I place the fastest run first to avoid confusion about the second instance potentially running faster due to the JIT or caching. I also clear the large variables between runs to avoid confusion about memory allocation, and I use the results to compute arrayMean to avoid confusion about the JIT optimizing (i.e., skipping operations) for unused results. Interestingly, these three concerns do not matter in practice, and the code runs consistently faster with the profiler on.
% Define common params
clear
convSize = 3;
largeArraySizes = [40, 40, 40, 5000] + [1, 1, 1, 0] * (2 * convSize + 1);

% Run with profiler on. First, preallocate and create variables.
largeArray = ones(largeArraySizes, 'single', 'gpuArray');
convKernel = ones(2 * convSize + 1, 1, 'single', 'gpuArray');
profile('on')
tic;
largeArrayConv = zeros(size(largeArray, 1), size(largeArray, 2), ...
    size(largeArray, 3) - 2 * convSize, size(largeArray, 4), 'like', largeArray);
% Convolve manually in a for loop
for thisShift = -convSize:convSize
    % Shifted index
    idx = convSize + (1:size(largeArray, 3) - 2 * convSize) + thisShift;
    % Sum over convolved index
    largeArrayConv = largeArrayConv + ...
        convKernel(convSize + 1 + thisShift) .* largeArray(:, :, idx, :, :, :) / (2 * convSize + 1);
end
largeArrayConv = gather(largeArrayConv);
timeProfOn = toc;
profile('off')
arrayMean = mean(largeArrayConv, 'all');
clear largeArray convKernel largeArrayConv arrayMean
fprintf('Proc time profiler ON: %g seconds.\n', timeProfOn)
% Run with profiler off. First, preallocate and create variables.
largeArray = ones(largeArraySizes, 'single', 'gpuArray');
convKernel = ones(2 * convSize + 1, 1, 'single', 'gpuArray');
profile('off')
tic;
largeArrayConv = zeros(size(largeArray, 1), size(largeArray, 2), ...
    size(largeArray, 3) - 2 * convSize, size(largeArray, 4), 'like', largeArray);
% Convolve manually in a for loop
for thisShift = -convSize:convSize
    % Shifted index
    idx = convSize + (1:size(largeArray, 3) - 2 * convSize) + thisShift;
    % Sum over convolved index
    largeArrayConv = largeArrayConv + ...
        convKernel(convSize + 1 + thisShift) .* largeArray(:, :, idx, :, :, :) / (2 * convSize + 1);
end
largeArrayConv = gather(largeArrayConv);
timeProfOff = toc;
arrayMean = mean(largeArrayConv, 'all');
clear largeArray convKernel largeArrayConv arrayMean
fprintf('Proc time profiler OFF: %g seconds.\n', timeProfOff)

Answers (1)

Joss Knight on 1 December 2022
This is due to an optimization which is not performing ideally under memory pressure. If you reduce the size of your input, you'll see the discrepancy only when you're near the limit of your GPU memory.
When PCT sees a series of element-wise operations like this it fuses them together so it can run a single kernel, as in
largeArrayConv = largeArrayConv + k1.*largeArray(idx1) + k2.*(largeArray(idx2)) + k3.*(largeArray(idx3)) ...
Unfortunately, this means that memory must be allocated for the intermediates, and when you're low on memory you end up with a lot of raw allocs and frees. When the profiler is on, this optimization is disabled so that the measurements make sense, and so you only ever need one temporary array allocation per loop iteration.
Of the various possible workarounds the easiest is probably just to add wait(gpuDevice) before the end of your for loop.
I agree that the optimization is misbehaving in this case and we'll take a look at how it might be improved.
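Applied to the question's MWE, the suggested workaround would look something like this (a sketch: the wait(gpuDevice) forces each iteration's queued operations to complete before the next begins, so the fused expression's intermediates do not pile up):

```matlab
for thisShift = -convSize:convSize
    idx = convSize + (1:size(largeArray, 3) - 2 * convSize) + thisShift;
    largeArrayConv = largeArrayConv + ...
        convKernel(convSize + 1 + thisShift) .* largeArray(:, :, idx, :, :, :) / (2 * convSize + 1);
    wait(gpuDevice)   % synchronize before starting the next iteration
end
```

Synchronizing every iteration adds some overhead of its own, so it is worth timing both variants on the affected GPUs.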
Comments: 2
Joss Knight on 2 December 2022
I'm surprised about that. This is how I adapted your code:
clear
gpu = gpuDevice();
convSize = 3;
largeArraySizes = [40, 40, 40, 5000] + [1, 1, 1, 0] * (2 * convSize + 1);
% Preallocate and create variables.
largeArray = ones(largeArraySizes, 'single', 'gpuArray');
convKernel = ones(2 * convSize + 1, 1, 'single');
largeArrayConv = zeros(size(largeArray, 1), size(largeArray, 2), ...
    size(largeArray, 3) - 2 * convSize, size(largeArray, 4), 'like', largeArray);
wait(gpu)
% Time with the profiler off, then on.
profile off
gputimeit(@() runConvolutionFull(convSize, largeArray, convKernel))
profile on
gputimeit(@() runConvolutionFull(convSize, largeArray, convKernel))
profile off

function largeArrayConv = runConvolutionFull(convSize, largeArray, convKernel)
largeArrayConv = 0;
for thisShift = -convSize:convSize
    % Shifted index
    idx = convSize + (1:size(largeArray, 3) - 2 * convSize) + thisShift;
    % Sum over convolved index
    k = convKernel(convSize + 1 + thisShift) / (2 * convSize + 1);
    largeArrayPiece = largeArray(:, :, idx, :, :, :);
    largeArrayConv = k .* largeArrayPiece + largeArrayConv;
    wait(gpuDevice)
end
end
This makes 100% sure we're only timing the things that are consistent between the two scenarios.

