Preallocation of Composites using spmd
I'm seeing dramatically non-linear execution times with the test code below, where I allocate a different GPU for up to 4 spmd workers. (Yes, I do have the hardware.) I then do some work on each worker and time it for 10 trials.
Note the clear line within the trial loop but outside the spmd loop.
If that clear is included, the trial_time values make sense. If that clear is not included, the trial_time values do not make sense.
As an example, when N_gpus = 2, running with the clear produces trial_time values in a narrow range, 0.0938 to 0.1111, but without the clear I get 0.3884 0.0915 6.4601 15.2599 15.2746 15.2792 15.2892 15.2900.
I'm left pondering that if this were not spmd code I would find a way to preallocate the data, but I'm not sure how to do that with Composites in this case.
Ideas and explanations are welcome.
for N_gpus = 1:4
    poolobj = gcp('nocreate'); % If no pool, do not create a new one.
    if isempty(poolobj)
        poolobj = parpool( N_gpus );
    end
    poolsize = poolobj.NumWorkers;
    for trial = 1:10
        spmd( N_gpus )
            g = gpuDevice();
        end
        tic
        spmd( N_gpus )
            for m = 1:50
                A = rand(5000,5000,'gpuArray');
                B = rand(5000,5000,'gpuArray');
                C = A * B;
                max_C = max(C);
            end
        end
        clear A B C; %% THIS IS THE INTERESTING LINE
        trial_time(trial) = toc;
    end
    tt = mean(trial_time(1:10));
    fprintf( 'N=%d time=%6.3f \n', N_gpus, tt );
    poolobj = gcp( 'nocreate' );
    delete( poolobj );
end
Accepted Answer
Joss Knight
19 May 2017
It seems like this is just an issue of timing and synchronisation. You can see this by adding a call to wait(g) at the end of your spmd block, which will eliminate the dependency on use of clear.
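As a sketch of that modification (reusing the g obtained in the earlier spmd block, which persists on the workers), the timed section of the original code would become:

```matlab
tic
spmd( N_gpus )
    for m = 1:50
        A = rand(5000,5000,'gpuArray');
        B = rand(5000,5000,'gpuArray');
        C = A * B;
        max_C = max(C);
    end
    wait(g);  % block until every kernel queued on this worker's GPU has finished
end
trial_time(trial) = toc;  % now measures completed work, with or without the clear
```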
Basically, if you don't call clear, then the first call to rand in each trial doesn't have enough pooled memory, so it has to do a raw allocation. As it turns out, it could have freed up the memory currently being used by A; but it doesn't know the assignment isn't going to error, so it has to create a copy first, in case A needs to be left unchanged (this wouldn't be true if your entire script were inside a function, since A doesn't have to be preserved if there's an error).
When you do a raw allocation, the device has to be synchronised. But when you do call clear the memory for A, B and C is returned to the pool and so the next time no raw allocation is needed. So no synchronisation happens. So the loop happily continues, queuing up 300 or so kernels and then exiting the spmd block and recording the time on the client, long before any of those kernels have actually finished.
So when you don't call clear you're usually getting the actual time of the previous trial, and when you do you're recording completely the wrong time, since the computations haven't finished yet.
Depending on the GPU memory, how much is needed, how much is available when the code is called, how much is already pooled due to earlier operations (MATLAB by default pools memory up to a quarter of device memory), and whether or not you're inside a function, your timing will give different results. Your best bet for getting realistic timings is to use gputimeit, or if you must, use tic and toc in conjunction with wait. However, the pool will always create confusion here because you don't necessarily know when raw allocations (which are costly even ignoring synchronisation) are going to happen.
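For instance, a gputimeit-based measurement of one iteration of the inner loop might look like the following (a sketch only; gputimeit takes a zero-argument function handle, runs it several times, and synchronises the device around the measurement):

```matlab
spmd( N_gpus )
    g = gpuDevice();
    % Time one multiply-and-reduce, including device synchronisation
    f = @() max( rand(5000,5000,'gpuArray') * rand(5000,5000,'gpuArray') );
    t = gputimeit( f );
end
perWorkerTimes = [t{:}];  % gather each worker's time from the Composite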
4 comments
Joss Knight
24 May 2017 (edited 24 May 2017)
By the way, the values in max_C are fine. Asynchronous execution NEVER means you get wrong answers. If you ever ask to see, copy, or operate on the results of an operation, it will ensure that operation is finished before doing that (e.g. it won't display max_C without finishing computing max_C).