Preallocation of Composites using spmd
I'm seeing dramatically non-linear execution times with the test code below, where I allocate a different GPU for up to 4 spmd workers. (Yes, I do have the hardware.) I then do some work on each worker and time it for 10 trials.
Note the clear line within the trial loop but outside the spmd loop.
If that clear is included, the trial_time values make sense. If that clear is not included, the trial_time values do not make sense.
As an example, when N_gpus = 2, running with the clear produces trial_time values in a narrow range, 0.0938 to 0.1111, but without the clear I get 0.3884 0.0915 6.4601 15.2599 15.2746 15.2792 15.2892 15.2900.
I'm left pondering that if this were not spmd code I would find a way to preallocate the data, but I'm not sure how to do that with Composites in this case.
Ideas and explanations are welcome.
for N_gpus = 1:4
    poolobj = gcp('nocreate'); % If no pool, do not create a new one.
    if isempty(poolobj)
        poolobj = parpool( N_gpus );
    end
    poolsize = poolobj.NumWorkers;
    for trial = 1:10
        spmd( N_gpus )
            g = gpuDevice();
        end
        tic
        spmd( N_gpus )
            for m = 1:50
                A = rand(5000,5000,'gpuArray');
                B = rand(5000,5000,'gpuArray');
                C = A * B;
                max_C = max(C);
            end
        end
        clear A B C; %% THIS IS THE INTERESTING LINE
        trial_time(trial) = toc;
    end
    tt = mean(trial_time(1:10));
    fprintf( 'N=%d time=%6.3f \n', N_gpus, tt );
    poolobj = gcp( 'nocreate' );
    delete( poolobj );
end
Accepted Answer
Joss Knight
19 May 2017
It seems like this is just an issue of timing and synchronisation. You can see this by adding a call to wait(g) at the end of your spmd block, which will eliminate the dependency on use of clear.
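As a sketch of that modification (reusing the g obtained in the earlier spmd block, which persists on the workers), the timed section of the original code would become:

```matlab
tic
spmd( N_gpus )
    for m = 1:50
        A = rand(5000,5000,'gpuArray');
        B = rand(5000,5000,'gpuArray');
        C = A * B;
        max_C = max(C);
    end
    wait(g);  % block until every kernel queued on this worker's GPU has finished
end
trial_time(trial) = toc;  % now measures completed work, with or without the clear
```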
Basically, if you don't call clear, then the first call to rand in each trial doesn't have enough pooled memory, so it has to do a raw allocation. As it turns out, it could have freed up the memory currently being used by A; but it doesn't know the assignment isn't going to error, so it has to create a copy first, in case A needs to be left unchanged (this wouldn't be true if your entire script were inside a function, since A doesn't have to be preserved if there's an error).
When you do a raw allocation, the device has to be synchronised. But when you do call clear the memory for A, B and C is returned to the pool and so the next time no raw allocation is needed. So no synchronisation happens. So the loop happily continues, queuing up 300 or so kernels and then exiting the spmd block and recording the time on the client, long before any of those kernels have actually finished.
So when you don't call clear you're usually getting the actual time of the previous trial, and when you do you're recording completely the wrong time, since the computations haven't finished yet.
Depending on the GPU memory, how much is needed, how much is available when the code is called, how much is already pooled due to earlier operations (MATLAB by default pools memory up to a quarter of device memory), and whether or not you're inside a function, your timing will give different results. Your best bet for getting realistic timings is to use gputimeit, or if you must, use tic and toc in conjunction with wait. However, the pool will always create confusion here because you don't necessarily know when raw allocations (which are costly even ignoring synchronisation) are going to happen.
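For instance, a gputimeit-based measurement of one iteration of the inner loop might look like the following (a sketch only; gputimeit takes a zero-argument function handle, runs it several times, and synchronises the device around the measurement):

```matlab
spmd( N_gpus )
    g = gpuDevice();
    % Time one multiply-and-reduce, including device synchronisation
    f = @() max( rand(5000,5000,'gpuArray') * rand(5000,5000,'gpuArray') );
    t = gputimeit( f );
end
perWorkerTimes = [t{:}];  % gather each worker's time from the Composite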
4 comments
Joss Knight
24 May 2017 (edited 24 May 2017)
By the way, the values in max_C are fine. Asynchronous execution NEVER means you get wrong answers. If you ever ask to see, copy, or operate on the results of an operation, it will ensure that operation is finished before doing that (e.g. it won't display max_C without finishing computing max_C).