race condition at asynchronous task assignment during runtime

Question

Dear Matlab community,

I am trying to solve a problem that I expected to be easy, but it is proving me wrong. Basically, I am trying to eliminate a race condition that arises when using the Parallel Computing Toolbox.

Imagine I have N tasks and G GPUs. Each task takes a different amount of time to complete, and this time can only be determined at runtime. Thus, I don't want to preassign N/G tasks to each GPU, as this would lead to an unbalanced workload distribution. Instead, I wish to launch an spmd block with G labs (each one controlling a GPU) so that, whenever a lab finishes its assigned task, a new one is assigned to it at runtime, until all N tasks have been finished.

The problem with this concept is that each lab needs to know which tasks have already been assigned to other labs.

In one of the approaches I've tested, I create a file that stores the number of the last assigned task. Each lab reads this file, increases the task number by one and updates the file. Here is the code (for N=3 tasks and G=2 labs):

numberTasks = 3;
lastAssignedTask = 0;
file = 'lastAssignedTask.txt';
dlmwrite(file,lastAssignedTask);
spmd
    
    while lastAssignedTask<numberTasks 
    
        
        lastAssignedTask = dlmread(file);
        fprintf('Lab:%d  read that the last assigned task was %d \n',labindex,lastAssignedTask);
        
        taskForThisLab = lastAssignedTask+1;
        lastAssignedTask = taskForThisLab;
        dlmwrite(file,lastAssignedTask);
                
        fprintf('task: %d Lab:%d \n',lastAssignedTask,labindex);
    end
    
end

However, the output is

Lab 1: 
  Lab:1  read that the last assigned task was 0 
  task: 1 Lab:1 
  Lab:1  read that the last assigned task was 1 
  task: 2 Lab:1 
  Lab:1  read that the last assigned task was 2 
  task: 3 Lab:1 
Lab 2: 
  Lab:2  read that the last assigned task was 0 
  task: 1 Lab:2 
  Lab:2  read that the last assigned task was 1 
  task: 2 Lab:2 
  Lab:2  read that the last assigned task was 2 
  task: 3 Lab:2 

It looks like each worker reads the file at the same time. I have tried several workarounds (see the tags), but none seems to help with this problem. Is there something very obvious that I am overlooking?
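
The output above is consistent with a read-modify-write race: reading the counter and writing it back are two separate file operations, so nothing prevents both labs from reading the same value before either has written the incremented one. The serial sketch below replays that interleaving deterministically. The in-memory variable counter stands in for lastAssignedTask.txt, and the lockstep ordering of the two labs is an assumption chosen to match the output, not something that was measured.

```matlab
% Serial replay of the suspected interleaving for G = 2 labs and N = 3 tasks.
% 'counter' is a stand-in for the contents of lastAssignedTask.txt.
numberTasks = 3;
numberLabs  = 2;
counter     = 0;
claimed     = zeros(numberLabs, numberTasks);  % claimed(lab,k): k-th task taken by that lab

for k = 1:numberTasks
    % Step 1: every lab reads the counter before any lab writes it back
    valueRead = zeros(numberLabs, 1);
    for lab = 1:numberLabs
        valueRead(lab) = counter;
    end
    % Step 2: every lab increments its private copy and writes it back
    for lab = 1:numberLabs
        claimed(lab, k) = valueRead(lab) + 1;
        counter         = claimed(lab, k);
    end
end

disp(claimed)   % one row per lab; both rows contain the tasks 1 2 3
```

Both rows are identical, so every task is claimed twice, which is what the spmd output shows.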

Comments: 4 (the 2 earliest comments are not shown)

Arabarra, 29 December 2020

Thanks for your time Walter,

In principle, the tasks are totally independent. When a new task starts, it does not affect the running tasks.

I'm looking into the idea of using an extra lab; this makes sense. But how do I tell the coordinator which lab it has to send the status information to? I mean, I can write something like this for two labs (one coordinating, one doing the job):

numberTasks = 5;
lastAssignedTask = 1;
spmd
    
    if labindex==1
        % coordinator
        
        while lastAssignedTask<numberTasks
            
            labSend(lastAssignedTask,2);
            lastAssignedTask = labReceive('any'); % execution in the coordinator is blocked
            fprintf('Coordinator task: %d Lab:%d \n',lastAssignedTask,labindex);
        end
        
    else
        
         while lastAssignedTask<numberTasks
            lastAssignedTask = labReceive(1); % execution in the worker is blocked
            thisTask = lastAssignedTask;
            lastAssignedTask =lastAssignedTask+1;
            labSend(lastAssignedTask,1);
            fprintf('task: %d Lab:%d \n',thisTask,labindex);
         end
        
    end
    
end

and it works. But if I have 1 coordinator and 2 additional workers, I cannot replace "labSend(lastAssignedTask,2)" with "labSend(lastAssignedTask,'any');". The code needs to know which worker the variable lastAssignedTask is to be sent to, and this defeats the runtime assignment of tasks... or did I totally misunderstand your pointer?

Walter Roberson, 29 December 2020 (edited 29 December 2020)

You can find the number of labs with numlabs. Create a vector of task numbers T with one slot per lab, including one for the controller (lab 1), initialized to zeros. T(1) holds the number of the last task handed out; for K>1, T(K) holds the task that lab K is working on, and 0 means that lab K is idle.

Lab 1 (the controller), while T(1)<maxTasks:

Find the first entry of T after the first that is 0. If you find one at index K, set T(1)=T(1)+1 and T(K)=T(1), labSend() the value T(K) to lab K, and continue the zero check with the next K.

Once you reach the end of T, call labReceive asking for two outputs. The second output is the labindex of the sender. Look up that entry of T and it will tell you which task was being done (or just have the lab send the value), and do whatever you need with that fact. Now zero the T entry and return to the loop that checks for zeros.

When you reach maxTasks, instead of sending a task number to a slot marked 0, send a negative shutdown signal to that lab and mark the slot with -1. When all slots after the first are -1, shut down lab 1 too.

Labs 2 onwards:

Loop: labReceive. If the value is negative, leave the loop and shut the lab down. Otherwise it is a task number: do the task and then labSend something to lab 1 so it knows you have finished the task. End of loop (back to the labReceive).
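
The scheme above could look roughly like the following sketch. It is not tested on a GPU cluster; it assumes a parallel pool with at least 2 workers, maxTasks = 10 is a made-up value, and pause stands in for the real work. As far as I know, recent releases document spmdSend, spmdReceive, spmdIndex and spmdSize as the replacements for labSend, labReceive, labindex and numlabs.

```matlab
maxTasks = 10;                       % hypothetical number of tasks
spmd
    if labindex == 1
        % Controller. T(1) counts the tasks handed out so far; for K > 1,
        % T(K) is the task lab K is running (0 = idle, -1 = shut down).
        T = zeros(1, numlabs);
        while any(T(2:end) ~= -1)
            for K = find(T(2:end) == 0) + 1
                if T(1) < maxTasks
                    T(1) = T(1) + 1;
                    T(K) = T(1);
                    labSend(T(K), K);      % assign the next task to lab K
                else
                    labSend(-1, K);        % no work left: tell lab K to stop
                    T(K) = -1;
                end
            end
            if any(T(2:end) > 0)
                % Block until any busy lab reports; the second output is its labindex
                [~, K] = labReceive;
                fprintf('task %d finished on lab %d\n', T(K), K);
                T(K) = 0;                  % lab K is idle again
            end
        end
    else
        while true
            task = labReceive(1);          % block until the controller sends something
            if task < 0
                break                      % shutdown signal
            end
            pause(0.1 * rand);             % placeholder for the real work
            labSend(task, 1);              % report completion to the controller
        end
    end
end
```

Only lab 1 ever touches T, so no two labs can claim the same task, and a lab is shut down only once it is idle and no unassigned task remains.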

Arabarra, 30 December 2020 (edited 30 December 2020)

I see... that's an elegant solution, thanks for pointing it out to me. I attach my version of the code below (with all the fprintf calls that I needed to debug it until it worked!), in case it might help others.

I hope future MATLAB releases will offer some native tools to handle race conditions in a simpler manner.

numberTasks = 50;
numberLabs  = 13;   % one lab more than GPUs, last one is the controller
controller  = numberLabs;
% taskInLab(k) being zero means that the k-th lab is free
taskInLab = zeros(numberLabs-1,1);
% just for reference: number of tasks completed by each lab
finishedJobsInLab = zeros(numberLabs-1,1);
lastAssignedTask = 0;
completedTasks = 0;
spmd
    if labindex == controller
        while completedTasks<numberTasks
            % hands the next unassigned task to every free lab
            freeLabs = find(taskInLab==0);
            for i = 1:length(freeLabs)
                if lastAssignedTask>=numberTasks
                    break;   % nothing left to assign
                end
                K = freeLabs(i);
                lastAssignedTask = lastAssignedTask+1;
                taskInLab(K)     = lastAssignedTask;
                fprintf(' <- [coordinator] about to send task:%d to lab:%d \n',...
                    lastAssignedTask,K);
                labSend(lastAssignedTask,K);
                fprintf(' >- [coordinator] task:%d goes to lab:%d \n',...
                    lastAssignedTask,K);
            end
            % waits till ANY lab reports finishing
            fprintf('  [coordinator] Waiting for some job to finish, %d completed in total \n',...
                completedTasks);
            passedCell      = labReceive('any');
            labThatFinished = passedCell{1};
            fprintf('  [coordinator] finished task %d in labindex %d \n',...
                passedCell{2},labThatFinished);
            finishedJobsInLab(labThatFinished) = finishedJobsInLab(labThatFinished)+1;
            taskInLab(labThatFinished) = 0; % marks the lab as free
            completedTasks = sum(finishedJobsInLab);
        end
        % finished; all is good: tell every worker to stop
        fprintf('  [coordinator]  %d completed in total \n',completedTasks);
        for i = 1:(numberLabs-1)
            labSend(-1,i);
        end
    elseif labindex<controller
        % loops until the shutdown signal arrives, so that no message is left unread
        while true
            thisTask = labReceive(controller);
            if thisTask<0
                % this lab has finished
                fprintf(' * [worker %d] finished \n',labindex);
                break
            end
            fprintf(' (-[worker %d] task: %d  \n',labindex,thisTask);
            pause(0.1);   % placeholder for the actual work
            % reports finishing
            passedInfo = {labindex,thisTask};
            labSend(passedInfo,controller);
            fprintf('  -)[worker %d] task: %d  [finish signal sent] \n',labindex,thisTask);
        end
    else
        % do nothing, labindex is bigger than controller
    end
end

D.

Answer 1

Walter Roberson, 30 December 2020

Summarizing:

To prevent race conditions, instead of having each lab pick its own work independently, you can use an extra lab as a controller that manages the task assignments. The labs wait for work with a labReceive() and notify the controller that they are done with a labSend(). The controller assigns work to any lab that does not have work, and then does a labReceive() to wait for a response. When there is no more work, the controller signals shutdown.
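
The effect of handing each new task to whichever lab is idle can be checked without the toolbox. The following serial sketch simulates that policy and compares it with a fixed round-robin pre-assignment. The task durations and the number of workers are made-up values, and the cost of the messages to and from the controller is ignored, so this illustrates the idea rather than predicting real timings.

```matlab
% Serial, toolbox-free simulation of the controller scheme.
% All numbers below are invented for illustration.
durations  = [5 1 1 1 4 2 3 1];     % hypothetical run time of each task
numWorkers = 3;                     % hypothetical number of GPU labs

numTasks   = numel(durations);
busyUntil  = zeros(1, numWorkers);  % simulated time at which each worker becomes idle
assignedTo = zeros(1, numTasks);    % worker that ran each task

for task = 1:numTasks
    % The worker that becomes idle first receives the next task
    [tFree, w]       = min(busyUntil);
    assignedTo(task) = w;
    busyUntil(w)     = tFree + durations(task);
end
dynamicMakespan = max(busyUntil);

% Static pre-assignment for comparison: task k goes to worker mod(k-1,numWorkers)+1
staticLoad = zeros(1, numWorkers);
for task = 1:numTasks
    w = mod(task-1, numWorkers) + 1;
    staticLoad(w) = staticLoad(w) + durations(task);
end
staticMakespan = max(staticLoad);

fprintf('dynamic: %g, static: %g\n', dynamicMakespan, staticMakespan);
```

For these particular durations the dynamic assignment finishes at time 7 and the round-robin pre-assignment at time 9. Other durations can give a smaller or no difference.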
