Move from parfor to parfeval?

7 views (last 30 days)
Martin Ryba on 25 Jan 2022
Commented: Martin Ryba on 6 Feb 2022
I have a large simulation that runs on an LSF cluster with Parallel Computing Toolbox support. Right now, the meat of the effort is in a parfor loop:
% Loop over cells of storm
celllist = find(~gotResult);
stormtmp = storm(celllist); % storm is array of handle-type classes
parfor i = 1:length(celllist)
    icell = celllist(i);
    fprintf('Cell %d...\n', icell);
    matfileObj = poolData.Value; %#ok<PFBNS> % per-worker matfile object
    ofac = zeros(sizevec,'single');
    ofac = stormtmp(i).obsMatrix(grid,ofac,geom,rparms);
    matfileObj.gotResult(1,icell) = true;
    matfileObj.testOut(1,icell) = {ofac};
end
poolData is a parallel.pool.Constant holding a matfile object that is specific to each worker. At the end of the loop, I consolidate the results, clear the temporary files, and go to the next iteration of the model. This gives me robustness against cluster crashes, which can be distressingly common when the total job takes a month (hence the check for gotResult at the start; I can have a partial result prior to a crash).

The primary annoyance with parfor is that the allocation of units to workers happens rigidly at the start. With 60+ workers on 5-10 computers in a shared facility, there is no guarantee that they take a similar amount of time to finish. I find the last 25% or more of the execution time of each iteration is spent waiting for a dwindling number of my workers to finish their assignments.

I've read the documentation for parfeval, and it seems to give me a way of managing each execution more carefully, but it's too convoluted for me to see how I get there. Any tips? It seems I would start with a find() to get the first N cells (N = number of workers) that need completing, then enter a while loop using afterEach() where I check for a valid result, get the next cell on the list, and assign it. Or maybe I can do it with one main while loop and just remove entries from celllist as they complete? Head is spinning...
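Here's a rough sketch of what I'm imagining (untested): keep NumWorkers cells in flight at once and hand out the next cell as soon as any worker finishes. doOneCell is just a placeholder name for the parfor body pulled into a function; the matfile/pool-constant bookkeeping is omitted here.
pool = gcp;
pending = celllist(:).';                 % cells still to be scheduled
futures = parallel.FevalFuture.empty;
cellsInFlight = [];
while ~isempty(pending) || ~isempty(futures)
    % Top up the set of outstanding evaluations
    while numel(futures) < pool.NumWorkers && ~isempty(pending)
        icell = pending(1);
        pending(1) = [];
        futures(end+1) = parfeval(@doOneCell, 0, icell); %#ok<AGROW>
        cellsInFlight(end+1) = icell;                    %#ok<AGROW>
    end
    % Block until the next evaluation completes, then retire it
    idx = fetchNext(futures);
    fprintf('Cell %d done\n', cellsInFlight(idx));
    futures(idx) = [];
    cellsInFlight(idx) = [];
end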
1 Comment
Jeff Miller on 25 Jan 2022
So the problem is that lots of workers sit idle while waiting for the slow ones to finish their assignments in this parfor loop? Instead of waiting here, you'd like to have them start on their assignments for the next parfor loop, i.e. the next "iteration of the model".
If that's right, maybe a simpler approach is to change this parfor loop so that it runs through all the cell/model combinations, something like this:
parfor i = 1:length(celllist)*NofModelIterations
For each i you'd need a little logic to work out which model and which cell you wanted, plus invoke the right model, but that might not be hard.
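The index bookkeeping might look roughly like this (sketch only; NofModelIterations and the model invocation are placeholders, not from your code):
nCells = length(celllist);
parfor i = 1:nCells*NofModelIterations
    cellIdx  = mod(i-1, nCells) + 1;       % which cell of the storm
    modelIdx = floor((i-1)/nCells) + 1;    % which model iteration
    icell = celllist(cellIdx);
    % ... invoke model iteration 'modelIdx' for cell 'icell' here ...
end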


Accepted Answer

Edric Ellis on 26 Jan 2022
To run using parfeval, you basically need to pull out the body of your parfor loop into a function, something like this:
celllist = find(~gotResult);
stormtmp = storm(celllist); % storm is array of handle-type classes
futures = parallel.FevalFuture.empty;
for i = 1:length(celllist)
    % Schedule computation for each index in celllist. Each individual
    % function evaluation is executed separately on the workers.
    futures(i) = parfeval(@oneComputation, 0, celllist(i), stormtmp(i), ...
        poolData, sizevec, grid, geom, rparms);
end
% You could simply wait for completion like this:
wait(futures)

function oneComputation(icell, stormEl, poolData, sizevec, grid, geom, rparms)
matfileObj = poolData.Value; % per-worker matfile from the Constant
ofac = zeros(sizevec,'single');
ofac = stormEl.obsMatrix(grid,ofac,geom,rparms);
matfileObj.gotResult(1,icell) = true;
matfileObj.testOut(1,icell) = {ofac};
end
Note I simply added a call to wait(futures) after scheduling the work - as I understand it, the worker results are all stored in the matfileObj.
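If you'd rather see progress as results come in than block on a single wait, one option (a sketch, reusing futures and celllist from above) is fetchNext, which blocks until the next unread future completes:
for k = 1:numel(futures)
    completedIdx = fetchNext(futures);   % index of the future that just finished
    fprintf('Cell %d done (%d of %d)\n', celllist(completedIdx), k, numel(futures));
end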
It might be worth taking this approach a step further. If each worker computation takes a "long" time, then you might be better off using batch jobs. This will share resources on your cluster better because you don't need to keep a parallel pool running, with possibly-idle workers. The API to batch is similar to the API to parfeval, and specifies a single function evaluation on a worker. One difference is that it doesn't support parallel.pool.Constant, so you'd need to build the matfileObj directly on the worker.
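A rough sketch of the batch-based variant might look like this (the cluster profile name 'myLSF', the result-file naming, and the extra inputs are assumptions for illustration, not from your code). Each job builds its own matfile object on the worker, since batch has no parallel.pool.Constant:
c = parcluster('myLSF');
jobs = cell(size(celllist));
for i = 1:length(celllist)
    jobs{i} = batch(c, @oneBatchComputation, 0, ...
        {celllist(i), stormtmp(i), sizevec, grid, geom, rparms});
end
for i = 1:numel(jobs)
    wait(jobs{i});   % block until every job finishes
end

function oneBatchComputation(icell, stormEl, sizevec, grid, geom, rparms)
% Build the matfile object directly on the worker
matfileObj = matfile(sprintf('cellResult_%04d.mat', icell), 'Writable', true);
ofac = zeros(sizevec,'single');
ofac = stormEl.obsMatrix(grid,ofac,geom,rparms);
matfileObj.gotResult = true;
matfileObj.ofac = ofac;
end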
13 Comments
Martin Ryba on 1 Feb 2022
OK, my admins informed me that the lmstat statistics don't correctly reflect Parallel Engine license states and pointed me to the right tool.
Martin Ryba on 6 Feb 2022
Hi Edric, thanks for the thoughtful and useful answers, but I think I uncovered a bug in how MATLAB invokes the LSF bsub command for the resulting job array. All independent jobs request resources with a numWorkers of 1, as shown in this line from submitIndSharedJob.m:
resourceArg = ThirdPartySchedulerUtils.generateResourceArgument( lsf.ResourceTemplate, 1, lsf.NumThreads );
However, a job array with a number of tasks is really a parallel job, and it should use the number-of-workers values (min, max) from the cluster definition. Right now, with bsub -n 1, the LSF scheduler either ignores or throws errors on the spanning instructions I try to give it, so I end up overloading one host and not spreading things out enough for good efficiency. I'm also working the IBM LSF side to see whether it's an issue on their end with job arrays, but I think the -n 1 isn't helping.
Your reactions/counter-arguments are of course greatly appreciated. If you want to take this to email, please use the address in my profile. I also know I don't want to disable job arrays or split this into numerous separate jobs; each job generates a 180 MB JobXX.in.mat file.


More Answers (0)
