Handling errors in parfeval processes

Mark Brandon on 2 Apr 2023
Commented: Mark Brandon on 12 Apr 2023
I am running a conventional parallel computing arrangement with a client and a number of workers. The client distributes jobs to the workers using parfeval, and then retrieves solutions using fetchNext.
In rare instances, a worker process will fail, usually due to a computation that consumes too much memory. I am not able to fully inspect this failure, nor am I able to construct a simple example of it. I do observe that the solution from the failed process is missing from my output log, and that the remaining jobs continue to be processed successfully.
I have yet to find any documentation about how MATLAB handles process failures associated with parfeval. Nor have I found a listing of the error messages that can be reported in the Futures object (i.e., futures.Error.message).
At present, I am thinking about the following questions:
  1. Is the output argument in the Futures object for the failed job set to a specific value?
  2. Does the worker with the failed process continue to operate as part of the parpool, or is it compromised by the failure?
  3. Does the error message in the Futures object for the failed job provide information about a memory failure?
Best, Mark
  Comments: 1
Bruno Luong on 2 Apr 2023
+1 for the question.
I find that the way MATLAB handles errors in parallel computing is not very convenient for debugging.


Accepted Answer

Walter Roberson on 2 Apr 2023
You can potentially use try/catch to control errors on the workers.
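As a rough illustration of the try/catch idea, here is a minimal sketch of a wrapper that runs on the worker and returns any caught error as data instead of letting it reach the future. The name safeTask and the fields of the result struct are hypothetical, not part of any MATLAB API. Note that a wrapper like this can only catch errors raised inside MATLAB; it cannot help if the operating system kills the worker process outright.

```matlab
function result = safeTask(taskFcn, varargin)
% SAFETASK Hypothetical wrapper (save as safeTask.m on the MATLAB path).
% Runs taskFcn(varargin{:}) and reports success or failure in a struct,
% so that a MATLAB-level error on the worker becomes ordinary output.
result = struct('ok', true, 'value', [], 'identifier', '', 'message', '');
try
    result.value = taskFcn(varargin{:});
catch err
    % Record what went wrong instead of rethrowing the error.
    result.ok = false;
    result.identifier = err.identifier;
    result.message = err.message;
end
end
```

A call such as `f = parfeval(@safeTask, 1, @myComputation, x)` (where myComputation and x stand in for your own function and input) would then always yield one output, and the client can check `result.ok` before logging the solution.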
If there is an error then the hidden property OutputArguments of the future will be {} -- same as if there had been no outputs in normal circumstances.
There is an Error property for future objects. Once the State is 'finished (unread)', the Error property will be empty if there was no execution error. If the Error property is non-empty, it will have a field remotecause that contains an exception object.
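To make the Error check concrete, here is a small sketch that submits three futures, one of which fails on purpose, and then inspects each future's Error property. The use of chol with a non-positive-definite matrix is only a convenient way to provoke a reproducible error; it says nothing about how an out-of-memory failure would be reported. The sketch also assumes that wait returns normally when a future has failed.

```matlab
% Illustrative only: chol errors on a matrix that is not positive definite,
% which gives a reproducible failing future.
pool = gcp();                                  % start or reuse a parallel pool
inputs = {eye(2), [1 2; 2 -4], 4*eye(2)};      % the second input makes chol fail
n = numel(inputs);
futures(1:n) = parallel.FevalFuture;           % preallocate the future array
for k = 1:n
    futures(k) = parfeval(pool, @chol, 1, inputs{k});
end

wait(futures);                                 % block until every future has finished
for k = 1:n
    err = futures(k).Error;
    if isempty(err)
        fprintf('Future %d finished without error.\n', k);
    else
        % identifier and message are standard MException properties
        fprintf('Future %d failed: [%s] %s\n', k, err.identifier, err.message);
    end
end
```

Checking Error before calling fetchOutputs lets the client log a failed job explicitly rather than finding its solution missing afterwards.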
The worker itself will have recovery operations done on it automatically. It will not, however, clean up all state, so if you assigned a bunch of large variables then they might still exist in the workers.
  Comments: 19
Walter Roberson on 12 Apr 2023
I have never personally had to worry about crashing futures, but I can see that it could be useful in general to have some kind of configurable retry limit on any technology that automatically retries on failure. I would imagine that the most commonly used values would be 0 (no retries), 1 (one retry), inf (keep retrying), but I can imagine that in some cases people might want (for example) 5 or 10 retries.
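For what it is worth, a retry limit of that kind can be approximated on the client side. The sketch below is only an assumption about how one might do it, not a built-in feature: maxRetries, todo, and batch are hypothetical names, and chol with a non-positive-definite input again stands in for a task that fails every time.

```matlab
pool = gcp();                            % start or reuse a parallel pool
inputs = {eye(2), [1 2; 2 -4], 4*eye(2)};
maxRetries = 2;                          % hypothetical cap: 0 means no retries
results = cell(1, numel(inputs));
todo = 1:numel(inputs);                  % indices that still lack a result

for attempt = 0:maxRetries
    if isempty(todo)
        break
    end
    clear batch
    for j = numel(todo):-1:1             % backwards loop sizes batch on first assignment
        batch(j) = parfeval(pool, @chol, 1, inputs{todo(j)});
    end
    wait(batch);                         % assumes wait does not throw for failed futures
    failedNow = false(1, numel(todo));
    for j = 1:numel(todo)
        if isempty(batch(j).Error)
            results{todo(j)} = fetchOutputs(batch(j));
        else
            failedNow(j) = true;         % keep this index for the next attempt
        end
    end
    todo = todo(failedNow);
end
fprintf('%d task(s) still failing after %d retries.\n', numel(todo), maxRetries);
```

Because the second input fails deterministically, it is resubmitted until the cap is reached and then left in todo, which is the behaviour a finite retry limit is meant to give.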
Mark Brandon on 12 Apr 2023
@Sam Marshalik @Walter Roberson, thanks for the comments. I can confirm Sam's description of how parfeval works on a local system. I set up a parpool on a single node of an HPC system, and I defined a low limit for the maximum memory for the job. I started with 27 workers and a total of 50 GB of memory for all workers combined. The futures for the job would have moments where they needed to use ~15 GB each. As the parfeval job ran, the number of active workers quickly dropped to about 3. That said, the submitted futures all ran successfully, despite the chaos of the crashing (failing) workers.
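The figure of about 3 surviving workers is consistent with simple arithmetic on the numbers above, as this short sketch shows. The variable names are illustrative, and the assumption that the pool's NumWorkers property reflects workers lost to crashes has not been verified here.

```matlab
totalMemGB = 50;                                 % memory limit for all workers combined
perTaskGB  = 15;                                 % approximate peak memory per future
maxConcurrent = floor(totalMemGB / perTaskGB);   % futures that fit in memory at once
fprintf('About %d futures can run at peak memory simultaneously.\n', maxConcurrent);

pool = gcp('nocreate');                          % do not start a pool if none exists
if isempty(pool)
    disp('No parallel pool is running.');
else
    fprintf('Pool currently reports %d workers.\n', pool.NumWorkers);
end
```

Sizing the pool to roughly this number from the start would likely avoid most of the crashes, at the cost of leaving cores idle while the futures are below their peak memory use.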

Release

R2023a
