How can I write a parfor loop in MATLAB without the workers aborting during execution? Thanks

15 views (last 30 days)
I am facing an issue where workers are aborted while the code runs in parallel using parfor. It may be due to a non-uniform distribution of the workload across the workers, so a few workers sit idle once the work allocated to them is finished. I tried setting the idle timeout to infinity and setting spmd support to false, but neither helped. The error I am getting is shown below:
[Warning: A worker aborted during the execution of the parfor loop. The parfor loop
will now run again on the remaining workers.]
[> In distcomp/remoteparfor/handleIntervalErrorResult (line 240)
In distcomp/remoteparfor/getCompleteIntervals (line 387)
In parallel_function>distributed_execution (line 745)
In parallel_function (line 577)
In generateantenna_dscale (line 37)]
Can anyone suggest how to write my parallel loop so that the work is distributed uniformly across all workers without any of them disconnecting during execution? How should I write parallel code in MATLAB when the iterations are independent and the time each one takes depends on a random number generated inside the parfor loop? A code snippet would be a great help. Thanks in advance.
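For reference, a minimal sketch of the kind of loop described above: iterations are independent, and each one's cost depends on a random draw made inside the loop. parfor already schedules iterations onto workers dynamically, so uneven iteration times alone should not abort workers. The pool size of 20 and the problem sizes here are arbitrary assumptions for illustration:

```matlab
% Minimal sketch: parfor with randomly varying per-iteration workload.
% Pool size (20) and problem sizes are placeholder assumptions.
pool = parpool('local', 20);   % start fewer workers if RAM per worker is tight
pool.IdleTimeout = Inf;        % keep the pool alive between runs

n = 1000;
results = zeros(1, n);
parfor k = 1:n
    m = randi([100 1000]);     % random problem size drawn inside the loop
    A = rand(m);               % work per iteration scales with the draw
    results(k) = norm(A);      % sliced output; iterations stay independent
end

delete(pool);
```

Because parfor hands out iterations as workers become free, slow iterations do not leave other workers idle; if workers are still aborting, memory pressure (as discussed in the accepted answer) is the more likely cause.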

Accepted Answer

Raymond Norris on 24 Sep 2021
The entire pool idles out, not a single worker. My guess is that a worker is crashing because it runs out of memory. Tell us a bit more:
  • the scheduler you're using (local, MJS, generic (e.g. PBS))
  • number of cores per node
  • RAM per node
  • size of the pool you're running
  • size of data being sent back and forth
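Some of these details can be gathered from MATLAB itself. A quick sketch using standard Parallel Computing Toolbox calls (the 'local' profile name is the default one; `memory` is Windows-only):

```matlab
% Inspect the cluster profile and any running pool.
c = parcluster('local');
fprintf('Workers available: %d\n', c.NumWorkers);

p = gcp('nocreate');           % current pool, or [] if none is running
if ~isempty(p)
    fprintf('Pool size: %d, IdleTimeout: %g min\n', ...
        p.NumWorkers, p.IdleTimeout);
end

if ispc
    [~, sys] = memory;         % memory() is only available on Windows
    fprintf('Physical RAM: %.1f GB\n', sys.PhysicalMemory.Total / 2^30);
end
```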
3 comments
Raymond Norris on 25 Sep 2021
With 192 GB/node and 40 cores, each worker will be allocated roughly 4-5 GB. Without knowing much about the work you're doing, the recommendation is to allocate 4 GB for MATLAB. Since you're running "workers" instead of full-blown "matlab", let's scale the 4 GB recommendation down to 2 GB. That leaves you with about 100 GB for all 40 workers. Does that give you enough memory to do the job?
To troubleshoot this a bit:
  • Use a system tool (e.g. top on Linux or Task Manager on Windows) to monitor your RAM as you run your local job. This ought to tell you if you're exceeding your memory limits.
  • I'm going to assume you use all 40 workers, and although your work might be time intensive, I'm betting it's more memory intensive. My suggestion would be to scale back the 40 workers to, say, 20. This gives each worker twice the amount of memory to work with on the single node. It may take a bit longer than you'd like, but the parfor-loop might also come to completion.
  • ticBytes/tocBytes won't show memory consumption inside the parfor, but it will at least help you see how much data is getting passed back and forth. Initially, scale back the size of the data in your code. Somewhere, either inside or outside of the parfor, you're generating the data. For example:
data = rand(100000);   % 100000-by-100000 doubles, about 80 GB
data = fread(fid);     % reads the entire file into memory
data = ...
  • Find ways to scale back how much you're reading, generating, etc. until you get to a manageable size. If you then find you can only process, say, 1/3 of what you need, start thinking about adding more nodes (and therefore access to more memory) and scaling with MATLAB Parallel Server across those nodes.
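The ticBytes/tocBytes measurement mentioned above can be sketched as follows; the loop body is a placeholder stand-in for the real workload:

```matlab
% Measure bytes transferred to and from the workers around a parfor loop.
% ticBytes/tocBytes are documented Parallel Computing Toolbox functions;
% the loop body here is a placeholder for the actual workload.
p = gcp;                       % get (or start) the current pool
ticBytes(p);
parfor k = 1:100
    data = rand(1000);         % start small, then grow toward the real size
    out(k) = sum(data(:));     % sliced output keeps transfer per iteration small
end
tocBytes(p)                    % prints bytes sent to / received from each worker
```

Comparing the reported totals as you grow the data size shows how transfer volume scales, which helps decide whether data should be generated on the workers instead of broadcast from the client.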
Kedar Pakhare on 8 Apr 2023
Have you resolved the issue you were facing with parallel computing and workers aborting? I am facing an almost identical issue and do not know how to debug it. Any help would be greatly appreciated. Thanks in advance.


More Answers (0)
