Parfor loop with mex-file call crashes all workers on one computer, but runs fine on others

Hello!
We have code that runs in either serial or parallel mode. In the part under inspection, one of our mex-files is run either on the complete set of data (serial operation) or on a part of the data matching the current number of workers (parallel operation). In serial mode the code works fine, but in parallel mode on a 6-core computer (with 2, 4 or 6 workers) the workers crash with messages like:
Warning: A worker aborted during execution of the parfor loop. The parfor loop will now run again on the remaining
workers.
> In distcomp.remoteparfor/handleIntervalErrorResult (line 240)
In distcomp.remoteparfor/getCompleteIntervals (line 387)
In parallel_function>distributed_execution (line 745)
In parallel_function (line 577)
In foo3 (line 138)
In foo2 (line 142)
In foo1 (line 61)
In foo (line 166)
A little later sometimes we get:
Error using distcomp.remoteparfor/rebuildParforController (line 194)
All workers aborted during execution of the parfor loop.
Error in distcomp.remoteparfor/handleIntervalErrorResult (line 253)
obj.rebuildParforController();
Error in distcomp.remoteparfor/getCompleteIntervals (line 387)
[r, err] = obj.handleIntervalErrorResult(r);
...
The client lost connection to worker 3. This might be due to network problems, or the interactive communicating job might
have errored.
Warning: 4 worker(s) crashed while executing code in the current parallel pool. MATLAB will attempt to run the code again
on the remaining workers of the pool. View the crash dump files to determine what caused the workers to crash.
The crash dumps don't say a lot, but conclude with:
This error was detected while a MEX-file was running. If the MEX-file
is not an official MathWorks function, please examine its source code
for errors. Please consult the External Interfaces Guide for information
on debugging MEX-files.
When run on three other computers (two with 6-core CPUs and a notebook with a 2-core CPU), the code works fine in both serial and parallel mode. On those machines it is possible to create pools of different sizes, and the resulting output is always as expected. I have tried the code on all computers using the same number of workers (4) where possible.
This leads me to believe the mex-file is correct.
I am at a loss concerning what to try next, and would appreciate any hints on how to move forward.
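For reference, the parallel mode described above hands each worker a part of the data. The post does not show how the real code splits it, so the following is only a minimal sketch that assumes contiguous, near-equal chunks. The names `chunk_range` and `chunk_for_worker` are hypothetical and do not come from the original code.

```c
#include <assert.h>
#include <stddef.h>

/* Half-open index range [begin, end) into the full data set. */
typedef struct {
    size_t begin;
    size_t end;
} chunk_range;

/* Hypothetical helper: range for worker k (0-based) out of `workers`
   workers over n items. The first n % workers chunks get one extra
   item, so the chunks cover [0, n) exactly once with no overlap. */
chunk_range chunk_for_worker(size_t n, size_t workers, size_t k)
{
    assert(workers > 0 && k < workers);

    size_t base  = n / workers;   /* minimum items per worker */
    size_t extra = n % workers;   /* leftover items to spread out */

    chunk_range r;
    r.begin = k * base + (k < extra ? k : extra);
    r.end   = r.begin + base + (k < extra ? 1 : 0);
    return r;
}
```

With a split like this, a mex-file that indexes relative to the full data set instead of its own chunk would read out of range only in parallel mode, which is one thing worth ruling out.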
  7 Comments
Christopher Grose on 3 Dec 2019
Edited: Christopher Grose on 3 Dec 2019
I'm having this problem as well.
Could it occur because of uninitialized variables in the mex C function? (I don't think I have any uninitialized pointers.)
For example, my C code looks something like:
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    /* DECLARATIONS, INPUTS */
    double *Gphase = mxGetPr(prhs[0]);
    double *dGdC = mxGetPr(prhs[2]);
    double *StrainEM = mxGetPr(prhs[4]);
    double R = mxGetScalar(prhs[5]);
    int32_t p, c, ci, j;
    for (j = 0; j < 1000; j++) {
        somefunc(Gphase, dGdC, StrainEM, R, p, c, ci);
    }
}
Could the p, c, ci, j variables be screwing things up?
On a 28-core CPU I quickly lose half my workers, followed by a random but gradual elimination of further workers.
I get the error
Warning: A worker aborted during execution of the parfor loop. The parfor loop will now run again on the remaining workers.
and the worker that takes over completes the task without problems. The functions are all fully deterministic, so the workers must be dying for some other reason.
Jan on 4 Dec 2019
It depends on what happens inside somefunc(). The output of mxGetPr() should be treated as a const pointer, so do you modify the contents? p, c, and ci are declared but not initialized: do you use them correctly? Maybe "something like" conceals the actual problem. Please post the relevant part of the real code.
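The two points above (do not write through the input pointer, and give every local a defined value before it is read) can be illustrated with a small standalone C sketch. It deliberately leaves out `mex.h` so it compiles anywhere: a plain `const double *` stands in for the pointer returned by mxGetPr(), and `somefunc_safe` and `process_copy` are hypothetical names, not the real somefunc() from the comment, whose body was not posted.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for somefunc(): reads the input through a
   const pointer and writes only into a separate output buffer. */
static void somefunc_safe(const double *in, double *out, size_t n,
                          double R, int32_t p)
{
    for (size_t i = 0; i < n; i++) {
        out[i] = R * in[i] + (double)p;
    }
}

/* Returns a newly allocated result that the caller must free(),
   or NULL if n is 0 or the allocation fails. The input array is
   never modified. */
double *process_copy(const double *in, size_t n, double R)
{
    assert(in != NULL);

    int32_t p = 0;   /* initialized before use; reading an
                        uninitialized local is undefined behavior */

    if (n == 0) {
        return NULL;
    }
    double *out = malloc(n * sizeof *out);
    if (out == NULL) {
        return NULL;
    }
    somefunc_safe(in, out, n, R, p);
    return out;
}
```

Undefined behavior of this kind can appear to work on one machine and crash on another, so a mex-file running fine on three computers does not by itself show that it is correct.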

Answers (0)