Specify the parallel pool job timeout

Hi,
I am running some tests on a remote cluster (no local cluster). I submit my functions in batch mode. I know that the functions take a long time to execute, around 2 to 4 hours. When I try to run, I get the message:
'The parallel pool job was cancelled because it failed to finish within the specified parallel pool job timeout of 300 seconds'
I looked in the documentation for how to change the default timeout. The only way I could find is with the "wait" command, as in:
wait(job,"finished",18000);
However, I keep getting the same error. How can I change the default parallel pool job timeout on the remote cluster?

Answers (1)

Raymond Norris on 1 Oct 2021


So you're doing something like the following:
cluster = parcluster;
job = cluster.batch(@mycode,...., 'Pool',size);
Then what you're suggesting is that your code looks something like:
function mycode
pause(10 * 60)
parfor idx = 1:N
...
end
On the cluster, the workers have a default timeout of 5 minutes, so the job errors out because you're running code (the pause) for 10 minutes before the workers are used in the parfor:
'The parallel pool job was cancelled because it failed to finish within the specified parallel pool job timeout of 300 seconds'
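A minimal self-contained sketch of that scenario (the function name mycode, the pool size, and the 10-minute pause are placeholders; the 5-minute limit assumes the profile's default parallel pool job timeout):

```matlab
% mycode.m -- hypothetical reproduction of the timeout scenario:
% batch opens the pool workers up front, but they sit idle during the
% serial pause and exceed a 5-minute parallel pool job timeout before
% the parfor ever uses them.
function mycode
pause(10 * 60)        % 10 minutes of serial work; the pool workers are idle here
parfor idx = 1:100    % workers only become active at this point
    % ... per-iteration work ...
end
end

% Submitted from the client, e.g.:
%   c = parcluster;                          % default cluster profile
%   job = batch(c, @mycode, 0, {}, 'Pool', 31);
```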
I tried to reproduce this quickly with the local scheduler, but couldn't (it shouldn't matter that I'm using local). How are you getting the error message? And which scheduler are you using, MJS or a generic one (e.g. PBS)?

Comments (9)

Maria on 1 Oct 2021 (edited)
So, here is how my code looks:
delete(gcp('nocreate'));
parallel.defaultClusterProfile('MatlabCluster')
c = parcluster();
N = 32;
job = batch(c,@compute_H_matrix,1,{large data inputs},'Pool',N-1);
The function "compute_H_matrix" has some if/while conditions, then it calls another function, "internal_compute_H". In internal_compute_H there are two for-loops, the outermost of which is a parfor. In these loops I fill in the elements of the matrix H with a call to another function (I know, it is an involved code), "compute_integrals_H", which finally calls a MEX file, "compute_Ihp_mex" (the MEX is built on Linux, so it is compatible with the cluster). The reason for all the sub-calls is that the geometry of the problem is handled in each function based on criteria identified from the input.
After 5 minutes the job "fails" and, from getReport(job.Tasks(1).Error), I get this error message. The scheduler is MJS.
Maria on 1 Oct 2021 (edited)
I am thinking about what you said. The thing is that the checks and the steps done before entering the parfor are very quick. It is the parfor + for that requires the time, because the integral computation is slow and runs over 10000 x 10000 elements. As I said, all the checks done before the parfor are very quick; I can use the debugger and reach the parfor without any problem, because that is not the bottleneck, at least not on my computer...
In the function "internal_compute_H", I use a distributed array to allocate the matrix before the parfor. Could this be the problem?
Maria on 1 Oct 2021 (edited)
I took away the "distributed" array, but the job failed with this message:
['Cannot rerun job because at least one of its tasks has no rerun attempts left (The task has no rerun attempts left.).' ...
'Original cancel message:' ...
'MATLAB worker exited with status 9 during task execution.' ...
'Transport stopped. ']
Raymond Norris
OK, this timeout is specific to MJS. Open the Cluster Profile Manager and look at MatlabCluster. What are the timeouts set to? Inf, or 5 minutes?
Maria on 1 Oct 2021
5 min
Raymond Norris
That'll do it ;)
Secondly, parfor and distributed arrays aren't meant to be combined. Within an iteration of a parfor-loop, the entire unit of work is on its own; there is no inter-process communication with the other workers. So using distributed arrays within parfor is pointless (and possibly unsupported).
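A hedged sketch of the pattern this implies: preallocate an ordinary (non-distributed) array and let parfor treat it as a sliced output variable. compute_integrals_H and the 10000 x 10000 size come from this thread, but its signature here is an assumption:

```matlab
% Sketch: fill H inside parfor without a distributed array. Each
% iteration builds one full row locally, then assigns it as a sliced
% output (H(i, :)), which parfor supports; no inter-worker
% communication is needed.
n = 10000;
H = zeros(n, n);                 % ordinary array on the client
parfor i = 1:n
    row = zeros(1, n);           % local to this iteration's worker
    for j = 1:n
        row(j) = compute_integrals_H(i, j);  % hypothetical signature
    end
    H(i, :) = row;               % sliced output variable
end
```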
Maria
Aha, this is good to know!
But now I took away the distributed array and I get:
MATLAB worker exited with status 9 during task execution.
What does that mean?
Raymond Norris
One of the workers crashed, possibly because of an out-of-memory issue. Email Technical Support (support@mathworks.com) and they can walk you through debugging steps and getting the MJS log files to troubleshoot.
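For reference, one possible programmatic route around a short profile timeout (a sketch, not a verified fix: it assumes the MJS job Timeout property governs the same limit as the profile's parallel pool job timeout; 'MatlabCluster', compute_H_matrix, and the inputs are from this thread):

```matlab
% Sketch: build the pool-type communicating job explicitly so its
% Timeout can be set before submission, instead of relying on the
% profile default that batch picks up.
c = parcluster('MatlabCluster');
job = createCommunicatingJob(c, 'Type', 'pool', 'Timeout', Inf);
createTask(job, @compute_H_matrix, 1, {largeDataInputs});  % placeholder inputs
submit(job);
wait(job);                 % blocks on the client until the job finishes
out = fetchOutputs(job);   % cell array; out{1} is the H matrix
```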


Release: R2021a
Asked: 1 Oct 2021
Last commented: 2 Oct 2021
