Why does my job on the cluster stop producing output?

Patrick Laux on 12 Apr 2016
Commented: Simone Stünzi on 9 Jul 2021
Hey, I am using the Parallel Computing Toolbox on a Linux cluster (istan nodes and SLURM scheduler). The main routine (the parfor loop section) looks as follows. A 2-D array (MASKE) is used to select the time series that contain values, and the function core_eQM is applied to these time series: ...
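% MATLAB does not expand shell variables such as $WORKDIR, so read them with getenv
cd(getenv('WORKDIR'))
pc = parcluster('local')
pc.JobStorageLocation = fullfile(getenv('WORKDIR'), getenv('SLURM_JOB_ID'))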
% start the matlabpool with maximum available workers
% control how many workers by setting ntasks in your sbatch script
matlabpool(pc, getenv('SLURM_CPUS_ON_NODE'))
...
pardim = size(MASKE,2);
XX1 = NaN(7305,pardim);
parfor ii = 1:pardim
    if ~isnan(MASKE(1,ii))
        fprintf('%i \t \n', ii); %shows me progress of job (creates files, which are empty)
        if meth == 1
            xx1 = core_eQM(squeeze(VARIABLE_BSE(ii,jj,1:3653)),squeeze(WRF_VARIABLE(ii,jj,1:3653)),squeeze(WRF_VARIABLE(ii,jj,3654:7305)))
        elseif meth == 2
            ...
        end
    end
end
Now, I use the following script to submit it on the cluster:
#!/bin/bash
#SBATCH --job-name=imat_par_test
#SBATCH --output=matlab_parfor.out
#SBATCH --error=matlab_parfor.err
#SBATCH --partition=ivy
#SBATCH --time=72:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=20
source /etc/profile.d/00-modules.sh
module load app/matlab2014b
cd $WORKDIR
# Create a local work directory
mkdir -p "$WORKDIR/$SLURM_JOB_ID"
#cd "$WORKDIR/$SLURM_JOB_ID"
# Kick off matlab in the foreground, so the cleanup below runs only after it has finished
matlab -nodesktop < script_apply_BC.m
# Cleanup local work directory
rm -rf "$WORKDIR/$SLURM_JOB_ID"
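A minimal sketch of one way to tie the cleanup step to the end of the job, assuming bash: an EXIT trap removes the work directory only once the main command has returned, whether it succeeded or failed. The helper name `run_with_cleanup` is hypothetical and not part of the original script; in a batch script, the command passed to it would be the MATLAB call.

```shell
#!/bin/bash
# Hypothetical helper: run a command with a work directory that is
# removed only after the command has finished (successfully or not).
run_with_cleanup() {
    local scratch="$1"; shift
    (
        mkdir -p "$scratch" || exit 1
        # The trap fires when this subshell exits, i.e. after "$@" returns.
        trap 'rm -rf "$scratch"' EXIT
        "$@"
    )
}
```

A possible call would be `run_with_cleanup "$WORKDIR/$SLURM_JOB_ID" matlab -nodesktop < script_apply_BC.m`. Running the command inside a subshell keeps the trap local to this one call, and the exit status of the command is passed through to the caller.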
At the beginning (first few hours) the job runs fine. pardim is 420. Once the loop has processed approx. 250 of these, the procedure slows down and finally stalls, i.e. the job is still running but no longer produces output files. No problems are reported in the matlab_parfor.err file. I do not know how to analyse the problem in this case.
Any ideas?
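
One way to start analysing a stall like this is to check from a login node whether the job's output file is still being written. The sketch below assumes GNU coreutils (`stat -c %Y`), as found on typical Linux clusters; the helper names `seconds_since_update` and `is_stalled` are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print how many seconds ago a file was last modified.
# Assumes GNU coreutils (stat -c %Y).
seconds_since_update() {
    local file="$1" now mtime
    mtime="$(stat -c %Y -- "$file")" || return 1
    now="$(date +%s)"
    echo $(( now - mtime ))
}

# Hypothetical helper: succeed (exit 0) if the file has not changed for
# at least the given number of seconds; return 2 if it cannot be read.
is_stalled() {
    local file="$1" limit="$2" age
    age="$(seconds_since_update "$file")" || return 2
    [ "$age" -ge "$limit" ]
}
```

For example, `is_stalled matlab_parfor.out 1800 && echo "no output for 30 minutes"` could be run periodically. Output buffering may delay writes, so a long gap is a hint rather than proof of a hang. If job accounting is enabled on the cluster, `sstat -j <jobid>.batch --format=JobID,MaxRSS,AveCPU` reports memory and CPU use of the running batch step, which may help tell a memory problem apart from idle workers.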
Comments: 5
Patrick Laux on 9 Jul 2021
Unfortunately not, Simone. I just gave up.
If you find out more, I would be happy if you let me know.
Patrick
Simone Stünzi on 9 Jul 2021
I've increased idleTimeout to Inf and will let you know if that solves my issue.
Best, Simone


Answers (0)
