Hello,
I am trying to run a simple parfor script on nodes in our cluster. The code works fine until I try to use more than 46 CPUs (workers) at once on one server. Some of our latest nodes have 128 AMD cores. I can run up to 56 cores on our Intel CPU servers (nodes), but on any AMD node I get errors (Java runtime and others) when using more than 46 cores. It would be great to use all 128 cores on these new nodes for our MATLAB code. I have tried increasing memory, and I still get these errors when using more than 46 cores.
I will attach the MATLAB crash dump, code and sbatch files.
My sbatch file (I have tried many different parameters):
#!/bin/bash
#SBATCH -J pfor_matlab
#SBATCH -o pfor".%j".out
#SBATCH -e pfor".%j".err
#SBATCH -t 45:00
#SBATCH -N 1
#SBATCH -p normal
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=48
module load matlab
hostname -s
env | egrep SLURM
matlab -nosplash -nodesktop -r "pfor"
The sbatch job produces this output in the SLURM .err file:
Error using parpool (line 145)
Parallel pool failed to start with the following error. For more detailed
information, validate the profile 'local' in the Cluster Profile Manager.
Error in pfor (line 5)
parpool('local', str2num(getenv('SLURM_CPUS_PER_TASK')))
Caused by:
Error using parallel.internal.pool.InteractiveClient>iThrowWithCause (line
670)
Failed to initialize the interactive session.
Error using
parallel.internal.pool.InteractiveClient>iThrowIfBadParallelJobStatus
(line 781)
The interactive communicating job failed with no message
Thank you for any pointers!
Mark

Comments: 2

Walter Roberson on 9 Feb 2021
The volunteers are not likely to know the solution for this; you should open a support case.
Mark PIERCY on 9 Feb 2021
Thanks Walter, I just submitted one.
Best,
Mark


Accepted Answer

Mark PIERCY on 2 Mar 2021
Edited: Mark PIERCY on 2 Mar 2021


This happens on both AMD and Intel nodes in our cluster. On our 128-core nodes, parpool(128) works once the nproc limit is raised to 32,768.
Our systems architect figured this out. It turns out that MATLAB was segfaulting when it hit the default nproc limit (the maximum number of user processes), which is set to 4,096.
The issue is with /etc/security/limits.d/20-nproc.conf, provided by the PAM RPM on CentOS 7, which limits every user to 4,096 processes at once. That is evidently not enough for MATLAB when it starts a large parallel pool.
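For anyone who administers a similar system, here is a minimal sketch of what an override might look like. Entries in the limits.d files follow the pam_limits format `<domain> <type> <item> <value>`. The drop-in name `90-nproc-matlab.conf` and the `OUT_DIR` scratch directory are hypothetical, and 32768 is simply the value mentioned above. The higher-numbered name is chosen on the assumption that it is read after 20-nproc.conf. The script only writes the file to a scratch location for review, and does not touch /etc.

```shell
#!/bin/bash
# Sketch: generate a pam_limits drop-in that raises the per-user nproc soft limit.
# OUT_DIR and the file name 90-nproc-matlab.conf are hypothetical choices.
OUT_DIR=${OUT_DIR:-$(mktemp -d)}
DROPIN="$OUT_DIR/90-nproc-matlab.conf"
NPROC=32768   # value reported in this thread to work for parpool(128)

# Format: <domain> <type> <item> <value>; "*" covers all non-root users.
cat > "$DROPIN" <<EOF
*    soft    nproc    $NPROC
EOF

echo "Wrote $DROPIN:"
cat "$DROPIN"
# An administrator would then review it and copy it into place, for example:
#   sudo install -m 0644 "$DROPIN" /etc/security/limits.d/
```

Limits set through pam_limits are applied when a session starts, so a change like this would normally only be visible in new logins or newly started jobs.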
Details:
>> parpool('local', 46)
Starting parallel pool (parpool) using the 'local' profile ...
*** Error in `/share/software/user/restricted/matlab/R2020a/bin/glnxa64/MATLAB': double free or corruption (!prev): 0x00007f4e4027d090 ***
*** Error in `/share/software/user/restricted/matlab/R2020a/bin/glnxa64/MATLAB*** Error in `/share/software/user/restricted/matlab/R2020a/bin/glnxa64/MATLAB': free(): corrupted unsorted chunks: 0x00007f4e401e10d0 ***
A bad case of segmentation fault:
[7191121.585896] MATLAB[120580]: segfault at 118c0f20 ip 00007fa1cd076b8d sp 00007fa1973fbc60 error 4 in libc-2.17.so[7fa1cd03d000+1c2000]
[7191121.610055] traps: MATLAB[120453] general protection ip:7fcb5d0b0b8d sp:7fcb273fbc60 error:0 in libc-2.17.so[7fcb5d077000+1c2000]
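If changing the system-wide file is not an option, a job script can at least report the limit it runs under and try to raise its own soft limit before MATLAB starts. The following is a minimal sketch, assuming bash and assuming the hard limit on the node is higher than the soft limit (check with `ulimit -H -u`); `TARGET` is an illustrative variable, not something from the original sbatch file.

```shell
#!/bin/bash
# Sketch: inspect the per-user process limit (nproc) and raise the soft limit
# toward TARGET before launching MATLAB. 32768 is the value from this thread.
TARGET=${TARGET:-32768}

soft=$(ulimit -S -u)   # current soft limit: a number or "unlimited"
hard=$(ulimit -H -u)   # ceiling an unprivileged user cannot exceed
echo "nproc before: soft=${soft} hard=${hard}"

if [ "$soft" != "unlimited" ] && [ "$soft" -lt "$TARGET" ]; then
    if [ "$hard" = "unlimited" ] || [ "$hard" -ge "$TARGET" ]; then
        ulimit -S -u "$TARGET"
    else
        # TARGET is out of reach; go as high as the hard limit allows.
        ulimit -S -u "$hard"
    fi
fi

echo "nproc after: soft=$(ulimit -S -u)"
# The MATLAB launch from the sbatch file would follow here, e.g.:
#   matlab -nosplash -nodesktop -r "pfor"
```

On Linux this limit counts threads as well as processes for the user, which may be why a pool of a few dozen workers can run into a limit of 4,096.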


Release: R2020a

Asked: 19 Jan 2021
Commented: 2 Mar 2021
