Strange core usage when running Slurm jobs
I'm trying to run jobs on an HPC cluster using Slurm, but I run into problems both when I'm running interactive jobs and when I'm submitting batch jobs.
- When I run an interactive job on a single node, the job uses all 20 of the node's cores. But when I book more than one node for an interactive job, the cores on the extra nodes are left unused.
- When I run a batch job, it uses only one core per node.
Do you have any idea what I might be doing wrong?
1. I book my interactive job from the command prompt using the following commands:
interactive -A myAccountName -p devel -n 40 -t 0:30:00
module load matlab/R2023a
matlab
to submit a 30-minute, 40-core job to the "devel" partition using my account (not actually called "myAccountName"), load the Matlab module, and launch Matlab as an X application. Once in Matlab, I first select the "Processes" parallel profile and then run the "Setup" and "Interactive" sections of the silly little script at the bottom of this question. In two separate terminal sessions, I then use
ssh MYNODEID
htop
where MYNODEID is either of the two nodes assigned to the interactive job. I then see that the job uses all of the cores on one node and none of the cores on the other.
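As a side note, a quick way to double-check where the pool workers actually land, without ssh-ing around, is to ask each worker for its host name. This is a hedged sketch that assumes a pool from the "Processes" profile is already open in the Matlab session:

```matlab
% Sketch: print the host name of each pool worker. With the
% "Processes" profile, every worker should report the same node,
% since that profile only starts workers on the local machine.
spmd
    [~, host] = system('hostname');
    fprintf('Worker %d runs on %s', spmdIndex, host);
end
```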
2. To book my batch job, I load and launch Matlab from the command prompt using the following commands
module load matlab/R2023a
matlab
and then run the "Setup" and "Batch" sections in the silly little script at the bottom of this question. Using the same procedure as above, htop shows that the job uses two cores (one on each node) and leaves the remaining 38 cores (19 on each node) unused.
Silly little script
%% Setup
clear;
close all;
clc;
N = 1000; % Length of fmincon vector
%% Interactive
x = solveMe(randn(1, N));
%% Batch
Cluster = parcluster('rackham R2023a');
Cluster.AdditionalProperties.AccountName = 'myAccountName';
Cluster.AdditionalProperties.QueueName = 'devel';
Cluster.AdditionalProperties.WallTime = '0:30:00';
Cluster.batch( ...
@solveMe, ...
0, ...
{}, ...
'pool', 39 ...
); % Submit a 30-minute 40-core job to the "devel" partition using my account (not actually called "myAccountName")
%% Helper functions
function A = slowDown()
% Busy-work so that each constraint evaluation takes long enough
% to show up in htop
A = randn(5e3);
A = A + randn(5e3);
end
function x = solveMe(x0)
opts = optimoptions( ...
"fmincon", ...
"MaxFunctionEvaluations", 1e6, ...
"UseParallel", true ...
);
x = fmincon( ...
@(x) 0, ...
x0, ...
[], [], ...
[], [], ...
[], [], ...
@(x) nonlinearConstraints(x), ...
opts ...
);
function [c, ceq] = nonlinearConstraints(x)
c = [];
A = slowDown();
ceq = 1 ./ (1:numel(x)) - cumsum(x);
end
end
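One possible explanation for the interactive behaviour above (hedged, since it depends on which profile is selected): the bundled "Processes" profile only starts workers on the machine that Matlab itself runs on, so a pool created from it can never reach a second node. A minimal way to see that limit:

```matlab
% The "Processes" profile starts workers only on the local node, so
% its worker count is bounded by that one node's cores -- which would
% leave the second node's cores idle in the interactive case.
c = parcluster('Processes');
disp(c.NumWorkers)
```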
Accepted Answer
Damian Pietrus
19 March 2024
Based on your code, it looks like you have correctly configured a cluster profile to submit jobs to MATLAB Parallel Server. In this setup, your MATLAB client always submits a secondary job to the scheduler, and it is that secondary job that should request the bulk of your resources. On the cluster login node, you should ask for only a few cores (enough to run your serial MATLAB code), but with a longer WallTime:
% Two cores, 1 hour WallTime
interactive -A myAccountName -p devel -n 2 -t 1:00:00
module load matlab/R2023a
matlab
Next, you should continue to use the AdditionalProperties fields to shape your "inner" job:
%% Batch
Cluster = parcluster('rackham R2023a');
Cluster.AdditionalProperties.AccountName = 'myAccountName';
Cluster.AdditionalProperties.QueueName = 'devel';
Cluster.AdditionalProperties.WallTime = '0:30:00';
When you call the MATLAB batch command, this is where you can then request the total amount of cores that you would like your parallel code to run on:
myJob40 = Cluster.batch(@solveMe, 0, {},'pool', 39);
myJob100 = Cluster.batch(@solveMe, 0, {},'pool', 99);
Notice that since this submits a completely separate job to the scheduler queue, you can choose a pool size larger than what you requested in your 'interactive' CLI command. Also notice that the WallTime in Cluster.AdditionalProperties is shorter than the 'interactive' value; this leaves headroom for the time the inner job may spend waiting in the queue.
Long story short -- when you call batch or parpool within a MATLAB session that has a Parallel Server cluster profile set up, it submits a secondary job to the scheduler that can have its own separate resources. You can verify this by manually viewing the scheduler's job queue.
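For example, here is a hedged sketch of how you might watch the two-job structure from the MATLAB client, reusing the `myJob40` handle from the batch call above (the shell escape runs `squeue` on the node hosting your MATLAB session):

```matlab
% The outer interactive allocation and the inner Parallel Server
% job should appear as two separate entries in the queue:
!squeue -u $USER

% Monitor the inner job from the client:
wait(myJob40);    % block until the inner job finishes
diary(myJob40)    % print the job's command-window output
```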
Please let me know if you have any further questions!
4 Comments
Damian Pietrus
21 March 2024
Thanks for including that -- it looks like your integration scripts are from around 2018. Since they are a bit out of date, they don't include some changes that will hopefully fix the core-binding issue you're experiencing. I'll reach out to you directly, but for anyone else who finds this post in the future, you can get an updated set of integration scripts here: