Cluster multi-gpu training Error: Current pool is not local.
I am trying to scale up to a multi-GPU cluster for deep learning. I can run the model on a single GPU on the cluster with no issues, but when I switch to multiple GPUs I get this error:
Current pool is not local. Use 'delete(gcp)' to close parallel pool and run again.
My cluster submission function looks like this:
function job = submit_train_script()
cluster = parcluster();
% AdditionalSubmitArgs holds a single value, so each new assignment would replace
% the previous one; all sbatch flags are combined into one string here.
cluster.AdditionalProperties.AdditionalSubmitArgs = strjoin({ ...
'--gres=gpu:4', ... % Request 4 GPUs with sbatch
'--mail-type=ALL', ... % Send me an email if anything happens
'--email@example.com', ...
'--nodelist=Node002'}, ' '); % Use Node002
% Submit the job, ask for 4 CPU workers, one for each GPU
job = cluster.batch('train_fun', ...
"AutoAddClientPath",false, "CaptureDiary",true, ...
The training options are shown below. I request 4 GPUs and four CPU workers to match, then set the execution environment to "multi-gpu". This appears to be the recommended configuration for this type of work, and I cannot work out what is causing the error.
% Iteration = Number of (files*cells) / Minibatchsize
options = trainingOptions("adam", ...
ExecutionEnvironment="multi-gpu", ... % "cpu", "gpu", and "multi-gpu" options are available
MaxEpochs=50, ...
MiniBatchSize=10, ... % 25, or 10 for a 16 GB card
net = trainNetwork(ds,layers,options);
Thanks in advance,
Edric Ellis on 13 Jan 2023
I think you need to specify ExecutionEnvironment="parallel" for this situation. According to the trainingOptions reference page, "multi-gpu" is only for "multiple GPUs on one machine, using a local parallel pool based on your default cluster profile."
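A minimal sketch of that change, reusing the options from the question and altering only ExecutionEnvironment. It assumes that `ds` and `layers` are defined as in the original training script, and that the batch job was submitted with a pool of workers so that a cluster-based parallel pool exists when training starts. Neither assumption can be confirmed from the truncated code above.

```matlab
% Sketch only: ds and layers are assumed to be defined earlier in the
% training script, as in the question.
options = trainingOptions("adam", ...
    ExecutionEnvironment="parallel", ... % use the job's (non-local) parallel pool
    MaxEpochs=50, ...
    MiniBatchSize=10);

net = trainNetwork(ds,layers,options);
```

With "parallel", training runs on the workers of the current pool, whether that pool is local or on the cluster, which is why it should avoid the "Current pool is not local" check that "multi-gpu" triggers inside a batch job.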