
Error when using parallel computing toolbox

Florian on 12 Jul 2023
Commented: Farhad on 6 Oct 2023
Hi,
I am running MATLAB on a Linux cluster, using the Parallel Computing Toolbox. Everything had worked so far, but now, when I run my code, I suddenly get the following error:
Parallel pool failed to start with the following error. For more detailed information, validate the profile 'Processes' in the Cluster Profile Manager.
Error in samplescript_expdata (line 278)
parpool(8)
Caused by:
Error using parallel.internal.pool.SpfClientSession
An unexpected error occurred accessing a parallel pool. The underlying error was: Timeout binding/connecting to specified endpoints.
So I went to the Cluster Profile Manager and ran the validation. However, the last stage fails as well and reports the lengthy error message pasted below.
Does anyone have any idea what's wrong and how I can fix this?
Thank you in advance for your help!
Stage started at 2:20:01 PM. Completed in 0 min 30 sec.
Error Report: An unexpected error occurred accessing a parallel pool. The underlying error was: Timeout binding/connecting to specified endpoints.
Debug Log: CLIENT LOG OUTPUT
Session starting on cluster type: Local, with name: Processes
Session failed to start when creating InteractiveClient. Error: Error using parallel.internal.pool.SpfClientSession
An unexpected error occurred accessing a parallel pool. The underlying error was: Timeout binding/connecting to specified endpoints.
Error in parallel.internal.pool.CppBackedSession.buildInteractiveClient (line 397)
clientSession = parallel.internal.pool.SpfClientSession(bindEndpoint, ...
Error in parallel.internal.pool.AbstractClusterPool>@(c)parallel.internal.pool.CppBackedSession.buildInteractiveClient(c,sessionInfo) (line 839)
sessionBuildFcn = @(c) parallel.internal.pool.CppBackedSession.buildInteractiveClient(c, sessionInfo);
Error in parallel.internal.pool.AbstractInteractiveClient/start (line 100)
[session, connFcn] = sessionBuildFcn(clus);
Error in parallel.internal.pool.AbstractClusterPool>iStartClient (line 874)
spmdInitialized = client.start(sessionBuildFcn, sessionInfo, numWorkers, cluster, ...
Error in parallel.internal.pool.AbstractClusterPool.hBuildPool (line 636)
iStartClient(client, sessionInfo, forceSpmdEnabled, cluster, supportRestart, argsList);
Error in parallel.internal.types.ValidationStages>iOpenPoolForCluster (line 510)
aPool = parallel.internal.pool.AbstractClusterPool.hBuildPool('Cluster', cluster, ...
Error in parallel.internal.types.ValidationStages>@()iOpenPoolForCluster(runInfo)
Error in parallel.internal.types.ValidationStages>iCallWithNoHotlinks (line 391)
[varargout{1:nargout}] = fcn();
Error in parallel.internal.types.ValidationStages>iRunParpoolStage (line 302)
[commandWindowOutput, aPool] = evalc(iWrapForEvalc(openPoolFcn));
Error in parallel.internal.types.ValidationStages/run (line 74)
[eventData, runInfo] = obj.RunFunction(obj, runInfo);
Error in parallel.internal.validator.Validator/runValidationSuite (line 191)
[eventData, stageRunInfo] = currentStage.run(stageRunInfo);
Error in parallel.internal.validator.Validator/validate (line 103)
status = obj.runValidationSuite(profileName, suite);
Error in parallel.internal.ui.AbstractValidationManager/validate (line 36)
obj.Validator.validate(profileName, validationSuite);
Error in parallel.internal.ui.ValidationManager.validateProfile (line 36)
parallel.internal.ui.ValidationManager.getOrCreateInstance().validate(profileName, suite);
Session failed to start with message: Error using parallel.internal.pool.SpfClientSession
An unexpected error occurred accessing a parallel pool. The underlying error was: Timeout binding/connecting to specified endpoints.
Error in parallel.internal.pool.CppBackedSession.buildInteractiveClient (line 397)
clientSession = parallel.internal.pool.SpfClientSession(bindEndpoint, ...
Error in parallel.internal.pool.AbstractClusterPool>@(c)parallel.internal.pool.CppBackedSession.buildInteractiveClient(c,sessionInfo) (line 839)
sessionBuildFcn = @(c) parallel.internal.pool.CppBackedSession.buildInteractiveClient(c, sessionInfo);
Error in parallel.internal.pool.AbstractInteractiveClient/start (line 100)
[session, connFcn] = sessionBuildFcn(clus);
Error in parallel.internal.pool.AbstractClusterPool>iStartClient (line 874)
spmdInitialized = client.start(sessionBuildFcn, sessionInfo, numWorkers, cluster, ...
Error in parallel.internal.pool.AbstractClusterPool.hBuildPool (line 636)
iStartClient(client, sessionInfo, forceSpmdEnabled, cluster, supportRestart, argsList);
Error in parallel.internal.types.ValidationStages>iOpenPoolForCluster (line 510)
aPool = parallel.internal.pool.AbstractClusterPool.hBuildPool('Cluster', cluster, ...
Error in parallel.internal.types.ValidationStages>@()iOpenPoolForCluster(runInfo)
Error in parallel.internal.types.ValidationStages>iCallWithNoHotlinks (line 391)
[varargout{1:nargout}] = fcn();
Error in parallel.internal.types.ValidationStages>iRunParpoolStage (line 302)
[commandWindowOutput, aPool] = evalc(iWrapForEvalc(openPoolFcn));
Error in parallel.internal.types.ValidationStages/run (line 74)
[eventData, runInfo] = obj.RunFunction(obj, runInfo);
Error in parallel.internal.validator.Validator/runValidationSuite (line 191)
[eventData, stageRunInfo] = currentStage.run(stageRunInfo);
Error in parallel.internal.validator.Validator/validate (line 103)
status = obj.runValidationSuite(profileName, suite);
Error in parallel.internal.ui.AbstractValidationManager/validate (line 36)
obj.Validator.validate(profileName, validationSuite);
Error in parallel.internal.ui.ValidationManager.validateProfile (line 36)
parallel.internal.ui.ValidationManager.getOrCreateInstance().validate(profileName, suite);.
Failed to run the DisarmableOncleanup callback due to the following error:
Unrecognized method, property, or field 'pStopLabsAndDisconnect' for class 'parallel.internal.pool.InteractivePoolClient'.
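A minimal sketch of how the failing local pool could be reset and retried, assuming that stale job data or the host name the client reports are involved. Neither cause is confirmed anywhere in this thread; the profile name 'Processes' and the pool size 8 are taken from the error message and the question, and the comparison of host names is only a diagnostic aid.

```matlab
% Sketch only: reset the state of the local 'Processes' profile and retry.
% Nothing here is confirmed as the fix for the error in this thread.

% Close any pool that this MATLAB session still considers open.
delete(gcp('nocreate'));

% 'Processes' is the profile name reported in the error message.
c = parcluster('Processes');

% Remove leftover job data for this profile (this deletes ALL of its jobs).
delete(c.Jobs);

% Show the client host name and port range configured for the toolbox.
cfg = pctconfig();
fprintf('Client host name used by the toolbox: %s\n', cfg.hostname);
disp(cfg.portrange);

% Compare with the host name the operating system reports for this node.
[~, osHost] = system('hostname');
fprintf('Host name reported by the OS: %s\n', strtrim(osHost));

% Retry with the same pool size as in the question.
pool = parpool(c, 8);
```

If the two host names differ, or the name does not resolve on the node, `pctconfig('hostname', ...)` can be used to set a name under which the client is reachable; whether that applies here is an assumption.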
  1 Comment
Edric Ellis on 13 Jul 2023
I suggest you contact MathWorks support for help with this.


Answers (1)

Debadipto on 1 Aug 2023
Hi Florian,
Please refer to the following article:
If this doesn't solve the issue, then please reach out to MathWorks support for help.
Regards,
Debadipto Biswas
  2 Comments
Farhad on 6 Oct 2023
Hello,
I am also running Parallel Server on a cluster with SLURM as the scheduler.
I created a generic profile because there is no shared storage between the users (clients) and the worker nodes. In the validation process everything runs fine except the last step, where I get the same error message as posted above.
Unfortunately, I can't access the link you posted.
Error using parallel.internal.pool.SpfClientSession
An unexpected error occurred accessing a parallel pool. The underlying error was: Timeout binding/connecting to specified endpoints.
Farhad on 6 Oct 2023
Update:
I also have matlab-proxy (https://github.com/mathworks/matlab-proxy) in use on the cluster.
When I first start a session through matlab-proxy and then use the Slurm profile, the pool starts successfully.
The output shows:
Got clientEndpoint to connect to worker 1 with URL: tcp://tcpnodelay=node0:27583/protocol/catapult
Got clientEndpoint to connect to worker 2 with URL: tcp://tcpnodelay=node0:27370/protocol/catapult
Client starting to connect to workers
Connected to parallel pool with 2 workers.
But when I try the same from my Windows MATLAB client, it doesn't work.
I get almost the same output:
Got clientEndpoint to connect to worker 1 with URL: tcp://tcpnodelay=node0:27583/protocol/catapult
Got clientEndpoint to connect to worker 2 with URL: tcp://tcpnodelay=node0:27370/protocol/catapult
But then, after a while (the timeout duration), the connection fails:
Error using parallel.internal.pool.SpfClientSession
An unexpected error occurred accessing a parallel pool. The underlying error was: Timeout binding/connecting to specified endpoints.
I am using a generic profile in which I define the AdditionalProperties ClusterHost. There I entered the publicly available domain name of the login/head node of the cluster, but the workers themselves are not reachable from outside.
So I guess the binding/connecting fails because there is a private cluster network behind the login node, and the clientEndpoint is not proxied through to the MATLAB client machine (Windows desktop).
Is there any known issue about this? Or am I missing some configuration in the generic profile?
Thanks in advance.
Best regards,
Farhad
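The guess above can be checked from the Windows client with a plain TCP connection attempt. The following is a minimal sketch, not a confirmed diagnosis: it parses the two endpoint lines quoted above and tries to open each one with `tcpclient`. The 5 second timeout is an arbitrary illustrative value. Because the workers only listen on these ports while a pool is starting, a port check is meaningful only during that window; a failure to resolve `node0` at all is informative at any time.

```matlab
% Sketch only: test from the client machine whether the worker endpoints
% reported in the pool start-up output accept TCP connections.
logLines = {
    'Got clientEndpoint to connect to worker 1 with URL: tcp://tcpnodelay=node0:27583/protocol/catapult'
    'Got clientEndpoint to connect to worker 2 with URL: tcp://tcpnodelay=node0:27370/protocol/catapult'
    };

% Extract host and port from each URL.
hosts = cell(size(logLines));
ports = zeros(size(logLines));
for k = 1:numel(logLines)
    tok = regexp(logLines{k}, 'tcpnodelay=([^:/]+):(\d+)', 'tokens', 'once');
    hosts{k} = tok{1};
    ports(k) = str2double(tok{2});
end

timeoutSeconds = 5;  % illustrative value, not taken from the thread
for k = 1:numel(hosts)
    try
        % tcpclient throws an error if the connection cannot be established.
        t = tcpclient(hosts{k}, ports(k), 'ConnectTimeout', timeoutSeconds);
        clear t  % close the test connection again
        fprintf('%s:%d accepts connections from this machine\n', hosts{k}, ports(k));
    catch err
        fprintf('%s:%d is not reachable from this machine: %s\n', ...
            hosts{k}, ports(k), err.message);
    end
end
```

If the endpoints turn out to be unreachable, one option that avoids a direct client-to-worker connection is to submit the work with `batch` and its `'Pool'` option, so that the code driving the pool runs inside the cluster; whether that suits this setup is an assumption.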



Release

R2023a
