The parallel cluster becomes unresponsive, while the program runs normally on local pool workers. How can this issue be resolved?

I have developed a program that utilizes parfor to run parallel workers. The structure of the program is as follows:
parfor ix = 1:numel(param_files)
% Read the parameter files
% Run fmincon optimization
end
The program functioned normally with all param_files when executed using local workers. However, when I attempted to run it on a parallel cluster, it became unresponsive for many hours while processing one of the parameter files during the optimization phase. As a result, I had to manually stop the parallel cluster. (Note: The hardware capabilities of the parallel cluster server are equivalent to those of my local PC.)
  Comments: 1
Sam Marshalik on 7 Oct 2024
Hey Dung, if the remote workers become unresponsive, it is generally due to some resource issue. Have you had a chance to monitor the resources on your cluster when the problematic parameter file is being processed? Is it maxing out CPU or Memory?
Something to consider is that you may be starting more workers on your cluster than it can support, so each worker may not have access to sufficient CPU or memory. We suggest 1 worker per physical CPU core, but you may need to run fewer if your work is resource intensive.
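As a minimal sketch of that suggestion, the snippet below reads the worker limit from a cluster profile and opens a deliberately smaller pool. It uses the default profile so it runs anywhere; the cap of 4 is a placeholder and not a recommendation for this particular cluster. It also assumes no pool is currently open.

```matlab
% Sketch: open a pool smaller than the profile's maximum.
% parcluster with no arguments uses the default profile; to target a
% specific cluster, pass its profile name, e.g. parcluster('MyCluster')
% ('MyCluster' is a hypothetical name).
c = parcluster;
fprintf('Profile allows up to %d workers\n', c.NumWorkers);

% Hypothetical cap: pick a value at or below the number of physical cores
% on the cluster machine, and lower it if each fmincon run needs a lot of
% memory.
workerCap = 4;
requestedWorkers = min(workerCap, c.NumWorkers);

% Assumes no parallel pool is open yet; parpool errors otherwise.
pool = parpool(c, requestedWorkers);
fprintf('Pool started with %d workers\n', pool.NumWorkers);
```

If the hang disappears with fewer workers, that points toward CPU or memory contention on the cluster machine rather than a problem with the parameter file itself.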


Answers (1)

Venkat Siddarth Reddy on 5 Oct 2024
Hi Dung,
I understand that the parallel cluster becomes unresponsive when you run the program on the cluster's parallel workers.
To troubleshoot the issue further, please consider the following steps:
  • Cluster Configuration: Ensure that the cluster is properly configured to handle the workload. Check the settings for the number of workers, memory allocation, and time limits. Verify that the cluster has access to all necessary files and resources. Sometimes file paths or dependencies are not set up correctly on the cluster.
  • I/O Access and speed: If the parameter files are large or numerous, ensure that data transfer to and from the cluster is not a bottleneck. Consider using distributed data storage that the cluster can access efficiently.
  • Parallel Overhead: While local execution might handle overhead seamlessly, clusters can introduce additional overhead due to communication between nodes. This can be especially problematic if the tasks are not sufficiently large or complex to justify parallel execution. Please verify if the parallel overhead is significantly large.
  • Optimization Behavior: The fmincon optimization might behave differently on the cluster due to differences in floating-point arithmetic or other environmental factors. Check if the optimization problem is well-conditioned and robust to such changes.
  • Debugging and Logging: Implement logging within the parfor loop to track progress and identify which parameter file causes the hang-up. Use MATLAB’s debugging tools to isolate the issue. Consider running a smaller subset of parameter files to see if the problem is specific to certain inputs.
  • Cluster-Specific Issues: Check for any cluster-specific issues such as network latency, node failures, or resource contention.
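The logging and optimization points above can be combined in one loop. The following is a rough sketch and not the original program: the file names, the objective, the start point, the bounds, and the limits are all stand-ins. It prints a timestamped line when each file starts and finishes, and it bounds every fmincon run with iteration limits plus an output function that requests a stop once a wall-clock budget is exceeded.

```matlab
% Sketch: log progress per file and bound each fmincon run.
% Everything marked "hypothetical" or "stand-in" is an assumption.
param_files = {'params_a.mat', 'params_b.mat'};  % hypothetical file names
maxSeconds  = 600;                               % hypothetical time budget
exitflags   = zeros(1, numel(param_files));

parfor ix = 1:numel(param_files)
    fprintf('[%s] start  %s\n', char(datetime('now')), param_files{ix});
    tStart = tic;

    % The output function returns true to ask fmincon to stop once the
    % time budget for this file has been used up.
    opts = optimoptions('fmincon', ...
        'Display', 'off', ...
        'MaxIterations', 500, ...
        'MaxFunctionEvaluations', 5000, ...
        'OutputFcn', @(x, optimValues, state) toc(tStart) > maxSeconds);

    % Stand-in problem; the real code would read param_files{ix} here.
    objective = @(x) sum((x - ix).^2);
    [~, fval, exitflag] = fmincon(objective, [0; 0], [], [], [], [], ...
        [-10; -10], [10; 10], [], opts);

    exitflags(ix) = exitflag;
    fprintf('[%s] finish %s  exitflag=%d  fval=%g  elapsed=%.1fs\n', ...
        char(datetime('now')), param_files{ix}, exitflag, fval, ...
        toc(tStart));
end
```

A file that logs a start line but no finish line is the one to investigate. An exitflag of -1 means the output function stopped the run, so a stalled optimization ends with a record instead of hanging the pool. Note that the time check only happens between iterations, so it cannot interrupt a single objective evaluation that never returns.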
I hope the above steps help you resolve the issue!
