Error using parallel-cpu with trainnet function

Hi, I got this error when trying to train a neural network using the trainnet function. The first time I trained the network, everything worked fine, but when I tried to use validation data I got an error. I then tried to retrain the network without validation data, but got this error:
Error detected on worker 3.
net = train(trainer, net, mbq);
Error in trainnet (line 42)
[net,info] = deep.internal.train.trainnet(mbq, net, loss, options, ...
Caused by:
Out of Memory during deserialization
I have deleted the cluster's jobs, but that did not help.
Can anybody help me solve the issue?

3 Comments

Saurabh on 22 Nov 2024
Hi @Ramiro,
Could you please share details of the neural network and the memory allocated per worker? The minimum required memory per worker is 4 GB, with 8 GB recommended. To optimize performance, consider increasing the memory available to each worker, which can typically be achieved by running fewer workers per compute node.
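As a rough illustration of the "fewer workers" suggestion, the sketch below sizes the pool from the 8 GB per-worker recommendation. The value of totalRamGB is a hypothetical placeholder for the RAM on one compute node, and the default cluster profile is assumed; adjust both for your setup.

```matlab
% Minimal sketch: choose a pool size so each worker gets about 8 GB.
totalRamGB  = 32;   % hypothetical: RAM available on the compute node
perWorkerGB = 8;    % recommended memory per worker (from the comment above)

% Never go below one worker.
numWorkers = max(1, floor(totalRamGB / perWorkerGB));

% Close any existing pool, then start a smaller one on the default profile.
delete(gcp("nocreate"));
pool = parpool(numWorkers);
```

With fewer workers training is less parallel, but each worker has more headroom when the network and mini-batches are deserialized.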
Ramiro on 22 Nov 2024
Hi @Saurabh, I have attached the files of the source code.
Regards
Ramiro
Please also share the training and test MAT files.
They are required to run the script.


Answers (1)

Sivsankar on 7 Mar 2025

Without the ability to execute the scripts, it is challenging to pinpoint the exact solution to this error. However, I can offer some troubleshooting steps that may assist you.
As the error message indicates, this could be a memory issue on the workers. Ensure that sufficient memory is allocated per core on the cluster. Since the error occurred when retraining the network, also verify that the memory used by the previous run has been released.
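One way to release memory held by a previous run is to shut down the parallel pool and drop the old results from the client workspace before retraining. This is only a sketch: the variable names net and info are taken from the error trace in the question and may differ in your script.

```matlab
% Minimal sketch: free worker and client memory before retraining.

% Shutting down the pool ends the worker processes and releases their memory.
delete(gcp("nocreate"));

% Remove results of the previous run from the client workspace
% (assumed variable names; clear does not error if they do not exist).
clear net info

% A fresh pool is created automatically the next time parallel training runs,
% or explicitly with parpool.
```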
Additionally, if training runs on a GPU rather than on CPU workers, the error may suggest that the GPU lacks adequate memory to complete the network training. In either case, to reduce memory requirements, consider decreasing the "MiniBatchSize" training option or reducing the size of your training dataset. If a GPU is used, also ensure that its memory is cleared before initiating retraining.
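A minimal sketch of these two suggestions is shown below. The "adam" solver and the mini-batch size of 32 are illustrative assumptions, not values from the original script; the default MiniBatchSize is 128.

```matlab
% Minimal sketch: smaller mini-batches on parallel CPU workers.
options = trainingOptions("adam", ...          % solver is an assumption
    MiniBatchSize=32, ...                      % hypothetical value, default is 128
    ExecutionEnvironment="parallel-cpu");

% Only relevant when a GPU is used: reset it to clear its memory.
% Note that this invalidates any existing gpuArray variables.
if canUseGPU
    reset(gpuDevice);
end
```

The options object is then passed to trainnet as usual. Smaller mini-batches lower the peak memory of each iteration but may increase training time.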
I hope these suggestions prove helpful.

Release

R2024a

Asked:

21 Nov 2024

Answered:

7 Mar 2025
