Parallel workers automatically shutting down in the middle of RL parallel training.

Question

Matteo D'Ambrosio on 10 May 2023


https://kr.mathworks.com/matlabcentral/answers/1960844-parallel-workers-automatically-shutting-down-in-the-middle-of-rl-parallel-training

Commented: Matteo D'Ambrosio on 11 May 2023

Hello,

I am currently training a reinforcement learning PPO agent on a Simulink model with UseParallel=true. Training should run for 5000 episodes (about 10 to 11 hours), but as it goes on, more and more workers in the parallel pool shut down automatically, so training gets slower and slower. I start with 8 workers, and the count drops one at a time until errors are generated.

I see this consistently in every training run and would like to know if there are any workarounds.

For the parpool, I am letting MATLAB start it automatically with all options set to default. I have also tried changing the number of workers, but the same thing happens.
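
For readers following along, the setup described above could look roughly like the sketch below. It starts the pool explicitly rather than automatically, so the initial worker count is recorded before training begins. The worker count of 8 and the 5000 episodes come from the question; the names `agent` and `env` are assumptions standing in for the poster's PPO agent and Simulink environment, which are not shown in the thread, so the `train` call is left commented out.

```matlab
% Reuse an existing pool if there is one; otherwise start 8 workers
% (the count mentioned in the question) on the default cluster profile.
pool = gcp('nocreate');
if isempty(pool)
    pool = parpool(8);
end
fprintf('Pool started with %d workers\n', pool.NumWorkers);

% Training options as described in the question.
trainOpts = rlTrainingOptions('MaxEpisodes', 5000, 'UseParallel', true);

% agent and env are hypothetical placeholders for the poster's objects:
% trainingResult = train(agent, env, trainOpts);
```

Starting the pool this way does not by itself stop workers from dropping out; it only makes the starting state explicit so a later drop in `pool.NumWorkers` is easier to spot.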

2 Comments

Emmanouil Tzorakoleftherakis on 10 May 2023

What errors are you seeing? Maybe try training on a single worker initially to make sure you don't see any errors before moving to parallel.

Matteo D'Ambrosio on 10 May 2023

Edited: Matteo D'Ambrosio on 10 May 2023

On a single worker everything works fine. The errors occur after 3000+ training episodes, once the workers have slowly started shutting down one at a time. I've also run the environment validation function.
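
A minimal sketch of the serial check discussed in these two comments, assuming the "environment validation function" is `validateEnvironment` from Reinforcement Learning Toolbox. The helper name `serialSanityCheck` and the 20-episode budget are made up for illustration; `agent` and `env` stand for the poster's own objects.

```matlab
function serialSanityCheck(agent, env)
% serialSanityCheck  Hypothetical helper, not part of any toolbox.
% Validates the environment, then trains briefly without a parallel pool.

    % Runs a short simulation of the environment and errors if it fails.
    validateEnvironment(env);

    % Short serial run; the episode count is an arbitrary illustrative value.
    opts = rlTrainingOptions('MaxEpisodes', 20, 'UseParallel', false);
    train(agent, env, opts);
end
```

A run like this only shows that the model and agent work on one process. As the thread notes, the failures here appear after thousands of episodes in parallel, so a short serial run would not be expected to reproduce them.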

After training for this number of episodes, I also get the following error (tied to the PPO algorithm):

Dot indexing is not supported for variables of this type.

Error in rl.agent.rlPPOAgent/learnFromAdvantages (line 40)
    advantageData.Advantages,this.AdvantageBuffer_,...

Error in rl.train.parallel.AsyncPPOParallelTrainer/processSimOutput_ (line 62)
    learnData = learnFromAdvantages(agent,data.AdvantageData);

Error in rl.train.parallel.AsyncParallelTrainer/processFutures_ (line 22)
    processSimOutput_(this,out);

Error in rl.train.parallel.AbstractParallelTrainer/run (line 64)
    [F,outs,taskIDs] = processFutures_(this,F);

Error in rl.train.TrainingManager/train (line 479)
    run(trainer);

Error in rl.train.TrainingManager/run (line 233)
    train(this);

Error in rl.agent.AbstractAgent/train (line 136)
    trainingResult = run(trainMgr,checkpoint);
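
As a side note on the message itself: MATLAB raises this error whenever dot indexing is applied to a value that is not a struct or object, for example an empty array. The sketch below reproduces that in isolation. It is only an illustration of the error class; the thread does not establish what `data.AdvantageData` actually contained, and the empty value used here is an assumption.

```matlab
% Hypothetical stand-in: an empty double where a struct was expected.
advantageData = [];

try
    % Dot indexing into a non-struct value triggers the error.
    a = advantageData.Advantages;
catch err
    disp(err.message);
end
```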


Answer 1

Edric Ellis on 11 May 2023


https://kr.mathworks.com/matlabcentral/answers/1960844-parallel-workers-automatically-shutting-down-in-the-middle-of-rl-parallel-training#answer_1232774


If workers are leaving the pool one at a time while the pool is busy, this almost certainly means that they are crashing. I recommend contacting MathWorks support for help diagnosing and resolving this problem. You could also check the job storage location by running the following, to see if any "matlab_crash_dump.*" files have been left behind:

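% Job storage folder of the default cluster profile (no semicolon, so it is displayed).
jsl = parcluster().JobStorageLocation
% Recursively list any crash dump files left behind under that folder.
dir(fullfile(jsl, '**', 'matlab_crash_dump.*'))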
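
A slightly expanded sketch of the same check, in case it helps when gathering information for a support request. It captures the `dir` output, prints any dumps oldest first with their timestamps, and reports how many workers are still in the pool. It assumes the default cluster profile is a local one whose `JobStorageLocation` is a folder path; nothing here is specific to Reinforcement Learning Toolbox.

```matlab
% Folder where the default cluster profile stores job data.
jsl = parcluster().JobStorageLocation;

% Recursively collect crash dump files as a struct array.
dumps = dir(fullfile(jsl, '**', 'matlab_crash_dump.*'));

if isempty(dumps)
    disp('No crash dump files found.');
else
    % Print oldest first so the order of the crashes is visible.
    [~, order] = sort([dumps.datenum]);
    for k = order
        fprintf('%s  %s\n', dumps(k).date, ...
            fullfile(dumps(k).folder, dumps(k).name));
    end
end

% Report the current pool size without starting a new pool.
pool = gcp('nocreate');
if ~isempty(pool)
    fprintf('Workers currently in pool: %d\n', pool.NumWorkers);
end
```

If the dump timestamps line up with the moments the worker count dropped, that would support the crash explanation given in the answer.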

1 Comment

Matteo D'Ambrosio on 11 May 2023

Thanks for the help, I will contact MathWorks support to look into this further.
