PPO algorithm training problem in Reinforcement Learning Toolbox

Question

DQ LEE, June 28, 2023



The documentation for the PPO training algorithm states: “For each experience sequence that does not contain a terminal state, N is equal to the ExperienceHorizon option value. Otherwise, N is less than ExperienceHorizon and S_N is the terminal state.”

My question: when N is smaller than ExperienceHorizon and also smaller than the mini-batch size, and this continues for multiple consecutive episodes, when does the algorithm update the parameters?

A second question: when are the PPO parameters updated under the following settings?

% PPO agent options from the question
agentOpts = rlPPOAgentOptions(...
    'ExperienceHorizon',10000,...
    'MiniBatchSize',64,...
    'NumEpoch',3);

% Training options: each episode is capped at 30 steps
trainOpts = rlTrainingOptions(...
    'MaxEpisodes',10000,...
    'MaxStepsPerEpisode',30);


Answer 1

Takeshi Takahashi, July 5, 2023


When N is smaller than ExperienceHorizon and N is also smaller than MiniBatchSize, the PPO agent uses N experiences to update its parameters at the end of the episode.

So, if MaxStepsPerEpisode = 30, ExperienceHorizon = 10000, and MiniBatchSize = 64, the PPO agent uses at most 30 experiences (fewer when the episode terminates early) to update its parameters at the end of each episode.
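To make the arithmetic concrete, here is a minimal sketch that only restates the rule from this answer in plain MATLAB. It is not the toolbox's internal code. The episode lengths in episodeSteps are hypothetical, and the variable names are chosen for illustration; the sketch assumes an experience sequence ends at episode end or after ExperienceHorizon steps, whichever comes first.

```matlab
% Sketch of the rule described in the answer (not toolbox source code).
experienceHorizon  = 10000;  % value from the question's rlPPOAgentOptions
miniBatchSize      = 64;     % value from the question's rlPPOAgentOptions
maxStepsPerEpisode = 30;     % value from the question's rlTrainingOptions

% Hypothetical episode lengths: some full-length, some terminated early.
episodeSteps = [30 30 12 30 7];
assert(all(episodeSteps <= maxStepsPerEpisode));

% N per episode: the horizon is never reached here because 30 < 10000.
N = min(episodeSteps, experienceHorizon);

% Experiences available to one update: capped by N when N < MiniBatchSize.
usedPerUpdate = min(N, miniBatchSize);

for k = 1:numel(episodeSteps)
    fprintf('Episode %d: N = %d, update at episode end with %d experiences\n', ...
        k, N(k), usedPerUpdate(k));
end
```

With these settings every update sees fewer than 64 experiences, so MiniBatchSize never becomes the limiting factor; the episode length does.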

Comments: 2

轩, December 31, 2023

So what determines the value of N when the episode is stopped neither by reaching ExperienceHorizon nor by a terminal state?

Thank you in advance for your explanation.

轩, December 31, 2023

I may have found the answer on the documentation page "Create Policies and Value Functions" (MATLAB & Simulink, MathWorks Benelux):

"When using PG agents, the learning trajectory length (that is the sequence of input data that the network uses for learning) for the RNN is the whole episode. For an AC agent, the NumStepsToLookAhead property of its options object is treated as the training trajectory length (except when training in parallel, in which case NumStepsToLookAhead is ignored and the whole episode is used as trajectory length). For a PPO agent, the trajectory length is the MiniBatchSize property of its options object."
