Issue with Q0 Convergence during Training using PPO Agent

7 views (last 30 days)
Hi guys,
I have developed my model and trained it using a PPO agent. Overall, the training process has been successful. However, I have encountered an issue with the Q0 values. The maximum achievable reward is 6000, and I set training to stop at 98.5% of the maximum reward (5910).
During training, I noticed that the Q0 values did not converge as expected. In fact, they appear to be capped at 100, as shown in the figures. I am looking for an explanation of this behavior and trying to understand why the Q0 values do not reach the expected convergence.
My agent options are as follows:
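(The screenshot of the agent options is not reproduced here. For illustration, a typical rlPPOAgentOptions setup in the Reinforcement Learning Toolbox might look like the sketch below; every value, including the sample time Ts, is a placeholder rather than the poster's actual setting.)

Ts = 0.01;                               % assumed agent sample time (placeholder)
agentOpts = rlPPOAgentOptions( ...
    "SampleTime", Ts, ...
    "ExperienceHorizon", 512, ...        % steps collected before each learning phase
    "MiniBatchSize", 64, ...
    "ClipFactor", 0.2, ...               % PPO clipping parameter
    "EntropyLossWeight", 0.01, ...       % encourages exploration
    "NumEpoch", 3, ...
    "AdvantageEstimateMethod", "gae", ...
    "GAEFactor", 0.95, ...
    "DiscountFactor", 0.99);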
If anyone has any insights or explanations regarding the behavior of Q0 during training with the PPO agent, I would greatly appreciate your input. Your expertise and guidance would be invaluable in helping me understand and address this issue.
Thank you.
2 Comments
Emmanouil Tzorakoleftherakis on 10 Jul 2023
Can you share the code with the training options?
Muhammad Fairuz Abdul Jalal on 11 Jul 2023
Thanks @Emmanouil Tzorakoleftherakis for the reply.
As requested, here are snapshots of the code.
The actions are bounded between -1 and 1; however, in the model, each action has its own gain (a representative sketch of the action spec, critic, and actor follows below).
The critic
The actor
Training Options
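(The critic, actor, and training-options screenshots are likewise not reproduced here. Below is a minimal sketch of how a critic and a Gaussian actor for a continuous [-1, 1] action space are typically assembled for a PPO agent; the observation/action dimensions, layer sizes, and layer names are assumptions, not the original networks.)

% Hypothetical observation/action dimensions; the real ones come from the model
numObs = 8;
numAct = 2;
obsInfo = rlNumericSpec([numObs 1]);
actInfo = rlNumericSpec([numAct 1], "LowerLimit", -1, "UpperLimit", 1);   % actions bounded to [-1, 1]

% Critic: state-value function V(s)
criticNet = [
    featureInputLayer(numObs)
    fullyConnectedLayer(128)
    reluLayer
    fullyConnectedLayer(128)
    reluLayer
    fullyConnectedLayer(1)];
critic = rlValueFunction(criticNet, obsInfo);

% Actor: Gaussian policy with separate mean and standard-deviation heads
commonPath = [
    featureInputLayer(numObs, "Name", "obs")
    fullyConnectedLayer(128)
    reluLayer("Name", "commonRelu")];
meanPath = [
    fullyConnectedLayer(numAct, "Name", "fcMean")
    tanhLayer("Name", "mean")];            % keeps the mean within [-1, 1]
stdPath = [
    fullyConnectedLayer(numAct, "Name", "fcStd")
    softplusLayer("Name", "std")];         % keeps the standard deviation positive
actorNet = layerGraph(commonPath);
actorNet = addLayers(actorNet, meanPath);
actorNet = addLayers(actorNet, stdPath);
actorNet = connectLayers(actorNet, "commonRelu", "fcMean");
actorNet = connectLayers(actorNet, "commonRelu", "fcStd");
actor = rlContinuousGaussianActor(actorNet, obsInfo, actInfo, ...
    "ActionMeanOutputNames", "mean", ...
    "ActionStandardDeviationOutputNames", "std");

% Agent options (sketched earlier in the question) could be passed as a third argument
agent = rlPPOAgent(actor, critic);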
Thank you in advance. I really appreciate your help and support.

Accepted Answer

Emmanouil Tzorakoleftherakis on 11 Jul 2023
Edited: Emmanouil Tzorakoleftherakis on 12 Jul 2023
It seems you set the training to stop when the episode reward reaches the value of 0.985*(Tf/Ts)*3. I cannot comment on the value itself, but it is usually better to use the average reward as the stopping indicator, because it helps filter out outlier episodes.
Aside from that, in case it wasn't clear, the stopping criterion is not based on Q0 but on the light blue value (the individual episode reward) that you see in the plots you shared above. The value of Q0 improves as the critic gets better trained, but it does not necessarily need to "converge" in order to stop training. A better critic means more stable training, but at the end of the day you only care about your actor. This is usually why it takes a few trials to see which stopping criteria make sense.
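(As a concrete illustration of the suggestion above, a training-options sketch that stops on the averaged episode reward rather than a single episode's reward might look like the following; Tf, Ts, the averaging window, and the episode limits are placeholders, chosen only so that (Tf/Ts)*3 matches the 6000 maximum reward mentioned in the question.)

Tf = 20; Ts = 0.01;                                  % placeholder times: (Tf/Ts)*3 = 6000
trainOpts = rlTrainingOptions( ...
    "MaxEpisodes", 5000, ...
    "MaxStepsPerEpisode", ceil(Tf/Ts), ...
    "ScoreAveragingWindowLength", 20, ...            % window used for the average reward
    "StopTrainingCriteria", "AverageReward", ...     % stop on the averaged reward, not one episode
    "StopTrainingValue", 0.985*(Tf/Ts)*3, ...        % same 5910 target, applied to the average
    "Plots", "training-progress");
trainingStats = train(agent, env, trainOpts);        % env is the poster's environment (not shown)

With StopTrainingCriteria set to "AverageReward", training stops only when the reward averaged over the ScoreAveragingWindowLength most recent episodes reaches StopTrainingValue, which filters out single outlier episodes.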
1 Comment
Muhammad Fairuz Abdul Jalal on 11 Jul 2023
Thank you for highlighting the better way to set the stopping criteria. I will make the changes accordingly and update here soon.

More Answers (0)
