Question about convergence of Q0 and average reward

6 views (last 30 days)
Kun Cheng on 22 Jul 2023
Answered: Rishi on 4 Jan 2024
Hi guys,
I am training a DDQN agent. The average reward converges after about 1000 episodes, but the Q0 value needs about 3000 more episodes to converge. Can I stop the training once the average reward has converged?
If not, how can I accelerate the convergence of Q0? As far as I know, Q0 is the prediction of the target critic. How can I change the frequency of the target critic update?
My second question: Q0 converges to a value larger than the maximum reward. How can I fix this?
For example, the maximum reward is around -6, but Q0 is -3.
1 Comment
Ayush Aniket on 24 Aug 2023
Can you share your code? I will try to replicate it on my end. It will help me answer the question.


Accepted Answer

Rishi on 4 Jan 2024
Hi Kun,
I understand from your query that you want to know whether you can stop training after the average reward converges, and how to accelerate the convergence of Q0.
Stopping training after the convergence of the average reward might be tempting, but it's essential to ensure that your Q-values (such as Q0) have also converged. The Q-values represent the expected return of taking an action in a given state and following a particular policy thereafter. If these values have not converged, it might mean that your agent hasn't learned the optimal policy yet, even if the average reward seems stable.
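For reference, the Episode Q0 value reported during training is the critic's estimate of the discounted long-term reward from the initial observation of each episode, i.e. the standard action-value definition:

Q^{\pi}(s_0, a_0) = \mathbb{E}\left[ \sum_{t \ge 0} \gamma^{t} r_{t+1} \,\middle|\, s_0, a_0, \pi \right]

If the critic were accurate, Q0 for the starting state would roughly match the discounted return the episode actually collects, which is why a Q0 that sits above the best achievable return points to a biased estimate rather than a better policy.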
To accelerate the convergence of Q0, you can try the following steps:
  • Learning rate adjustment: You can try adjusting the learning rate of your optimizer. A smaller learning rate leads to more stable but slower convergence, whereas a larger learning rate speeds up convergence but might overshoot the optimal values. You can change the critic learning rate of the agent as follows (a fuller sketch of the related options follows the code):
agent.AgentOptions.CriticOptimizerOptions.LearnRate = lr;  % lr is your chosen learning rate
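Here is a minimal sketch of how these options fit together for a DDQN agent. The property names below come from rlDQNAgentOptions and rlOptimizerOptions (available names can vary slightly between releases), and the numeric values are placeholders to tune for your problem:

criticOpts = rlOptimizerOptions( ...
    LearnRate = 1e-3, ...              % larger -> faster but less stable critic updates
    GradientThreshold = 1, ...         % gradient clipping for stability
    L2RegularizationFactor = 1e-4);    % mild L2 regularization on the critic weights

agentOpts = rlDQNAgentOptions( ...
    UseDoubleDQN = true, ...           % DDQN already reduces some overestimation
    CriticOptimizerOptions = criticOpts, ...
    TargetSmoothFactor = 1, ...        % copy the target critic instead of smoothing it
    TargetUpdateFrequency = 4);        % ...and do that copy every 4 learning steps

% agentOpts would be passed to rlDQNAgent when (re)creating the agent.
% You can also try changing an existing agent in place (if your release
% does not allow this, recreate the agent with the new options):
agent.AgentOptions.TargetSmoothFactor = 1;
agent.AgentOptions.TargetUpdateFrequency = 4;

TargetSmoothFactor together with TargetUpdateFrequency controls how often (and how hard) the target critic is refreshed, which is the target update frequency you asked about.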
In addition to these, you can try other reinforcement learning techniques such as reward shaping, exploration strategies, and regularization techniques like dropout or L2 regularization (an exploration example follows below).
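As one concrete exploration example, slowing down the epsilon decay keeps a DQN/DDQN agent exploring longer, which often stabilizes the critic estimates. This is a sketch using the EpsilonGreedyExploration options of rlDQNAgentOptions, with placeholder values:

agent.AgentOptions.EpsilonGreedyExploration.Epsilon = 1;          % start fully exploratory
agent.AgentOptions.EpsilonGreedyExploration.EpsilonMin = 0.01;    % keep some exploration late in training
agent.AgentOptions.EpsilonGreedyExploration.EpsilonDecay = 1e-4;  % smaller value -> slower decay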
If the Q0 value is converging to a value greater than the maximum possible reward, this might be a sign of overestimation bias. To address this, you can try the following methods:
  • Clipping rewards: Clip the rewards during training to prevent excessively high values (see the step-function sketch after this list).
  • Huber Loss: Instead of mean squared error for the loss function, try using Huber loss, which is less sensitive to outliers.
  • Regularization: Use regularization techniques to prevent the network from assigning excessively high values to its Q-estimates.
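As a sketch of the reward-clipping idea, assuming the environment is defined with rlFunctionEnv: myStepFunction, myDynamics, and the clipping range [-10, 0] are hypothetical and need to match your own environment and reward scale.

function [nextObs, reward, isDone, loggedSignals] = myStepFunction(action, loggedSignals)
    % Your existing environment dynamics (hypothetical helper)
    [nextObs, rawReward, isDone, loggedSignals] = myDynamics(action, loggedSignals);
    % Clip the reward before the agent sees it, e.g. to [-10, 0]
    reward = min(max(rawReward, -10), 0);
end

Clipping at the environment level bounds the targets the critic regresses on, which directly limits how far Q0 can drift above the true return.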
Hope this helps!

More Answers (0)
