Question about convergence of Q0 and average reward

6 views (last 30 days)
Kun Cheng on 22 Jul 2023
Answered: Rishi on 4 Jan 2024
Hi guys,
I am training a DDQN agent. The average reward converges after about 1000 episodes, but the Q0 value needs about 3000 more episodes to converge. Can I stop the training once the average reward has converged?
If not, how can I accelerate the convergence of Q0? As far as I know, Q0 is the prediction of the target critic. How can I change the frequency of the target critic update?
My second question: Q0 converges to a value larger than the maximum reward. How can I fix this?
For example, the maximum reward is around -6, but Q0 is -3.
1 Comment
Ayush Aniket on 24 Aug 2023
Can you share your code? I will try to replicate it on my end. It will help me answer the question.


Accepted Answer

Rishi on 4 Jan 2024
Hi Kun,
I understand from your query that you want to know whether you can stop training after the average reward converges, and how to accelerate the convergence of Q0.
Stopping training after the convergence of the average reward might be tempting, but it's essential to ensure that your Q-values (such as Q0) have also converged. The Q-values represent the expected return of taking an action in a given state and following a particular policy thereafter. If these values have not converged, it might mean that your agent hasn't learned the optimal policy yet, even if the average reward seems stable.
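For reference, the Episode Q0 value reported during training is the critic's estimate of the discounted long-term reward from the initial observation of each episode, i.e. the standard action-value definition:

Q^{\pi}(s_0, a_0) = \mathbb{E}\left[ \sum_{t \ge 0} \gamma^{t} r_{t+1} \,\middle|\, s_0, a_0, \pi \right]

If the critic were accurate, Q0 for the starting state would roughly match the discounted return the episode actually collects, which is why a Q0 that sits above the best achievable return points to a biased estimate rather than a better policy.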
To accelerate the convergence of Q0, you can try the following steps:
  • Learning rate adjustment: You can try adjusting the learning rate of your optimizer. A smaller learning rate leads to more stable but slower convergence, whereas a larger learning rate speeds up convergence but might overshoot the optimal values. You can change the critic learning rate of the agent as follows (a fuller sketch of the related options follows the code):
agent.AgentOptions.CriticOptimizerOptions.LearnRate = lr;  % lr is your chosen learning rate
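Here is a minimal sketch of how these options fit together for a DDQN agent. The property names below come from rlDQNAgentOptions and rlOptimizerOptions (available names can vary slightly between releases), and the numeric values are placeholders to tune for your problem:

criticOpts = rlOptimizerOptions( ...
    LearnRate = 1e-3, ...              % larger -> faster but less stable critic updates
    GradientThreshold = 1, ...         % gradient clipping for stability
    L2RegularizationFactor = 1e-4);    % mild L2 regularization on the critic weights

agentOpts = rlDQNAgentOptions( ...
    UseDoubleDQN = true, ...           % DDQN already reduces some overestimation
    CriticOptimizerOptions = criticOpts, ...
    TargetSmoothFactor = 1, ...        % copy the target critic instead of smoothing it
    TargetUpdateFrequency = 4);        % ...and do that copy every 4 learning steps

% agentOpts would be passed to rlDQNAgent when (re)creating the agent.
% You can also try changing an existing agent in place (if your release
% does not allow this, recreate the agent with the new options):
agent.AgentOptions.TargetSmoothFactor = 1;
agent.AgentOptions.TargetUpdateFrequency = 4;

TargetSmoothFactor together with TargetUpdateFrequency controls how often (and how hard) the target critic is refreshed, which is the target update frequency you asked about.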
In addition to these, you can try other reinforcement learning techniques such as reward shaping, exploration strategies, and regularization techniques like dropout or L2 regularization (an exploration example follows below).
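As one concrete exploration example, slowing down the epsilon decay keeps a DQN/DDQN agent exploring longer, which often stabilizes the critic estimates. This is a sketch using the EpsilonGreedyExploration options of rlDQNAgentOptions, with placeholder values:

agent.AgentOptions.EpsilonGreedyExploration.Epsilon = 1;          % start fully exploratory
agent.AgentOptions.EpsilonGreedyExploration.EpsilonMin = 0.01;    % keep some exploration late in training
agent.AgentOptions.EpsilonGreedyExploration.EpsilonDecay = 1e-4;  % smaller value -> slower decay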
If the Q0 value is converging to a value greater than the maximum possible reward, this might be a sign of overestimation bias. To address this, you can try the following methods:
  • Clipping rewards: Clip the rewards during training to prevent excessively high values (see the step-function sketch after this list).
  • Huber Loss: Instead of mean squared error for the loss function, try using Huber loss, which is less sensitive to outliers.
  • Regularization: Use regularization techniques to prevent the network from assigning excessively high values to its Q-estimates.
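As a sketch of the reward-clipping idea, assuming the environment is defined with rlFunctionEnv: myStepFunction, myDynamics, and the clipping range [-10, 0] are hypothetical and need to match your own environment and reward scale.

function [nextObs, reward, isDone, loggedSignals] = myStepFunction(action, loggedSignals)
    % Your existing environment dynamics (hypothetical helper)
    [nextObs, rawReward, isDone, loggedSignals] = myDynamics(action, loggedSignals);
    % Clip the reward before the agent sees it, e.g. to [-10, 0]
    reward = min(max(rawReward, -10), 0);
end

Clipping at the environment level bounds the targets the critic regresses on, which directly limits how far Q0 can drift above the true return.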
Hope this helps!

More Answers (0)
