High fluctuation in Q0 value for a TD3 agent during training.

6 views (last 30 days)
James Sorokhaibam on 12 May 2024
Answered: Ronit on 23 May 2024
I am training a TD3 RL agent for a pick-and-place robot. The reward function is reward = exp(-E/d), where E is the total energy consumed when the trajectory is complete and d is the distance of the object from the end-effector. Training went smoothly with a DQN agent, but it fails when DDPG or TD3 is used. What could be the reason for this? I used the following code for agent creation.
% Observation and action specifications
obsInfo = rlNumericSpec([34 1]);
actInfo = rlNumericSpec([14 1], ...
    LowerLimit=-1, ...
    UpperLimit=1);

% Custom environment defined by its step and reset functions
env = rlFunctionEnv(obsInfo,actInfo,"KondoStepFunction","KondoResetFunction");

% TD3 agent with default actor and critic networks
agent = rlTD3Agent(obsInfo,actInfo);

Answers (1)

Ronit on 23 May 2024
Hello James,
To understand why the fluctuations differ so much across RL agents, we first need to understand how these agents work.
  • The primary difference between DQN and agents like DDPG and TD3 is that DQN is a purely value-based method, whereas DDPG and TD3 use the actor-critic method. DQN also only supports discrete action spaces, while DDPG and TD3 act directly on continuous ones (see the sketch after this list).
  • The DQN network predicts the Q value for each state-action pair, so it is a single model. DDPG, on the other hand, has a critic model that estimates the Q value but uses a separate actor model to choose the action. Hence, DDPG tries to learn the policy directly, whereas DQN learns Q values from which the policy, generally an epsilon-greedy one, is derived.
  • So, training an agent with DDPG or TD3 must be done more carefully, not only because its learning can be unstable, but because the number of hyperparameters to fine-tune is roughly double that of DQN.
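To make the action-space difference concrete, here is a minimal sketch; the three-element discretization is a hypothetical example, not something taken from your setup.
obsInfo = rlNumericSpec([34 1]);

% DQN needs a finite action set, so continuous actions must be discretized.
discreteActInfo = rlFiniteSetSpec([-1 0 1]);   % hypothetical scalar action set
dqnAgent = rlDQNAgent(obsInfo,discreteActInfo);

% TD3 (and DDPG) act on the continuous action specification directly.
contActInfo = rlNumericSpec([14 1],LowerLimit=-1,UpperLimit=1);
td3Agent = rlTD3Agent(obsInfo,contActInfo);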
Here are a few suggestions that can help you get good results with TD3 or DDPG agents:
  1. Tune Hyperparameters: Adjust the learning rates, replay buffer size, and exploration noise (a starting-point sketch follows this list).
  2. Normalize Rewards: Consider scaling your reward to reduce variability and improve learning stability.
  3. Monitor Training: Use diagnostics, such as the Episode Manager plots, to better understand the action, reward, and learning dynamics (see the training sketch at the end of this answer).
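For items 1 and 2, here is a minimal sketch using rlTD3AgentOptions; the specific values are illustrative assumptions to tune for your task, not recommended settings.
% Agent options for a more conservative TD3 setup (values are assumptions).
agentOpts = rlTD3AgentOptions( ...
    DiscountFactor=0.99, ...
    MiniBatchSize=256, ...
    ExperienceBufferLength=1e6, ...
    TargetSmoothFactor=5e-3);

% Lower learn rates and gradient clipping often reduce Q-value fluctuation.
agentOpts.ActorOptimizerOptions.LearnRate = 1e-4;
agentOpts.CriticOptimizerOptions.LearnRate = 1e-3;
agentOpts.ActorOptimizerOptions.GradientThreshold = 1;
agentOpts.CriticOptimizerOptions.GradientThreshold = 1;

% Decay the Gaussian exploration noise so actions become less random over time.
agentOpts.ExplorationModel.StandardDeviation = 0.1;
agentOpts.ExplorationModel.StandardDeviationDecayRate = 1e-4;

agent = rlTD3Agent(obsInfo,actInfo,agentOpts);
As for reward scaling, note that exp(-E/d) already lies in (0,1]; if it collapses toward zero for most trajectories, a tunable temperature such as exp(-E/(k*d)), with a hypothetical scale factor k, spreads the reward over a wider range.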
Adjusting these aspects can help mitigate the high fluctuation and improve your TD3 agent's training performance.
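For item 3, here is a minimal monitoring sketch; the episode counts and stopping criteria are illustrative assumptions.
trainOpts = rlTrainingOptions( ...
    MaxEpisodes=2000, ...
    MaxStepsPerEpisode=500, ...
    ScoreAveragingWindowLength=20, ...
    Plots="training-progress", ... % Episode Manager shows episode reward and Q0
    StopTrainingCriteria="AverageReward", ...
    StopTrainingValue=0.95);
trainStats = train(agent,env,trainOpts);
Comparing the episode Q0 curve against the episode reward in Episode Manager is a good first diagnostic: Q0 estimates that diverge from the observed returns usually point to a learn rate or exploration-noise problem.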
Hope this helps!
