High fluctuation in Q0 value for a TD3 agent during training.

6 views (last 30 days)
James Sorokhaibam on 12 May 2024
Answered: Ronit on 23 May 2024
I am training a TD3 RL agent for a pick-and-place robot. The reward function is reward = exp(-E/d), where E is the total energy consumed when the trajectory is complete and d is the distance of the object from the end-effector. Training went smoothly with a DQN agent, but it fails when DDPG or TD3 is used. What could be the reason for this? I used the following code for agent creation.
% Observation and action specifications
obsInfo = rlNumericSpec([34 1]);
actInfo = rlNumericSpec([14 1], ...
    LowerLimit=-1, ...
    UpperLimit=1);

% Custom environment defined by its step and reset functions
env = rlFunctionEnv(obsInfo,actInfo,"KondoStepFunction","KondoResetFunction");

% TD3 agent with default actor and critic networks
agent = rlTD3Agent(obsInfo,actInfo);

Answers (1)

Ronit on 23 May 2024
Hello James,
To understand why the fluctuations differ so much across RL agents, we first need to understand how these agents work.
  • The primary difference between DQN and agents like DDPG and TD3 is that DQN is a purely value-based method, whereas DDPG and TD3 use the actor-critic method. DQN also only supports discrete action spaces, while DDPG and TD3 act directly on continuous ones (see the sketch after this list).
  • The DQN network predicts the Q value for each state-action pair, so it is a single model. DDPG, on the other hand, has a critic model that estimates the Q value but uses a separate actor model to choose the action. Hence, DDPG tries to learn the policy directly, whereas DQN learns Q values from which the policy, generally an epsilon-greedy one, is derived.
  • So, training an agent with DDPG or TD3 must be done more carefully, not only because its learning can be unstable, but because the number of hyperparameters to fine-tune is roughly double that of DQN.
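To make the action-space difference concrete, here is a minimal sketch; the three-element discretization is a hypothetical example, not something taken from your setup.
obsInfo = rlNumericSpec([34 1]);

% DQN needs a finite action set, so continuous actions must be discretized.
discreteActInfo = rlFiniteSetSpec([-1 0 1]);   % hypothetical scalar action set
dqnAgent = rlDQNAgent(obsInfo,discreteActInfo);

% TD3 (and DDPG) act on the continuous action specification directly.
contActInfo = rlNumericSpec([14 1],LowerLimit=-1,UpperLimit=1);
td3Agent = rlTD3Agent(obsInfo,contActInfo);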
Here are a few suggestions that can help you get good results with TD3 or DDPG agents:
  1. Tune Hyperparameters: Adjust the learning rates, replay buffer size, and exploration noise (a starting-point sketch follows this list).
  2. Normalize Rewards: Consider scaling your reward to reduce variability and improve learning stability.
  3. Monitor Training: Use diagnostics, such as the Episode Manager plots, to better understand the action, reward, and learning dynamics (see the training sketch at the end of this answer).
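For items 1 and 2, here is a minimal sketch using rlTD3AgentOptions; the specific values are illustrative assumptions to tune for your task, not recommended settings.
% Agent options for a more conservative TD3 setup (values are assumptions).
agentOpts = rlTD3AgentOptions( ...
    DiscountFactor=0.99, ...
    MiniBatchSize=256, ...
    ExperienceBufferLength=1e6, ...
    TargetSmoothFactor=5e-3);

% Lower learn rates and gradient clipping often reduce Q-value fluctuation.
agentOpts.ActorOptimizerOptions.LearnRate = 1e-4;
agentOpts.CriticOptimizerOptions.LearnRate = 1e-3;
agentOpts.ActorOptimizerOptions.GradientThreshold = 1;
agentOpts.CriticOptimizerOptions.GradientThreshold = 1;

% Decay the Gaussian exploration noise so actions become less random over time.
agentOpts.ExplorationModel.StandardDeviation = 0.1;
agentOpts.ExplorationModel.StandardDeviationDecayRate = 1e-4;

agent = rlTD3Agent(obsInfo,actInfo,agentOpts);
As for reward scaling, note that exp(-E/d) already lies in (0,1]; if it collapses toward zero for most trajectories, a tunable temperature such as exp(-E/(k*d)), with a hypothetical scale factor k, spreads the reward over a wider range.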
Adjusting these aspects can help mitigate the high fluctuation and improve your TD3 agent's training performance.
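For item 3, here is a minimal monitoring sketch; the episode counts and stopping criteria are illustrative assumptions.
trainOpts = rlTrainingOptions( ...
    MaxEpisodes=2000, ...
    MaxStepsPerEpisode=500, ...
    ScoreAveragingWindowLength=20, ...
    Plots="training-progress", ... % Episode Manager shows episode reward and Q0
    StopTrainingCriteria="AverageReward", ...
    StopTrainingValue=0.95);
trainStats = train(agent,env,trainOpts);
Comparing the episode Q0 curve against the episode reward in Episode Manager is a good first diagnostic: Q0 estimates that diverge from the observed returns usually point to a learn rate or exploration-noise problem.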
Hope this helps!
