Why is my DDPG agent converging to a state where it receives continuous penalization, when there is a state it could reach with zero penalization?

I am training a Reinforcement Learning DDPG agent to drive a vehicle to a reference.
The vehicle dynamics are:
  • x_dot = v*cos(psi);
  • y_dot = v*sin(psi);
  • psi_dot = w;
  • v_dot = a;
With observations obs = [e_x, e_y, e_psi, e_v] and actions u = [w (psi_dot); a (v_dot)], my DDPG agent fails to reach the reference with zero error.
The reward at each step is rwd = -(x^T*Q*x + u^T*R*u) (the same form as the LQR cost function, so I can compare the two).
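For reference, a minimal sketch of what one environment step and this reward look like in MATLAB (the sample time Ts, the weights Q and R, the reference xRef, and the example state/action values below are placeholders, not the original code):

% Minimal sketch of one environment step and the quadratic reward.
% Ts, Q, R, xRef and the example state/action values are assumptions.
Ts   = 0.1;                           % sample time [s]
Q    = diag([1 1 1 1]);               % error weights (example values)
R    = diag([0.1 0.1]);               % action weights (example values)
xRef = [5; 5; 0; 0];                  % constant reference [x; y; psi; v]

state = [0; 0; 0; 1];                 % current [x; y; psi; v]
u     = [0.1; 0.5];                   % action [w; a] from the actor

x_dot = [state(4)*cos(state(3));      % x_dot   = v*cos(psi)
         state(4)*sin(state(3));      % y_dot   = v*sin(psi)
         u(1);                        % psi_dot = w
         u(2)];                       % v_dot   = a
state = state + Ts*x_dot;             % forward-Euler integration

err = state - xRef;                   % obs = [e_x; e_y; e_psi; e_v]
rwd = -(err.'*Q*err + u.'*R*u)        % LQR-style step reward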
No matter how I tune the hyper-parameters or make my actor and critic networks more or less complex, the gap is always there.
On a hunch, I tried removing all the biases from the neurons of my networks, building actor and critic networks that only have weights, and that actually solved the problem: all the trainings, with different hyperparameters, drove the error to 0.
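For reference, one way such a bias-free actor could be built in MATLAB might look like the sketch below (the layer sizes, numObs, numAct and actionMax are placeholders, not my actual networks):

numObs = 4;  numAct = 2;
actionMax = [1; 2];                                   % placeholder action limits

actorLayers = [
    featureInputLayer(numObs)
    fullyConnectedLayer(64, 'Bias', zeros(64,1), 'BiasLearnRateFactor', 0)
    reluLayer
    fullyConnectedLayer(64, 'Bias', zeros(64,1), 'BiasLearnRateFactor', 0)
    reluLayer
    fullyConnectedLayer(numAct, 'Bias', zeros(numAct,1), 'BiasLearnRateFactor', 0)
    tanhLayer
    scalingLayer('Scale', actionMax)                  % map tanh output to the action range
    ];
% Setting Bias to zeros and BiasLearnRateFactor to 0 keeps each bias frozen at
% zero during training, so only the weights are learned.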
What I wanted to ask is:
1 - Why does removing the biases solve the problem?
2 - Although it drives the errors to 0, removing the bias terms results in degraded performance compared to an agent with bias terms that I luckily obtained by stopping the training at a moment where the weights happened to drive the error to 0 (if I had let the training run for 10 more episodes the gap would have appeared again, which is why I can't rely on that agent; there is no consistency). How can I get an agent with bias terms to drive the error to 0?
I would really appreciate it if anyone could answer me, because I can't seem to find an explanation for this.

Answers (1)

Emmanouil Tzorakoleftherakis on 20 Feb 2024
My guess is that this happens due to the specifics of the problem. You want to build a controller that generates zero actions when the error inputs are zero. Removing the biases happens to make this much easier, assuming your actor is a feedforward net: think of Y = W*X + B. If X is close to zero, Y will be close to zero even if W is not perfectly optimized; B, however, will completely shift the signal.
By the way, your reference here is constant - it would be much harder to achieve the same with a time-varying reference. In general it is much harder to consistently achieve zero tracking error with RL compared to a more traditional controller, because you would need to do a lot of training on 'low-error' inputs.
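To make this concrete, a quick toy example (arbitrary numbers, not your actual networks) shows how a leftover bias keeps producing a nonzero action even when the observation is exactly zero:

W = randn(2,4);           % imperfectly-trained weights of the output layer
b = 0.1*randn(2,1);       % small residual bias left after training
x = zeros(4,1);           % observation with zero tracking error

u_noBias   = W*x          % = [0; 0]  -> zero action at zero error
u_withBias = W*x + b      % = b       -> nonzero action, so the agent keeps drifting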
