When PPO and TRPO agents output continuous actions in the Reinforcement Learning Designer app, the action values fall outside the specified range

宝 on 17 Feb 2023
Answered: Aiswarya on 27 Feb 2024
% Open the model
mdl = 'FCEV';
blk = 'FCEV/RL Agent';
% open_system(mdl);

% Observation specification (state s)
obsInfo = rlNumericSpec([3 1]);
obsInfo.Name = 'observations';
obsInfo.Description = 'A, B, and C';
numObservations = obsInfo.Dimension(1);

% Action specification (action a), bounded to [0, 35]
actInfo = rlNumericSpec([1 1],'LowerLimit',0,'UpperLimit',35);
actInfo.Name = 'Pbat_factor';
numActions = actInfo.Dimension(1);

% Create the Simulink environment
env = rlSimulinkEnv(mdl,blk,obsInfo,actInfo);
The required action range is a continuous output between 0 and 35, but when I train with the PPO or TRPO algorithm in the Reinforcement Learning Designer app, the output actions are not within this range, and the agent even keeps outputting negative values. Is this caused by the network architecture settings, or is there some other reason?

Answers (1)

Aiswarya on 27 Feb 2024
(Please note that I will be answering the question in English.)
I understand that you are using a PPO/TRPO agent and have set the "LowerLimit" and "UpperLimit" of the action space in "actInfo" using the "rlNumericSpec" function, but the output action values are not within the range you specified.
Whether action bounds are respected depends on the type of agent. PPO and TRPO are both on-policy agents, and for continuous action spaces they do not enforce the constraints set in the action specification (defined with "rlNumericSpec"). If you want to enforce these limits, you have to do so explicitly on the environment side.
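For example (a minimal sketch, not from the original thread), you could place a Saturation block between the RL Agent block and the rest of the Simulink model, with the lower limit set to 0 and the upper limit set to 35, or clamp the action in MATLAB code before it is applied:
% Hypothetical clamp on the environment side: PPO/TRPO may emit values
% outside [0, 35], so saturate the raw action before using it.
rawAction = -3.7;                    % example out-of-range agent output
action = min(max(rawAction, 0), 35); % now guaranteed to lie in [0, 35]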
You may refer to the "rlTRPOAgent" documentation, which also notes that action bounds need to be enforced by the user within the environment.
Also note that this is not the case for agents such as SAC, for which the action bounds specified with "rlNumericSpec" are enforced.
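As an illustration of that difference (a sketch assuming the obsInfo/actInfo specifications from the question), a default SAC agent created directly from the same specifications scales its continuous actions to the declared limits, so no extra clamping is needed:
% Sketch: SAC enforces the [0, 35] bounds declared in actInfo,
% so its output actions stay within range by construction.
agent = rlSACAgent(obsInfo, actInfo);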

Release

R2022b
