
Action outputs exceed the specified range of the action space during PPO reinforcement learning

24 views (last 30 days)
郭欣 on 15 Aug 2023
Commented: 郭欣 on 22 Aug 2023
When training a reinforcement learning agent with the PPO method, my action space is set up as follows:
numAct = 1;
actInfo = rlNumericSpec([numAct 1],"LowerLimit",-1,"UpperLimit",1);
actInfo.Name = "kvalue";
However, while monitoring the training with a Scope, I found that the action output goes far beyond this range. What causes this?

Accepted Answer

Aiswarya on 22 Aug 2023
(Please note that I will be answering the question in English.)
I understand that you are using a PPO agent and have set the LowerLimit and UpperLimit of the action space in actInfo using rlNumericSpec, but the action output is clearly not within the range you specified.
Whether the action bounds are respected depends on the type of agent. For continuous action spaces, on-policy agents such as PPO do not enforce the constraints set in the action specification (defined with rlNumericSpec). If you want to enforce these limits, you have to do it explicitly on the environment side, as in the sketch below.
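For a custom MATLAB environment, a minimal sketch of that clipping could look like the following (the updatePlant helper is hypothetical, standing in for your own plant update):
% Minimal sketch, assuming a custom environment subclassing rl.env.MATLABEnvironment.
function [NextObs, Reward, IsDone, LoggedSignals] = step(this, Action)
    % Saturate the action to the limits declared in rlNumericSpec before
    % applying it to the plant, since the PPO agent will not do this.
    Action = max(min(Action, 1), -1);   % clip to [-1, 1]
    % Advance the plant with the clipped action (updatePlant is a placeholder).
    [NextObs, Reward, IsDone, LoggedSignals] = updatePlant(this, Action);
end
If your environment is a Simulink model (the Scope suggests it is), the equivalent is to place a Saturation block on the action signal before it enters the plant.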
This is because these agents sample exploration actions from a Gaussian distribution whose mean and standard deviation are outputs of the actor network. The mean is bounded by tanh and scaling layers, but the actions themselves are drawn from the unbounded Gaussian distribution, so they can fall outside the limits during training. One option is to set agent.UseExplorationPolicy = false after training, so that the agent uses only the mean and the actions always stay within the limits, as shown below.
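As a minimal sketch of what that could look like after training (assuming agent and env are your trained rlPPOAgent and your environment):
% Disable exploration so the agent acts with the mean only, without sampling.
agent.UseExplorationPolicy = false;
simOptions = rlSimulationOptions("MaxSteps", 500);   % illustrative value
experience = sim(env, agent, simOptions);   % actions now stay in [-1, 1]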
The rlPPOAgent documentation also mentions that action bounds need to be enforced by the user within the environment.
Also note that this is not the case with agents like SAC, for which the action bounds can be enforced with `rlNumericSpec`.
I hope this helps.
Regards,
Aiswarya
1 Comment
郭欣 on 22 Aug 2023
I got it! Thanks!!!


More Answers (0)
