
Episode Q0 reward becoming extremely huge

Views: 16 (last 30 days)
Muhammad Nadeem on 18 October 2023
Commented: Muhammad Nadeem on 16 November 2023
Hello everyone,
I am building an LQR-type controller. The overall reward should be negative (the negative of the LQR quadratic cost function); however, during training the 'Episode Q0' value comes out positive and extremely large. Any ideas what's going wrong?
My summarized code is as follows:
%% Critic neural network
obsPath = featureInputLayer(obsInfo.Dimension(1),Name="obsIn");
actPath = featureInputLayer(actInfo.Dimension(1),Name="actIn");
commonPath = [
concatenationLayer(1,2,Name="concat")
quadraticLayer
fullyConnectedLayer(1,Name="value", ...
BiasLearnRateFactor=0,Bias=0)
];
% Add layers to layerGraph object
criticNet = layerGraph(obsPath);
criticNet = addLayers(criticNet,actPath);
criticNet = addLayers(criticNet,commonPath);
% Connect layers
criticNet = connectLayers(criticNet,"obsIn","concat/in1");
criticNet = connectLayers(criticNet,"actIn","concat/in2");
criticNet = dlnetwork(criticNet);
critic = rlQValueFunction(criticNet, ...
obsInfo,actInfo, ...
ObservationInputNames="obsIn",ActionInputNames="actIn");
getValue(critic,{rand(obsInfo.Dimension)},{rand(actInfo.Dimension)})
%% Actor neural network
Biass = zeros(6,1); % zero, non-learnable bias so the actor stays linear
actorNet = [
featureInputLayer(obsInfo.Dimension(1))
fullyConnectedLayer(actInfo.Dimension(1), ...
BiasLearnRateFactor=0,Bias=Biass)
];
actorNet = dlnetwork(actorNet);
actor = rlContinuousDeterministicActor(actorNet,obsInfo,actInfo);
agent = rlDDPGAgent(actor,critic);
getAction(agent,{rand(obsInfo.Dimension)})
%% create agent
agent = rlDDPGAgent(actor,critic);
agent.AgentOptions.SampleTime = env.Ts;
agent.AgentOptions.ExperienceBufferLength = 1e6;
agent.AgentOptions.MiniBatchSize = 128;
%agent.AgentOptions.NoiseOptions.StandardDeviation = 0.3;
agent.AgentOptions.NoiseOptions.StandardDeviation = [0.3; 0.9; 0.2; 0.8; 0.2; 0.6];
agent.AgentOptions.NoiseOptions.StandardDeviationDecayRate = 1e-3;
agent.AgentOptions.ActorOptimizerOptions.LearnRate = 0.1;
agent.AgentOptions.ActorOptimizerOptions.L2RegularizationFactor = 1e-4;
agent.AgentOptions.ActorOptimizerOptions.GradientThreshold = 1;
agent.AgentOptions.CriticOptimizerOptions.L2RegularizationFactor = 1e-4;
agent.AgentOptions.CriticOptimizerOptions.LearnRate = 0.3;
agent.AgentOptions.CriticOptimizerOptions.GradientThreshold = 1;
%% Initialize Agent Parameters
% Create diagonal matrix with negative eigenvalues
W = -(eye(SYS.nx_o+SYS.nu)+0.1);
idx = triu(true(SYS.nx_o+SYS.nu));
par = getLearnableParameters(agent);
par.Actor{1} = -ones(SYS.nu,SYS.nx_o);
par.Critic{1} = W(idx)';
setLearnableParameters(agent,par);
% Check the agent with a random observation input.
getAction(agent,{rand(obsInfo.Dimension)})
%% Train Agent
trainOpts = rlTrainingOptions(...
MaxEpisodes=5000, ...
MaxStepsPerEpisode=100, ...
Verbose=false, ...
Plots="training-progress",...
StopTrainingCriteria="AverageReward",...
StopTrainingValue=-0.65);
doTraining = true;
if doTraining
% Train the agent.
trainingStats = train(agent,env,trainOpts);
else
% Load the pretrained agent for the example.
load("Agent_NDAE.mat","agent");
end
%%
Thanks,
4 Comments
Zaid Jaber on 16 November 2023
Did you get any answer, @Muhammad Nadeem?
Muhammad Nadeem on 16 November 2023
Nope, not yet


Answers (1)

Shivansh on 25 October 2023
Hi Muhammad!
I understand that you are building an LQR-type controller and getting a huge positive value for 'Episode Q0'.
In an LQR (Linear Quadratic Regulator) problem, the goal is to minimize a quadratic cost function, so with a nonpositive reward the critic's value estimates should also be nonpositive. A few areas in the provided code might be causing the huge positive Episode Q0 value. Here are some suggestions:
  1. Try different weight-initialization strategies, such as Xavier or He initialization, to see if they improve training stability.
  2. The critic learning rate is set to 0.3 and the actor learning rate to 0.1, which are quite aggressive for DDPG; try smaller values and observe the effect.
  3. Although the average reward and episode rewards are negative as expected, it is worth double-checking the scaling of the reward values relative to the network outputs.
  4. The experience buffer length is currently 1e6, which should normally be enough, but it may not suit your problem; try different values and observe the effect on training. If the buffer is too small, it might not capture enough diverse experiences, leading to suboptimal training.
  5. The exploration noise added to the actor's actions can also influence training. You have set the standard deviation for each action dimension individually; check that the noise magnitudes are appropriate for your problem and not too large, as excessive noise can cause unstable training and large rewards.
Review these aspects in your code and experiment with different configurations to identify the cause of the positive and extremely large Episode Q0 reward. Additionally, monitoring the critic's output during training and analysing the gradients can provide insights into the network's behaviour and potential issues.
If the issue persists, provide more information about the problem you are trying to solve, including the environment and specific requirements.
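To make suggestion 2 concrete, here is a sketch, assuming the `agent`, `env`, and `obsInfo` variables from your code, of lowering both learning rates and probing the critic's output scale directly; the specific values are illustrative starting points, not tuned for your problem:

```matlab
% Illustrative, untuned learning rates -- much smaller than 0.3 / 0.1.
agent.AgentOptions.CriticOptimizerOptions.LearnRate = 1e-3;
agent.AgentOptions.ActorOptimizerOptions.LearnRate = 1e-4;

% Probe the critic at a random observation to watch its magnitude:
obs = {rand(obsInfo.Dimension)};
act = getAction(agent,obs);
q = getValue(getCritic(agent),obs,act)  % for an LQR-style cost this should stay negative
```

Logging this value every few episodes makes it easy to see whether the critic is drifting to large positive values before the training plot does.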
Hope it helps!
1 Comment
Muhammad Nadeem on 31 October 2023
Hello Shivansh,
Thank you for the details. I have tried all of your suggestions but the problem still persists. The Episode Q0 value just doesn't make sense; it becomes extremely huge and positive. What does Episode Q0 signify, and can it be ignored? To my knowledge it is a metric that tells how good the critic is given the initial observation of the environment.
I am attaching the details of my codes also if you want to have a look at them, please find it in the attachments.
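For reference, the quantity the training plot labels "Episode Q0" can be reproduced by hand, which makes it easier to see when the critic itself has diverged. A minimal sketch, assuming the `env` and `agent` variables from the question:

```matlab
% Evaluate the critic at an episode's initial observation, paired with
% the actor's action -- this is what Episode Q0 reports.
obs0 = reset(env);                            % initial observation
act0 = getAction(agent,{obs0});               % deterministic actor output
Q0 = getValue(getCritic(agent),{obs0},act0)
% For a reward of the form -(x'Qx + u'Ru), Q0 should be nonpositive once
% the critic is reasonable. A huge positive Q0 indicates the critic's
% estimate has diverged, so it is a symptom worth fixing rather than a
% number that can safely be ignored.
```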


Release: R2021a
