RL DDPG agent not converging

Haochen on 17 November 2024, 18:36
Hi,
I am training a DDPG agent to control a single cart that starts with an initial speed and moves along a horizontal axis. The RL agent acts as a controller that applies a force along the axis to drive the cart to the origin. It should not be a difficult task; however, after training for many steps, the control effect is still far from optimal.
These are my configurations for the agent and the environment. The optimal policy should be for the force to be zero at the origin, meaning the cart should no longer be moving once it reaches it.
The agent (actor-critic):
function [agents] = createDDPGAgents(N)
% Function to create two DDPG agents with the same observation and action info.
obsInfo = rlNumericSpec([2 1],'LowerLimit',-100*ones(2,1),'UpperLimit',100*ones(2,1));
actInfo = rlNumericSpec([N 1],'LowerLimit',-100*ones(N,1),'UpperLimit',100*ones(N,1));
% Define observation and action paths for critic
obsPath = featureInputLayer(prod(obsInfo.Dimension), Name="obsInLyr");
actPath = featureInputLayer(prod(actInfo.Dimension), Name="actInLyr");
% Define common path: concatenate along first dimension
commonPath = [
    concatenationLayer(1, 2, Name="concat")
    fullyConnectedLayer(30)
    reluLayer
    fullyConnectedLayer(1)
    ];
% Add paths to layerGraph network
criticNet = layerGraph(obsPath);
criticNet = addLayers(criticNet, actPath);
criticNet = addLayers(criticNet, commonPath);
% Connect paths
criticNet = connectLayers(criticNet, "obsInLyr", "concat/in1");
criticNet = connectLayers(criticNet, "actInLyr", "concat/in2");
% Plot the network
plot(criticNet)
% Convert to dlnetwork object
criticNet = dlnetwork(criticNet);
% Display the number of weights
summary(criticNet)
% Create the critic approximator object
critic = rlQValueFunction(criticNet, obsInfo, actInfo, ...
    ObservationInputNames="obsInLyr", ...
    ActionInputNames="actInLyr");
% Check the critic with random observation and action inputs
getValue(critic, {rand(obsInfo.Dimension)}, {rand(actInfo.Dimension)})
% Create a network to be used as underlying actor approximator
actorNet = [
    featureInputLayer(prod(obsInfo.Dimension))
    fullyConnectedLayer(30)
    tanhLayer
    fullyConnectedLayer(30)
    tanhLayer
    fullyConnectedLayer(prod(actInfo.Dimension))
    ];
% Convert to dlnetwork object
actorNet = dlnetwork(actorNet);
% Display the number of weights
summary(actorNet)
% Create the actor
actor = rlContinuousDeterministicActor(actorNet, obsInfo, actInfo);
%% DDPG Agent Options
agentOptions = rlDDPGAgentOptions(...
    'DiscountFactor', 0.98, ...
    'MiniBatchSize', 128, ...
    'TargetSmoothFactor', 1e-3, ...
    'ExperienceBufferLength', 1e6, ...
    'SampleTime', -1);
%% Create Two DDPG Agents
agent1 = rlDDPGAgent(actor, critic, agentOptions);
agent2 = rlDDPGAgent(actor, critic, agentOptions);
% Return agents as an array
agents = [agent1, agent2];
agentOptions.NoiseOptions.MeanAttractionConstant = 0.1;
agentOptions.NoiseOptions.StandardDeviation = 0.3;
agentOptions.NoiseOptions.StandardDeviationDecayRate = 8e-4;
agentOptions.NoiseOptions
end
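(As a quick sanity check, not part of the training script, the freshly created agents can be queried with a random observation to confirm the actor wiring; a minimal snippet, assuming S.N = 1.)
% Quick sanity check of the untrained agents returned above (assumes N = 1).
agents = createDDPGAgents(1);
sampleObs = {rand(2,1)};               % observation is a 2-by-1 vector
act = getAction(agents(1), sampleObs)  % untrained action, just to confirm dimensions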
The environment:
function [nextObs, reward, isDone, loggedSignals] = myStepFunction(action, loggedSignals, S)
% Propagate the discrete-time cart dynamics
nextObs1 = S.A1d*loggedSignals.State + S.B1d*action(1);
nextObs = nextObs1;
loggedSignals.State = nextObs1;
% Reward: bonus near the origin, clipped quadratic penalty otherwise
if abs(loggedSignals.State(1)) <= 0.05 && abs(loggedSignals.State(2)) <= 0.05
    reward1 = 10;
else
    reward1 = -1*(1.01*(nextObs1(1))^2 + 1.01*nextObs1(2)^2 + action^2);
    if reward1 <= -1000
        reward1 = -1000;
    end
end
reward = reward1;
% Terminate once both states are within 0.02 of the origin
if abs(loggedSignals.State(1)) <= 0.02 && abs(loggedSignals.State(2)) <= 0.02
    isDone = true;
else
    isDone = false;
end
end
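(For context, S.A1d and S.B1d are the discrete-time cart dynamics. A minimal sketch of one way such matrices can be built from a double integrator is below; the sample time Ts = 0.1 and the unit mass are placeholder values for illustration, not necessarily the ones used here.)
% Illustrative construction of the discrete-time dynamics used in
% myStepFunction (Ts and the unit mass are placeholder values).
Ts = 0.1;                              % sample time (illustrative)
Ac = [0 1; 0 0];                       % continuous double integrator, x = [position; velocity]
Bc = [0; 1];                           % force enters through the acceleration (unit mass)
sysd = c2d(ss(Ac, Bc, eye(2), 0), Ts); % zero-order-hold discretization
S.A1d = sysd.A;
S.B1d = sysd.B;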
And this is the simulation setup (I omitted the reset function here, and S.N = 1):
obsInfo1 = rlNumericSpec([2 1],'LowerLimit',-100*ones(2,1),'UpperLimit',100*ones(2,1)) ;
actInfo1 = rlNumericSpec([N 1],'LowerLimit',-100*ones(N,1),'UpperLimit',100*ones(N,1));
stepFn1 = @(action, loggedSignals) myStepFunction(action, loggedSignals, S);
resetFn1 = @() myResetFunction(pos1);
env = rlFunctionEnv(obsInfo1, actInfo1, stepFn1, resetFn1);
%% Specify agent initialization
agent= createDDPGAgents(S.N);
loggedSignals = [];
trainOpts = rlTrainingOptions(...
    StopOnError="on",...
    MaxEpisodes=1000,... %1100 for fully trained
    MaxStepsPerEpisode=1000,...
    StopTrainingCriteria="AverageReward",...
    StopTrainingValue=480,...
    Plots="training-progress");
%"training-progress"
train(agent, env, trainOpts);
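(After training stops, the learned policy can be rolled out once with sim to see where the cart actually settles; a minimal snippet, assuming the env and trained agent from above. The state history is also available in the returned experience.Observation.)
% Roll out the learned policy once and compute the episode return.
% agent(1) is used because createDDPGAgents returns an array of two agents.
simOpts = rlSimulationOptions(MaxSteps=1000);
experience = sim(env, agent(1), simOpts);
episodeReturn = sum(experience.Reward.Data)   % total reward collected by the learned policy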
This is the reward plot, where each episode takes a very long time, but there are still no signs of reaching the positive reward for this simple system.
And this is the control effect on both states, which shows that the RL agent is driving the cart to the wrong position, near -1, while its velocity is 0.
It is very weird that the reward does not converge to the positive value but to another point. Can I ask where the problem could be? Thanks.
Haochen

Answers (0)

Release

R2023b
