RL agent does not learn properly
Hello everyone,
I am trying to learn the Reinforcement Learning Toolbox and want to control the speed of a DC motor with an RL agent in place of a PI controller. I based my setup on the water tank example. However, I am running into some problems during training.
First, the agent tends to settle at either the minimum (0 rpm) or the maximum (6000 rpm) and then no longer changes its action, even though it had already achieved a good reward in earlier episodes.
My reward function uses the error between the reference and measured speed as a percentage. When I add a penalty so that the agent does not stay at 0 rpm, it still stays at 0 rpm and does not explore the rest of the range. I also have trouble removing the remaining steady-state error.
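For illustration, the reward I compute is roughly of the following form (written here as a MATLAB function; the weights and the 0 rpm threshold are placeholders, not my exact values):
function r = speedReward(omega_ref, omega_meas)
    % Sketch of the reward: relative (percentage) speed error plus an
    % extra penalty for sitting near 0 rpm. All numbers are placeholders.
    relErr = abs(omega_ref - omega_meas)/max(abs(omega_ref), eps);
    r = 1 - relErr;        % close to 1 when the tracking error is small
    if omega_meas < 100    % discourage idling near 0 rpm
        r = r - 1;
    end
end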
The full script and some pictures follow below.
close all
%DC motor parameters
R = 7.03;           % armature resistance [Ohm]
L = 1.04*10^-3;     % armature inductance [H]
J = 44.2*10^-7;     % rotor inertia [kg*m^2]
a = 2.45*10^-6;     % friction coefficient
Kn = 250*2*pi/60;   % speed constant, 250 rpm/V converted to rad/(s*V)
Km = 38.2*10^-3;    % torque constant [Nm/A]
actInfo = rlNumericSpec([1 1],'LowerLimit', 0, 'UpperLimit', 24);
actInfo.Name = 'spannung'; % action: motor voltage in V
obsInfo = rlNumericSpec([3 1],...
    'LowerLimit',[-inf -inf -inf]',...
    'UpperLimit',[ inf inf inf]');
obsInfo.Name = 'observations';
obsInfo.Description = 'integrated error, error, and measured rpm';
env = rlSimulinkEnv("DCMotorRL2", 'DCMotorRL2/RL Agent',...
    obsInfo, actInfo);
env.ResetFcn = @(in)localResetFcn(in);
Ts = 0.1; %agent sample time
Tf = 20; %simulation time
rng(0)
statePath = [
    featureInputLayer(obsInfo.Dimension(1),Name="netObsIn")
    fullyConnectedLayer(50)
    reluLayer
    fullyConnectedLayer(25,Name="CriticStateFC2")];
actionPath = [
    featureInputLayer(actInfo.Dimension(1),Name="netActIn")
    fullyConnectedLayer(25,Name="CriticActionFC1")];
commonPath = [
    additionLayer(2,Name="add")
    reluLayer
    fullyConnectedLayer(1,Name="CriticOutput")];
criticNetwork = layerGraph();
criticNetwork = addLayers(criticNetwork,statePath);
criticNetwork = addLayers(criticNetwork,actionPath);
criticNetwork = addLayers(criticNetwork,commonPath);
criticNetwork = connectLayers(criticNetwork,"CriticStateFC2","add/in1");
criticNetwork = connectLayers(criticNetwork,"CriticActionFC1","add/in2");
criticNetwork = dlnetwork(criticNetwork);
figure
plot(criticNetwork)
critic = rlQValueFunction(criticNetwork,obsInfo,actInfo, ...
    ObservationInputNames="netObsIn", ...
    ActionInputNames="netActIn");
actorNetwork = [
    featureInputLayer(obsInfo.Dimension(1))
    fullyConnectedLayer(9) %3
    tanhLayer
    fullyConnectedLayer(actInfo.Dimension(1))];
actorNetwork = dlnetwork(actorNetwork);
actor = rlContinuousDeterministicActor(actorNetwork,obsInfo,actInfo);
agent = rlDDPGAgent(actor,critic);
agent.SampleTime = Ts;
agent.AgentOptions.TargetSmoothFactor = 1e-3;
agent.AgentOptions.DiscountFactor = 1.0;
agent.AgentOptions.MiniBatchSize = 64;
agent.AgentOptions.ExperienceBufferLength = 1e6;
agent.AgentOptions.NoiseOptions.Variance = 0.8; %0.3
agent.AgentOptions.NoiseOptions.VarianceDecayRate = 1e-5; %-5
agent.AgentOptions.CriticOptimizerOptions.LearnRate = 1e-03;
agent.AgentOptions.CriticOptimizerOptions.GradientThreshold = 1;
agent.AgentOptions.ActorOptimizerOptions.LearnRate = 1e-04;
agent.AgentOptions.ActorOptimizerOptions.GradientThreshold = 1;
trainOpts = rlTrainingOptions(...
    MaxEpisodes=4000, ...
    MaxStepsPerEpisode=ceil(Tf/Ts), ...
    ScoreAveragingWindowLength=20, ...
    Verbose=false, ...
    Plots="training-progress",...
    StopTrainingCriteria="AverageReward",...
    StopTrainingValue=800,...
    SaveAgentCriteria="EpisodeCount", ...
    SaveAgentValue=600);
doTraining = true;
if doTraining
    % Train the agent.
    trainingStats = train(agent,env,trainOpts);
end
function in = localResetFcn(in)
    % randomize the reference speed (in 1/min)
    blk = sprintf('DCMotorRL2/omega_ref');
    h = randi([2000,4000]);
    in = setBlockParameter(in,blk,'Value',num2str(h));
    % optionally randomize the initial motor speed as well
    % h = randi([2000,4000])*(2*pi)/60;
    % blk = 'DCMotorRL2/DCMotor/Integrator1';
    % in = setBlockParameter(in,blk,'InitialCondition',num2str(h));
end
Accepted Answer
Emmanouil Tzorakoleftherakis
20 March 2023
Some comments:
1) 150 episodes is really not much; you need to let the training run for quite a bit longer.
2) There is no guarantee that the reward will always go up. It may drop while the agent explores, and the agent may then find a better policy along the way.
3) The noise variance is critical with DDPG agents. Make sure the exploration noise is between 1% and 10% of your action range (see the sketch after this list).
4) A sample time of 0.1 seconds seems too large for a motor control application.
5) This example does FOC with RL, but you may still be able to use it for general information:
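As a rough sketch of points 3) and 4) (the 5% noise level and the 1 ms sample time are example values, not prescriptions):
% Scale the DDPG exploration noise to the action range (0-24 V here)
actionRange = 24;           % UpperLimit - LowerLimit of the action spec
sigma = 0.05*actionRange;   % ~5% of the range, within the 1-10% rule of thumb
agent.AgentOptions.NoiseOptions.Variance = sigma^2; % newer releases use NoiseOptions.StandardDeviation = sigma
% Use a faster agent sample time for the motor control loop
Ts = 1e-3;                  % example value; match it to the motor dynamics
agent.AgentOptions.SampleTime = Ts;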