Reinforcement Learning DQN Training Convergence Problem

Views: 4 (last 30 days)
Gülin Sayal on 6 June 2021
Answered: Darshak on 29 April 2025
Hi everyone,
I am designing an energy management system for a vehicle and using a DQN agent to optimize fuel consumption. Here are the relevant lines from my code.
% Create the Simulink environment and read the I/O dimensions
env = rlSimulinkEnv(mdl,agentblk,obsInfo,actInfo);
nI = obsInfo.Dimension(1);        % number of observations
nL = 24;                          % hidden layer size
nO = numel(actInfo.Elements);     % number of discrete actions

% Critic network: state in, one Q-value per action out
dnn = [
    featureInputLayer(nI,'Name','state','Normalization','none')
    fullyConnectedLayer(nL,'Name','fc1')
    reluLayer('Name','relu1')
    fullyConnectedLayer(nL,'Name','fc2')
    reluLayer('Name','relu2')
    fullyConnectedLayer(nO,'Name','output')];

criticOpts = rlRepresentationOptions('LearnRate',0.00025,'GradientThreshold',1);
critic = rlQValueRepresentation(dnn,obsInfo,actInfo,'Observation',{'state'},criticOpts);

agentOpts = rlDQNAgentOptions(...
    'UseDoubleDQN',false, ...
    'TargetUpdateMethod',"periodic", ...
    'TargetUpdateFrequency',4, ...
    'ExperienceBufferLength',1000, ...
    'DiscountFactor',0.99, ...
    'MiniBatchSize',32);
agentOpts.EpsilonGreedyExploration.Epsilon = 1;
agentOpts.EpsilonGreedyExploration.EpsilonMin = 0.2;
agentOpts.EpsilonGreedyExploration.EpsilonDecay = 0.0050;
agentObj = rlDQNAgent(critic,agentOpts);

maxepisodes = 10000;
maxsteps = ceil(T/Ts);
trainingOpts = rlTrainingOptions('MaxEpisodes',maxepisodes,...
    'MaxStepsPerEpisode',maxsteps,...
    'Verbose',false,...
    'Plots','training-progress',...
    'StopTrainingCriteria','EpisodeReward',...
    'StopTrainingValue',0);

trainingStats = train(agentObj,env,trainingOpts);
The problem is that after training, the episode rewards do not converge. Moreover, the long-term estimated cumulative reward Q0 diverges. I have already read some posts on this topic here and normalized my action and observation spaces, which did not help. I also tried adding a scaling layer right before the last fullyConnectedLayer, which did not help either. You can find my training progress curves in the attachment.
What else can I try so that Q0 stops diverging and the episode rewards converge?
Also, I would really like to know how Q0 is calculated. It is not possible for my model to have such large long-term estimated rewards.
Best Regards,
Gülin

Answers (1)

Darshak on 29 April 2025
Hello Gülin Sayal,
I understand that the training does not converge and that the “Q₀” values diverge.
“Q₀” is the state-action value estimate computed from the target critic network at the beginning of each episode (typically at time step t = 0).
Mathematically, for an initial state “s₀”, the agent computes:
Q₀ = max_a Q_target(s₀, a)
Where:
  • Q_target is the target critic network, updated periodically.
  • max_a denotes taking the maximum Q-value over all possible actions in that state.
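As a rough sanity check, you can reproduce this value yourself from a trained agent. The lines below are only an illustrative sketch: they assume obs0 holds the observation vector at the start of an episode (how you obtain obs0 is up to you), and for simplicity they query the agent's critic rather than the target critic, whose estimates the critic closely tracks. getCritic and getValue are Reinforcement Learning Toolbox functions.
critic0 = getCritic(agentObj);          % extract the critic from the trained agent
qValues = getValue(critic0, {obs0});    % one Q-value per discrete action for this state
Q0      = max(qValues)                  % greedy value estimate at the initial state
If this value is already huge for a typical initial observation, the critic itself has diverged rather than the plot being misleading.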
If Q₀ diverges, it indicates instability in the value estimates, which can be caused by:
  • High learning rate
  • Poor network structure
  • Unnormalized input/output
  • Unstable reward scale
  • Improper target network update frequency
To resolve diverging Q₀ values and non-converging rewards in DQN training, you may refer to the steps mentioned below:
1. Scale rewards in the environment (e.g., divide by a constant) to keep them within a range like [-1, 1].
reward = rawReward / 100;   % example: divide by a constant chosen to match your reward magnitude
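Since your environment is a Simulink model, one simple place to do this is wherever the reward signal is computed, for example in a MATLAB Function block. A minimal sketch, assuming the unscaled fuel-consumption cost is available as rawReward (the name and the factor of 100 are illustrative):
function reward = scaleReward(rawReward)
% Illustrative MATLAB Function block body: keep per-step rewards roughly in [-1, 1]
rewardScale = 100;                 % hypothetical constant; tune to your typical cost magnitude
reward = rawReward / rewardScale;
end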
2. Use Double DQN to reduce Q-value overestimation.
agentOpts.UseDoubleDQN = true;
You can refer to the following documentation for more details on the “rlDQNAgentOptions” function: https://www.mathworks.com/help/releases/R2021a/reinforcement-learning/ref/rldqnagentoptions.html
3. Add a tanhLayer after the output to bound the Q-values between -1 and 1 (this is only consistent if the rewards are also scaled as in step 1, so that the true discounted returns fall in that range).
dnn = [
    featureInputLayer(nI,'Name','state','Normalization','none')
    fullyConnectedLayer(64,'Name','fc1')
    reluLayer('Name','relu1')
    fullyConnectedLayer(64,'Name','fc2')
    reluLayer('Name','relu2')
    fullyConnectedLayer(nO,'Name','output')
    tanhLayer('Name','tanhOut')];
You can refer to the following documentation for more information on the “tanhLayer” function: https://www.mathworks.com/help/releases/R2021a/deeplearning/ref/nnet.cnn.layer.tanhlayer.html
4. Reduce the critic's learning rate for stable updates.
criticOpts = rlRepresentationOptions('LearnRate',1e-4,'GradientThreshold',0.5);
5. Make target updates less frequent to reduce instability.
agentOpts.TargetUpdateFrequency = 20;
6. Use wider or deeper networks to improve approximation capability, e.g., increase the hidden layer width from 24 to 64 units:
nL = 64;   % instead of nL = 24; optionally add a third fullyConnectedLayer/reluLayer pair
7. Make sure exploration decays over time:
agentOpts.EpsilonGreedyExploration.Epsilon = 1;
agentOpts.EpsilonGreedyExploration.EpsilonMin = 0.1;
agentOpts.EpsilonGreedyExploration.EpsilonDecay = 0.01;
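Putting several of these suggestions together, a minimal sketch of the revised critic and agent options could look as follows (assuming dnn is the network from step 3; all numeric values are the illustrative ones from the steps above and will need tuning for your model):
criticOpts = rlRepresentationOptions('LearnRate',1e-4,'GradientThreshold',0.5);
critic = rlQValueRepresentation(dnn,obsInfo,actInfo,'Observation',{'state'},criticOpts);
agentOpts = rlDQNAgentOptions(...
    'UseDoubleDQN',true, ...               % step 2: reduce overestimation
    'TargetUpdateFrequency',20, ...        % step 5: less frequent target updates
    'TargetSmoothFactor',1, ...            % full target copy, matching your original periodic updates
    'ExperienceBufferLength',1000, ...
    'DiscountFactor',0.99, ...
    'MiniBatchSize',32);
agentOpts.EpsilonGreedyExploration.Epsilon = 1;        % step 7: decaying exploration
agentOpts.EpsilonGreedyExploration.EpsilonMin = 0.1;
agentOpts.EpsilonGreedyExploration.EpsilonDecay = 0.01;
agentObj = rlDQNAgent(critic,agentOpts);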
I hope this resolves the issue.
