Reinforcement Learning agent converges to a suboptimal policy

15 views (last 30 days)
Jeehwan Lee on 13 Nov 2022
Hello,
I am trying to solve a multi-period optimal capacity planning problem with reinforcement learning. The system has two uncertainties that are stochastic but Markovian, plus a third state, which is the installed capacity. The benchmark is a single-period planning problem, which I have already solved with MINLP optimization.
I have spent many weeks trying different agents, but so far I have not succeeded in getting the agent to learn correctly.
In the graph below (actor-critic agent) you can see that although learning appears to take place, the converged value is suboptimal (less than the single-period optimization value).
One of the uncertainties is demand. In theory, the agent should increase the capacity in response to the demand it observes as a state. However, at convergence it does not do this properly.
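For context, this is roughly how the observation and action spaces are structured; the bounds and the action grid below are placeholders for illustration, not my actual values:

% Observation: [demand; second uncertainty; installed capacity]
obsInfo = rlNumericSpec([3 1], 'LowerLimit', [0; 0; 0], 'UpperLimit', [Inf; Inf; 1]);
obsInfo.Name = 'planning states';
% Discrete capacity additions (placeholder grid)
actInfo = rlFiniteSetSpec(0:0.1:1);
actInfo.Name = 'capacity addition';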
Note that although I have defined the actions as discrete, not all actions are feasible. To compensate for this, I have clipped the actions as follows:
if TIME_P < DEPLOY_T
    % Clip actions that would exceed the headroom above the installed capacity
    Action(Action > 1 - INS_CAP) = 1 - INS_CAP;  % If OPTION TO ABANDON is added, the action set becomes [-CAP_UPPER+INS_CAP : 5 : CAP_UPPER-INS_CAP]
else
    % TIME_P >= DEPLOY_T: no further capacity changes are allowed
    Action = 0;
end
Here, DEPLOY_T is the number of years during which the capacity planning actions can be exercised. The time step continues to TERMINAL_P to account for further future cash flows.
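For reference, here is a simplified sketch of where this clipping sits if the model is wrapped as a custom environment via rlFunctionEnv; the stochastic transitions and the cash-flow reward are omitted, and the constants and LoggedSignals field names are illustrative, not my actual values:

function [NextObs, Reward, IsDone, LoggedSignals] = capPlanStep(Action, LoggedSignals)
    DEPLOY_T   = 5;     % placeholder: last year in which capacity actions can be exercised
    TERMINAL_P = 20;    % placeholder: last period with cash flows
    TIME_P  = LoggedSignals.TIME_P;
    INS_CAP = LoggedSignals.INS_CAP;
    if TIME_P < DEPLOY_T
        % clip additions to the headroom above the installed capacity
        Action(Action > 1 - INS_CAP) = 1 - INS_CAP;
    else
        % past the deployment window, no further capacity changes
        Action = 0;
    end
    LoggedSignals.INS_CAP = INS_CAP + Action;
    LoggedSignals.TIME_P  = TIME_P + 1;
    % (demand and the second uncertainty would be updated here)
    NextObs = [LoggedSignals.demand; LoggedSignals.uncty2; LoggedSignals.INS_CAP];
    Reward  = 0;                                  % placeholder for the cash-flow reward
    IsDone  = LoggedSignals.TIME_P > TERMINAL_P;
end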
I was wondering if anyone has any tips (@Emmanouil Tzorakoleftherakis's answers on this forum have been particularly helpful, but no luck for me so far) or could possibly take a look at the code for me.

Answers (1)

Emmanouil Tzorakoleftherakis on 13 Feb 2023
Hello,
In your question you mention a graph, but it does not seem to be attached.
It sounds like the agent you trained has converged to a suboptimal solution. If that's the case, you probably need to tweak your reward a bit (make sure it is equivalent to your benchmark problem) and make sure the agent keeps exploring throughout training. Starting simple with a DQN agent would help. The EpsilonDecay and EpsilonMin values are important for exploration (see here). You may also want to randomize the initial condition of your environment; that could help bypass the local solution you converged to.
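For example, a rough sketch along these lines (placeholder values, assuming you have obsInfo/actInfo spec objects and wrap your model in a custom environment):

% DQN exploration settings
opt = rlDQNAgentOptions;
opt.EpsilonGreedyExploration.Epsilon      = 1;      % start fully exploratory
opt.EpsilonGreedyExploration.EpsilonMin   = 0.05;   % keep some exploration until the end of training
opt.EpsilonGreedyExploration.EpsilonDecay = 1e-4;   % smaller decay rate keeps epsilon high for longer
agent = rlDQNAgent(obsInfo, actInfo);               % DQN agent with default networks
agent.AgentOptions = opt;

% Randomizing the initial state in the environment reset function can help the
% agent escape the local solution (field names are illustrative):
function [InitialObs, LoggedSignals] = capPlanReset()
    LoggedSignals.TIME_P  = 1;
    LoggedSignals.INS_CAP = 0.2*rand;        % random initial capacity (placeholder range)
    LoggedSignals.demand  = 1 + 0.1*randn;   % random initial demand (placeholder)
    LoggedSignals.uncty2  = rand;            % second uncertainty (placeholder)
    InitialObs = [LoggedSignals.demand; LoggedSignals.uncty2; LoggedSignals.INS_CAP];
end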
