generateRewardFunction
Generate a reward function from control specifications to train a reinforcement learning agent
Since R2021b
Syntax
Description
generateRewardFunction(
generates a
MATLAB® reward function based on the cost and constraints defined in the linear or
nonlinear MPC object mpcobj
)mpcobj
. The generated reward function is displayed
in a new editor window and you can use it as a starting point for reward design. You can
tune the weights, use a different penalty function, and then use the resulting reward
function within an environment to train an agent.
This syntax requires Model Predictive Control Toolbox™ software.
generateRewardFunction(
generates a
MATLAB reward function based on performance constraints defined in the model
verification blocks specified in the array of block paths blks
)blks
.
This syntax requires Simulink® Design Optimization™ software.
generateRewardFunction(___,'FunctionName',
specifies the name of the generated reward function, and saves it into a file with the same
name. It also overwrites any preexisting file with the same name in the current directory.
Provide this name after either of the previous input arguments.myFcnName
)
Examples
Generate a Reward Function from MPC Object
This example shows how to generate a reinforcement learning reward function from an MPC object.
Define Plant and Create MPC Controller
Create a random plant using the rss
function and set the feedthrough matrix to zero.
plant = rss(4,3,2); plant.d = 0;
Specify which of the plant signals are manipulated variables, measured disturbances, measured outputs and unmeasured outputs.
plant = setmpcsignals(plant,"MV",1,"MD",2,"MO",[1 2],"UO",3);
Create an MPC controller with a sample time of 0.1 and prediction and control horizons of 10 and 3 steps, respectively.
mpcobj = mpc(plant,0.1,10,3);
-->"Weights.ManipulatedVariables" is empty. Assuming default 0.00000. -->"Weights.ManipulatedVariablesRate" is empty. Assuming default 0.10000. -->"Weights.OutputVariables" is empty. Assuming default 1.00000. for output(s) y1 and zero weight for output(s) y2 y3
Set limits and a scale factor for the manipulated variable.
mpcobj.ManipulatedVariables.Min = -2; mpcobj.ManipulatedVariables.Max = 2; mpcobj.ManipulatedVariables.ScaleFactor = 4;
Set weights for the quadratic cost function.
mpcobj.Weights.OutputVariables = [10 1 0.1]; mpcobj.Weights.ManipulatedVariablesRate = 0.2;
Generate the Reward Function
Generate the reward function code from specifications in the mpc
object using generateRewardFunction
. The code is displayed in the MATLAB Editor.
generateRewardFunction(mpcobj)
For this example, the reward function is saved in the MATLAB® function file myMpcRewardFcn.m
. Display the generated reward function.
type myMpcRewardFcn.m
function reward = myMpcRewardFcn(y,refy,mv,refmv,lastmv) % MYMPCREWARDFCN generates rewards from MPC specifications. % % Description of input arguments: % % y : Output variable from plant at step k+1 % refy : Reference output variable at step k+1 % mv : Manipulated variable at step k % refmv : Reference manipulated variable at step k % lastmv : Manipulated variable at step k-1 % % Limitations (MPC and NLMPC): % - Reward computed based on first step in prediction horizon. % Therefore, signal previewing and control horizon settings are ignored. % - Online cost and constraint update is not supported. % - Custom cost and constraint specifications are not considered. % - Time varying cost weights and constraints are not supported. % - Mixed constraint specifications are not considered (for the MPC case). % Reinforcement Learning Toolbox % 27-May-2021 14:47:29 %#codegen %% Specifications from MPC object % Standard linear bounds as specified in 'States', 'OutputVariables', and % 'ManipulatedVariables' properties ymin = [-Inf -Inf -Inf]; ymax = [Inf Inf Inf]; mvmin = -2; mvmax = 2; mvratemin = -Inf; mvratemax = Inf; % Scale factors as specified in 'States', 'OutputVariables', and % 'ManipulatedVariables' properties Sy = [1 1 1]; Smv = 4; % Standard cost weights as specified in 'Weights' property Qy = [10 1 0.1]; Qmv = 0; Qmvrate = 0.2; %% Compute cost dy = (refy(:)-y(:)) ./ Sy'; dmv = (refmv(:)-mv(:)) ./ Smv'; dmvrate = (mv(:)-lastmv(:)) ./ Smv'; Jy = dy' * diag(Qy.^2) * dy; Jmv = dmv' * diag(Qmv.^2) * dmv; Jmvrate = dmvrate' * diag(Qmvrate.^2) * dmvrate; Cost = Jy + Jmv + Jmvrate; %% Penalty function weight (specify nonnegative) Wy = [1 1 1]; Wmv = 10; Wmvrate = 10; %% Compute penalty % Penalty is computed for violation of linear bound constraints. % % To compute exterior bound penalty, use the exteriorPenalty function and % specify the penalty method as 'step' or 'quadratic'. % % Alternaltively, use the hyperbolicPenalty or barrierPenalty function for % computing hyperbolic and barrier penalties. % % For more information, see help for these functions. % % Set Pmv value to 0 if the RL agent action specification has % appropriate 'LowerLimit' and 'UpperLimit' values. Py = Wy * exteriorPenalty(y,ymin,ymax,'step'); Pmv = Wmv * exteriorPenalty(mv,mvmin,mvmax,'step'); Pmvrate = Wmvrate * exteriorPenalty(mv-lastmv,mvratemin,mvratemax,'step'); Penalty = Py + Pmv + Pmvrate; %% Compute reward reward = -(Cost + Penalty); end
The calculated reward depends only on the current values of the plant input and output signals and their reference values, and it is composed of two parts.
The first is a negative cost that depends on the squared difference between desired and current plant inputs and outputs. This part uses the cost function weights specified in the MPC object. The second part is a penalty that acts as a negative reward whenever the current plant signals violate the constraints.
The generated reward function is a starting point for reward design. You can tune the weights or use a different penalty function to define a more appropriate reward for your reinforcement learning agent.
Generate Reward Function from Verification Block
This example shows how to generate a reinforcement learning reward function from a Simulink Design Optimization model verification block.
For this example, open the Simulink model LevelCheckBlock.slx
, which contains a Check Step Response Characteristics block named Level Check.
open_system("LevelCheckBlock")
Generate the reward function code from specifications in the Level Check block, using generateRewardFunction
. The code is displayed in the MATLAB Editor.
generateRewardFunction("LevelCheckBlock/Level Check")
For this example, the code is saved in the MATLAB function file myBlockRewardFcn.m
.
Display the generated reward function.
type myBlockRewardFcn.m
function reward = myBlockRewardFcn(x,t) % MYBLOCKREWARDFCN generates rewards from Simulink block specifications. % % x : Input of LevelCheckBlock/Level Check % t : Simulation time (s) % Reinforcement Learning Toolbox % 27-May-2021 16:45:27 %#codegen %% Specifications from LevelCheckBlock/Level Check Block1_InitialValue = 1; Block1_FinalValue = 2; Block1_StepTime = 0; Block1_StepRange = Block1_FinalValue - Block1_InitialValue; Block1_MinRise = Block1_InitialValue + Block1_StepRange * 80/100; Block1_MaxSettling = Block1_InitialValue + Block1_StepRange * (1+2/100); Block1_MinSettling = Block1_InitialValue + Block1_StepRange * (1-2/100); Block1_MaxOvershoot = Block1_InitialValue + Block1_StepRange * (1+10/100); Block1_MinUndershoot = Block1_InitialValue - Block1_StepRange * 5/100; if t >= Block1_StepTime if Block1_InitialValue <= Block1_FinalValue Block1_UpperBoundTimes = [0,5; 5,max(5+1,t+1)]; Block1_UpperBoundAmplitudes = [Block1_MaxOvershoot Block1_MaxOvershoot; Block1_MaxSettling Block1_MaxSettling]; Block1_LowerBoundTimes = [0,2; 2,5; 5,max(5+1,t+1)]; Block1_LowerBoundAmplitudes = [Block1_MinUndershoot Block1_MinUndershoot; Block1_MinRise Block1_MinRise; Block1_MinSettling Block1_MinSettling]; else Block1_UpperBoundTimes = [0,2; 2,5; 5,max(5+1,t+1)]; Block1_UpperBoundAmplitudes = [Block1_MinUndershoot Block1_MinUndershoot; Block1_MinRise,Block1_MinRise; Block1_MinSettling Block1_MinSettling]; Block1_LowerBoundTimes = [0,5; 5,max(5+1,t+1)]; Block1_LowerBoundAmplitudes = [Block1_MaxOvershoot Block1_MaxOvershoot; Block1_MaxSettling Block1_MaxSettling]; end Block1_xmax = zeros(1,size(Block1_UpperBoundTimes,1)); for idx = 1:numel(Block1_xmax) tseg = Block1_UpperBoundTimes(idx,:); xseg = Block1_UpperBoundAmplitudes(idx,:); Block1_xmax(idx) = interp1(tseg,xseg,t,'linear',NaN); end if all(isnan(Block1_xmax)) Block1_xmax = Inf; else Block1_xmax = max(Block1_xmax,[],'omitnan'); end Block1_xmin = zeros(1,size(Block1_LowerBoundTimes,1)); for idx = 1:numel(Block1_xmin) tseg = Block1_LowerBoundTimes(idx,:); xseg = Block1_LowerBoundAmplitudes(idx,:); Block1_xmin(idx) = interp1(tseg,xseg,t,'linear',NaN); end if all(isnan(Block1_xmin)) Block1_xmin = -Inf; else Block1_xmin = max(Block1_xmin,[],'omitnan'); end else Block1_xmin = -Inf; Block1_xmax = Inf; end %% Penalty function weight (specify nonnegative) Weight = 1; %% Compute penalty % Penalty is computed for violation of linear bound constraints. % % To compute exterior bound penalty, use the exteriorPenalty function and % specify the penalty method as 'step' or 'quadratic'. % % Alternaltively, use the hyperbolicPenalty or barrierPenalty function for % computing hyperbolic and barrier penalties. % % For more information, see help for these functions. Penalty = sum(exteriorPenalty(x,Block1_xmin,Block1_xmax,'step')); %% Compute reward reward = -Weight * Penalty; end
The generated reward function takes as input arguments the current value of the verification block input signals and the simulation time. A negative reward is calculated using a weighted penalty that acts whenever the current block input signals violate the linear bound constraints defined in the verification block.
The generated reward function is a starting point for reward design. You can tune the weights or use a different penalty function to define a more appropriate reward for your reinforcement learning agent.
Input Arguments
mpcobj
— Linear or nonlinear MPC object
mpc
object | nlmpc
object
Linear or nonlinear MPC object, specified as an mpc
(Model Predictive Control Toolbox) object or an nlmpc
(Model Predictive Control Toolbox) object,
respectively.
Note that:
The generated function calculates rewards using signal values at the current time only. Predicted future values, signal previewing, and control horizon settings are not used in the reward calculation.
Using time-varying cost weights and constraints, or updating them online, is not supported.
Only the standard quadratic cost function, as described in Optimization Problem (Model Predictive Control Toolbox), is supported. Therefore, for
mpc
objects, using mixed constraint specifications is not supported. Similarly, fornlmpc
objects, custom cost and constraint specifications are not supported.
Example: mpc(tf([1 1],[1 2 0]),0.1)
blks
— Path to model verification blocks
char
vector | cell
array | string
array
Path to model verification blocks, specified as character vector, cell array or string array. The supported Simulink Design Optimization model verification blocks are the following ones.
Check Against Reference (Simulink Design Optimization)
Check Custom Bounds (Simulink Design Optimization)
Check Step Response Characteristics (Simulink Design Optimization)
Any block belonging to the Simulink Model Verification (Simulink) library
The generated reward function takes as input arguments the current value of the verification block input signals and the simulation time. A negative reward is calculated using a weighted penalty that acts whenever the current block input signals violate the linear bound constraints defined in the verification block.
Example:
"mySimulinkModel02/Check Against Reference"
myFcnName
— Function name
string
| char
vector
Function name, specified as a string object or character vector.
Example: "reward03epf_step"
Tips
By default, the exterior bound penalty function exteriorPenalty
is
used to calculate the penalty. Alternatively, to calculate hyperbolic and barrier penalties,
you can use the hyperbolicPenalty
or
barrierPenalty
functions.
Version History
Introduced in R2021b
See Also
Functions
Objects
MATLAB 명령
다음 MATLAB 명령에 해당하는 링크를 클릭했습니다.
명령을 실행하려면 MATLAB 명령 창에 입력하십시오. 웹 브라우저는 MATLAB 명령을 지원하지 않습니다.
Select a Web Site
Choose a web site to get translated content where available and see local events and offers. Based on your location, we recommend that you select: .
You can also select a web site from the following list:
How to Get Best Site Performance
Select the China site (in Chinese or English) for best site performance. Other MathWorks country sites are not optimized for visits from your location.
Americas
- América Latina (Español)
- Canada (English)
- United States (English)
Europe
- Belgium (English)
- Denmark (English)
- Deutschland (Deutsch)
- España (Español)
- Finland (English)
- France (Français)
- Ireland (English)
- Italia (Italiano)
- Luxembourg (English)
- Netherlands (English)
- Norway (English)
- Österreich (Deutsch)
- Portugal (English)
- Sweden (English)
- Switzerland
- United Kingdom (English)