
rlQValueFunction

Q-Value function approximator object for reinforcement learning agents

Description

This object implements a Q-value function approximator that you can use as a critic for a reinforcement learning agent. A Q-value function maps an environment state-action pair to a scalar value representing the predicted discounted cumulative long-term reward when the agent starts from the given state and executes the given action. A Q-value function critic therefore needs both the environment state and an action as inputs. After you create an rlQValueFunction critic, use it to create an agent such as an rlQAgent, rlDQNAgent, rlSARSAAgent, rlDDPGAgent, or rlTD3Agent agent. For more information on creating representations, see Create Policies and Value Functions.

Creation

Description

example

critic = rlQValueFunction(net,observationInfo,actionInfo) creates the Q-value function object critic. Here, net is the deep neural network used as an approximator, which must have both observation and action as inputs and a single scalar output. The network input layers are automatically associated with the environment observation and action channels according to the dimension specifications in observationInfo and actionInfo. This function sets the ObservationInfo and ActionInfo properties of critic to the observationInfo and actionInfo input arguments, respectively.

example

critic = rlQValueFunction(net,observationInfo,actionInfo,ObservationInputNames=netObsNames,ActionInputNames=netActName) specifies the names of the network input layers to be associated with the environment observation and action channels. The function assigns, in sequential order, each environment observation channel specified in observationInfo to the layer specified by the corresponding name in the string array netObsNames, and the environment action channel specified in actionInfo to the layer specified by the string netActName. Therefore, the network input layers, ordered as the names in netObsNames, must have the same data type and dimensions as the observation specifications, as ordered in observationInfo. Furthermore, the network input layer indicated by netActName must have the same data type and dimensions as the action specifications defined in actionInfo.

example

critic = rlQValueFunction(tab,observationInfo,actionInfo) creates the Q-value function object critic with discrete action and observation spaces from the Q-value table tab. tab is a rlTable object containing a table with as many rows as the possible observations and as many columns as the possible actions. The function sets the ObservationInfo and ActionInfo properties of critic respectively to the observationInfo and actionInfo input arguments, which in this case must be scalar rlFiniteSetSpec objects.

example

critic = rlQValueFunction({basisFcn,W0},observationInfo,actionInfo) creates a Q-value function object critic using a custom basis function as underlying approximator. The first input argument is a two-element cell array whose first element is the handle basisFcn to a custom basis function and whose second element is the initial weight vector W0. Here the basis function must have both observation and action as inputs and W0 must be a column vector. The function sets the ObservationInfo and ActionInfo properties of critic to the observationInfo and actionInfo input arguments, respectively.

critic = rlQValueFunction(___,UseDevice=useDevice) specifies the device used to perform computational operations on the critic object, and sets the UseDevice property of critic to the useDevice input argument. You can use this syntax with any of the previous input-argument combinations.
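
For example, this minimal sketch creates a critic that performs its computations on a GPU. It assumes that net, observationInfo, and actionInfo already exist (for instance, as in the examples below) and that a supported GPU is available.

% create the critic and have it perform its computations on a GPU
critic = rlQValueFunction(net,observationInfo,actionInfo, ...
    UseDevice="gpu");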

Input Arguments


Deep neural network used as the underlying approximator within the critic, specified as a dlnetwork object or as another deep neural network object from the Deep Learning Toolbox (see the following note). The network must have both the environment observation and action as inputs and a single scalar as output.

Note

Among the different network representation options, dlnetwork is preferred, since it has built-in validation checks and supports automatic differentiation. If you pass another network object as an input argument, it is internally converted to a dlnetwork object. However, best practice is to convert other representations to dlnetwork explicitly before using it to create a critic or an actor for a reinforcement learning agent. You can do so using dlnet=dlnetwork(net), where net is any neural network object from the Deep Learning Toolbox™. The resulting dlnet is the dlnetwork object that you use for your critic or actor. This practice allows a greater level of insight and control for cases in which the conversion is not straightforward and might require additional specifications.
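
For example, the following sketch shows the recommended explicit conversion. It assumes that lgraph is a layerGraph object that you have already assembled for the critic (with observation and action input layers and a single scalar output), and that obsInfo and actInfo are the corresponding specification objects.

% explicitly convert to a dlnetwork object (this also validates the network)
dlnet = dlnetwork(lgraph);

% use the dlnetwork object to create the critic
critic = rlQValueFunction(dlnet,obsInfo,actInfo);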

rlQValueFunction objects support recurrent deep neural networks.

The learnable parameters of the critic are the weights of the deep neural network. For a list of deep neural network layers, see List of Deep Learning Layers. For more information on creating deep neural networks for reinforcement learning, see Create Policies and Value Functions.

Network input layer names corresponding to the environment observation channels, specified as a string array or a cell array of character vectors. When you use the name-value arguments ObservationInputNames with netObsNames and ActionInputNames with netActName, the function assigns, in sequential order, each environment observation channel specified in observationInfo to the network input layer specified by the corresponding name in the string array netObsNames. It then assigns the environment action channel specified in actionInfo to the network input layer specified by the name netActName. Therefore, the network input layers, ordered as the names in netObsNames, must have the same data type and dimensions as the observation specifications, as ordered in observationInfo. Furthermore, the network input layer indicated by netActName must have the same data type and dimensions as the action specification defined in actionInfo.

Note

Of the information specified in observationInfo, the function uses only the data type and dimension of each channel, but not its (optional) name or description.

Example: {"NetInput1_airspeed","NetInput2_altitude"}

Network input layer name corresponding to the environment action channel, specified as a string or character vector. When you use the name-value arguments ObservationInputNames with netObsNames and ActionInputNames with netActName, the function assigns, in sequential order, each environment observation channel specified in observationInfo to the network input layer specified by the corresponding name in the string array netObsNames. It then assigns the environment action channel specified in actionInfo to the network input layer specified by the name netActName. Therefore, the network input layers, ordered as the names in netObsNames, must have the same data type and dimensions as the observation channels, as ordered in observationInfo. Furthermore, the network input layer indicated by netActName must have the same data type and dimensions as the action channel defined in actionInfo.

Note

The function does not use the name or the description (if any) of the action channel specified in actionInfo.

Example: 'myNetOutput_Force'

Q-value table, specified as an rlTable object containing an array with as many rows as the possible observations and as many columns as the possible actions. The element (s,a) is the expected cumulative long-term reward for taking action a from observed state s. The elements of this array are the learnable parameters of the critic.

Custom basis function, specified as a function handle to a user-defined MATLAB function. The user-defined function can be either an anonymous function or a function on the MATLAB path. The output of the critic is the scalar c = W'*B, where W is a weight vector containing the learnable parameters, and B is the column vector returned by the custom basis function.

Your basis function must have the following signature.

B = myBasisFunction(obs1,obs2,...,obsN,act)

Here, obs1 to obsN are inputs in the same order and with the same data type and dimensions as the environment observation channels defined in observationInfo, and act is an input with the same data type and dimensions as the environment action channel defined in actionInfo.

For an example on how to use a basis function to create a Q-value function critic with a mixed continuous and discrete observation space, see Create Mixed Observation Space Q-Value Function Critic from Custom Basis Function.

Example: @(obs1,obs2,act) [act(2)*obs1(1)^2; abs(obs2(5)+act(1))]

Initial value of the basis function weights W, specified as a column vector having the same length as the vector returned by the basis function.

Properties


Observation specifications, specified as an rlFiniteSetSpec or rlNumericSpec object or an array containing a mix of such objects. Each element in the array defines the properties of an environment observation channel, such as its dimensions, data type, and name. Note that only the data type and dimension of a channel are used by the software to create actors or critics, but not its (optional) name.

rlQValueFunction sets the ObservationInfo property of critic to the input argument observationInfo.

You can extract ObservationInfo from an existing environment or agent using getObservationInfo. You can also construct the specifications manually.

Action specifications, specified as an rlFiniteSetSpec or rlNumericSpec object. This object defines the properties of the environment action channel, such as its dimensions, data type, and name. Note that the function does not use the name of the action channel specified in actionInfo.

Note

Only one action channel is allowed.

rlQValueFunction sets the ActionInfo property of critic to the input argument actionInfo.

You can extract ActionInfo from an existing environment or agent using getActionInfo. You can also construct the specifications manually.
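
For example, this minimal sketch extracts both specification objects from a predefined environment. The specific environment used here is only an illustration; any environment object works the same way.

% create an example predefined environment
env = rlPredefinedEnv("DoubleIntegrator-Continuous");

% extract the observation and action specifications
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);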

Computation device used to perform operations such as gradient computation, parameter update and prediction during training and simulation, specified as either "cpu" or "gpu".

The "gpu" option requires both Parallel Computing Toolbox™ software and a CUDA® enabled NVIDIA® GPU. For more information on supported GPUs see GPU Support by Release (Parallel Computing Toolbox).

You can use gpuDevice (Parallel Computing Toolbox) to query or select a local GPU device to be used with MATLAB®.

Note

Training or simulating an agent on a GPU involves device-specific numerical round-off errors. These errors can produce different results compared to performing the same operations on a CPU.

You do not need to use this argument to speed up training by using parallel processing over multiple cores. Instead, when training your agent, use an rlTrainingOptions object in which the UseParallel option is set to true. For more information about training using multicore processors and GPUs, see Train Agents Using Parallel Computing and GPUs.

Example: 'UseDevice',"gpu"

Object Functions

rlDDPGAgent - Deep deterministic policy gradient reinforcement learning agent
rlTD3Agent - Twin-delayed deep deterministic policy gradient reinforcement learning agent
rlDQNAgent - Deep Q-network reinforcement learning agent
rlQAgent - Q-learning reinforcement learning agent
rlSARSAAgent - SARSA reinforcement learning agent
rlSACAgent - Soft actor-critic reinforcement learning agent
getValue - Obtain estimated value from a critic given environment observations and actions
getMaxQValue - Obtain maximum estimated value over all possible actions from a Q-value function critic with discrete action space, given environment observations

Examples


Create an observation specification object (or alternatively use getObservationInfo to extract the specification object from an environment). For this example, define the observation space as a continuous four-dimensional space, so that a single observation is a column vector containing four doubles.

obsInfo = rlNumericSpec([4 1]);

Create an action specification object (or alternatively use getActionInfo to extract the specification object from an environment). For this example, define the action space as a continuous two-dimensional space, so that a single action is a column vector containing two doubles.

actInfo = rlNumericSpec([2 1]);

Create a deep neural network to approximate the Q-value function. The network must have two inputs, one for the observation and one for the action. The observation input must accept a four-element vector (the observation vector defined by obsInfo). The action input must accept a two-element vector (the action vector defined by actInfo). The output of the network must be a scalar, representing the expected cumulative long-term reward when the agent starts from the given observation and takes the given action.

% observation path layers
obsPath = [featureInputLayer(4,'Normalization','none') 
           fullyConnectedLayer(1,'Name','obsout')];

% action path layers
actPath = [featureInputLayer(2,'Normalization','none') 
           fullyConnectedLayer(1,'Name','actout')];

% common path to output layers
comPath = [additionLayer(2,'Name', 'add')  ...
           fullyConnectedLayer(1, 'Name', 'output')];

% add layers to network object
net = addLayers(layerGraph(obsPath),actPath); 
net = addLayers(net,comPath);

% connect layers
net = connectLayers(net,'obsout','add/in1');
net = connectLayers(net,'actout','add/in2');

% plot network
plot(net)

% convert the network to a dlnetwork object
net = dlnetwork(net);

Create the critic with rlQValueFunction, using the network as well as the observation and action specification objects. When using this syntax, the network input layers are automatically associated with the components of the observation and action signals according to the dimension specifications in obsInfo and actInfo.

critic = rlQValueFunction(net,obsInfo,actInfo)
critic = 
  rlQValueFunction with properties:

    ObservationInfo: [1×1 rl.util.rlNumericSpec]
         ActionInfo: [1×1 rl.util.rlNumericSpec]
          UseDevice: "cpu"

To check your critic, use the getValue function to return the value of a random observation and action, using the current network weights.

v = getValue(critic,{rand(4,1)},{rand(2,1)})
v = single
    0.9559

You can now use the critic (along with an actor) to create an agent relying on a Q-value function critic (such as an rlQAgent, rlDQNAgent, rlSARSAAgent, or rlDDPGAgent agent).
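
For example, the following sketch pairs the critic with a simple deterministic actor to create a DDPG agent. The actor network shown here is only an illustration; any network that maps the four-element observation to a two-element action works.

% simple actor network: 4-element observation in, 2-element action out
actorNet = dlnetwork([
    featureInputLayer(4)
    fullyConnectedLayer(16)
    reluLayer
    fullyConnectedLayer(2)]);

% create a deterministic actor, then a DDPG agent from the actor and critic
actor = rlContinuousDeterministicActor(actorNet,obsInfo,actInfo);
agent = rlDDPGAgent(actor,critic);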

Create an observation specification object (or alternatively use getObservationInfo to extract the specification object from an environment). For this example, define the observation space as a continuous four-dimensional space, so that a single observation is a column vector containing four doubles.

obsInfo = rlNumericSpec([4 1]);

Create an action specification object (or alternatively use getActionInfo to extract the specification object from an environment). For this example, define the action space as a continuous two-dimensional space, so that a single action is a column vector containing two doubles.

actInfo = rlNumericSpec([2 1]);

Create a deep neural network to approximate the Q-value function. The network must have two inputs, one for the observation and one for the action. The observation input (here called netObsInput) must accept a four-element vector (the observation vector defined by obsInfo). The action input (here called netActInput) must accept a two-element vector (the action vector defined by actInfo). The output of the network must be a scalar, representing the expected cumulative long-term reward when the agent starts from the given observation and takes the given action.

% observation path layers
obsPath = [featureInputLayer(4, ...
               'Normalization','none','Name','netObsInput') 
           fullyConnectedLayer(1,'Name','obsout')];

% action path layers
actPath = [featureInputLayer(2, ...
               'Normalization','none','Name','netActInput') 
           fullyConnectedLayer(1,'Name','actout')];

% common path to output layers
comPath = [additionLayer(2,'Name', 'add')  ...
           fullyConnectedLayer(1, 'Name', 'output')];

% add layers to network object
net = addLayers(layerGraph(obsPath),actPath); 
net = addLayers(net,comPath);

% connect layers
net = connectLayers(net,'obsout','add/in1');
net = connectLayers(net,'actout','add/in2');

% plot network
plot(net)

% convert the network to a dlnetwork object
net = dlnetwork(net);

Create the critic with rlQValueFunction, using the network, the observation and action specification objects, and the names of the network input layers to be associated with the environment observation and action channels.

critic = rlQValueFunction(net,...
             obsInfo,actInfo, ...
             'ObservationInputNames',{'netObsInput'},...
             'ActionInputNames',{'netActInput'})
critic = 
  rlQValueFunction with properties:

    ObservationInfo: [1×1 rl.util.rlNumericSpec]
         ActionInfo: [1×1 rl.util.rlNumericSpec]
          UseDevice: "cpu"

To check your critic, use the getValue function to return the value of a random observation and action, using the current network weights.

v = getValue(critic,{rand(4,1)},{rand(2,1)})
v = single
    0.1102

You can now use the critic (along with an actor) to create an agent relying on a Q-value function critic (such as an rlQAgent, rlDQNAgent, rlSARSAAgent, or rlDDPGAgent agent).

Create a finite set observation specification object (or alternatively use getObservationInfo to extract the specification object from an environment with a discrete observation space). For this example, define the observation space as a finite set with four possible values.

obsInfo = rlFiniteSetSpec([7 5 3 1]);

Create a finite set action specification object (or alternatively use getActionInfo to extract the specification object from an environment with a discrete action space). For this example, define the action space as a finite set with two possible values.

actInfo = rlFiniteSetSpec([4 8]);

Create a table to approximate the value function within the critic. rlTable creates a value table object from the observation and action specification objects.

qTable = rlTable(obsInfo,actInfo);

The table stores a value (representing the expected cumulative long-term reward) for each possible observation-action pair. Each row corresponds to an observation and each column corresponds to an action. You can access the table using the Table property of the qTable object. The initial value of each element is zero.

qTable.Table
ans = 4×2

     0     0
     0     0
     0     0
     0     0

You can initialize the table to any value, in this case, an array containing the integers from 1 through 8.

qTable.Table=reshape(1:8,4,2)
qTable = 
  rlTable with properties:

    Table: [4×2 double]

Create the critic using the table as well as the observation and action specification objects.

critic = rlQValueFunction(qTable,obsInfo,actInfo)
critic = 
  rlQValueFunction with properties:

    ObservationInfo: [1×1 rl.util.rlFiniteSetSpec]
         ActionInfo: [1×1 rl.util.rlFiniteSetSpec]
          UseDevice: "cpu"

To check your critic, use the getValue function to return the value of a given observation and action, using the current table entries.

v = getValue(critic,{5},{8})
v = 6
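
Because the critic has a discrete action space, you can also use getMaxQValue to obtain the largest value over both possible actions for a given observation, along with the index of that action. For observation 5 the corresponding table row is [2 6], so the maximum value is 6 and the action index is 2 (corresponding to action 8).

[maxQ,maxActIdx] = getMaxQValue(critic,{5})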

You can now use the critic (along with an actor) to create a discrete action space agent relying on a Q-value function critic (such as an rlQAgent, rlDQNAgent, or rlSARSAAgent agent).

Create an observation specification object (or alternatively use getObservationInfo to extract the specification object from an environment). For this example, define the observation space as a continuous three-dimensional space, so that a single observation is a column vector containing 3 doubles.

obsInfo = rlNumericSpec([3 1]);

Create an action specification object (or alternatively use getActionInfo to extract the specification object from an environment). For this example, define the action space as a continuous two-dimensional space, so that a single action is a column vector containing 2 doubles.

actInfo = rlNumericSpec([2 1]);

Create a custom basis function to approximate the value function within the critic. The custom basis function must return a column vector. Each vector element must be a function of the observations and actions respectively defined by obsInfo and actInfo.

myBasisFcn = @(myobs,myact) [ ...
    myobs(2)^2; ...
    myobs(1)+exp(myact(1)); ...
    abs(myact(2)); ...
    myobs(3) ]
myBasisFcn = function_handle with value:
    @(myobs,myact)[myobs(2)^2;myobs(1)+exp(myact(1));abs(myact(2));myobs(3)]

The output of the critic is the scalar W'*myBasisFcn(myobs,myact), where W is a weight column vector that must have the same size as the custom basis function output. This output is the expected cumulative long-term reward when the agent starts from the given observation and takes the given action. The elements of W are the learnable parameters.

Define an initial parameter vector.

W0 = [1;4;4;2];

Create the critic. The first argument is a two-element cell containing both the handle to the custom function and the initial weight vector. The second and third arguments are, respectively, the observation and action specification objects.

critic = rlQValueFunction({myBasisFcn,W0}, ...
    obsInfo,actInfo)
critic = 
  rlQValueFunction with properties:

    ObservationInfo: [1x1 rl.util.rlNumericSpec]
         ActionInfo: [1x1 rl.util.rlNumericSpec]
          UseDevice: "cpu"

To check your critic, use getValue to return the value of a given observation-action pair, using the current parameter vector.

v = getValue(critic,{[1 2 3]'},{[4 5]'})
v = 252.3926
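
As a further check, you can evaluate W'*B directly, using the basis function and the weight vector defined above; the result matches the value returned by getValue.

% manually compute the critic output for the same observation-action pair
c = W0'*myBasisFcn([1 2 3]',[4 5]')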

You can now use the critic (along with an actor) to create an agent relying on a Q-value function critic (such as an rlQAgent, rlDQNAgent, rlSARSAAgent, or rlDDPGAgent agent).

Create an observation specification object (or alternatively use getObservationInfo to extract the specification object from an environment). For this example, define the observation space as consisting of two channels: the first is a vector over a continuous two-dimensional space, and the second is a vector over a three-dimensional space that can assume only four values.

obsInfo = [rlNumericSpec([1 2]) 
           rlFiniteSetSpec({[1 0 -1], ...
                            [-1 2 1], ...
                            [0.1 0.2 0.3], ...
                            [0 0 0]})];

Create an action specification object (or alternatively use getActionInfo to extract the specification object from an environment). For this example, define the action space as a discrete set consisting of three possible actions: 1, 2, and 3.

actInfo = rlFiniteSetSpec({1,2,3});

Create a custom basis function to approximate the value function within the critic. The custom basis function must return a column vector. Each vector element must be a function of the observations and the action respectively defined by obsInfo and actInfo. Note that the selected action, as defined, has only one element, while the observation channels have two and three elements, respectively.

myBasisFcn = @(obsA,obsB,act) [obsA(1)+obsB(2)+obsB(3)+act(1);
                               obsA(2)+obsB(1)+obsB(2)-act(1);
                               obsA(1)+obsB(2)+obsB(3)+act(1)^2;
                               obsA(1)+obsB(1)+obsB(2)-act(1)^2];

The output of the critic is the scalar W'*myBasisFcn(obsA,obsB,act), where W is a weight column vector that must have the same size as the custom basis function output. This output is the expected cumulative long-term reward when the agent starts from the given observation and takes the action specified as the last input. The elements of W are the learnable parameters.

Define an initial parameter vector.

W0 = ones(4,1);

Create the critic. The first argument is a two-element cell containing both the handle to the custom function and the initial weight vector. The second and third arguments are, respectively, the observation and action specification objects.

critic = rlQValueFunction({myBasisFcn,W0},obsInfo,actInfo)
critic = 
  rlQValueFunction with properties:

    ObservationInfo: [2×1 rl.util.RLDataSpec]
         ActionInfo: [1×1 rl.util.rlFiniteSetSpec]
          UseDevice: "cpu"

To check your critic, use the getValue function to return the value of a given observation-action pair, using the current parameter vector.

v = getValue(critic,{[-0.5 0.6],[1 0 -1]},{3})
v = -0.9000

Note that the critic does not enforce the set constraint for the discrete set elements.

v = getValue(critic,{[-0.5 0.6],[10 -10 -0.05]},{33})
v = -21.0000

You can now use the critic (along with an actor) to create an agent with a discrete action space relying on a Q-value function critic (such as an rlQAgent, rlDQNAgent, or rlSARSAAgent agent).

Version History

Introduced in R2022a