
Reinforcement Learning Using Deep Neural Networks

Reinforcement learning is a goal-directed computational approach where a computer learns to perform a task by interacting with an unknown dynamic environment. This learning approach enables the computer to make a series of decisions to maximize the cumulative reward for the task without human intervention and without being explicitly programmed to achieve the task. The following diagram shows a general representation of a reinforcement learning scenario.

The goal of reinforcement learning is to train the policy of an agent to complete a task within an unknown environment. The agent receives observations and a reward from the environment and sends actions to the environment. The reward is a measure of how successful an action is with respect to completing the task goal.

To create and train reinforcement learning agents, you can use Reinforcement Learning Toolbox™ software. Typically, agent policies are implemented using deep neural networks, which you can create using Deep Learning Toolbox™ software.

Reinforcement learning is useful for many control and planning applications. The following examples show how to train reinforcement learning agents for robotics and automated driving tasks.

Reinforcement Learning Workflow

The general workflow for training an agent using reinforcement learning includes the following steps.

  1. Formulate problem — Define the task for the agent to learn, including how the agent interacts with the environment and any primary and secondary goals the agent must achieve.

  2. Create environment — Define the environment within which the agent operates, including the interface between agent and environment and the environment dynamic model.

  3. Define reward — Specify the reward signal that the agent uses to measure its performance against the task goals and how to calculate this signal from the environment.

  4. Create agent — Create the agent, which includes defining a policy representation and configuring the agent learning algorithm.

  5. Train agent — Train the agent policy representation using the defined environment, reward, and agent learning algorithm.

  6. Validate agent — Evaluate the performance of the trained agent by simulating the agent and environment together.

  7. Deploy policy — Deploy the trained policy representation using, for example, generated GPU code.
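
For illustration, the following MATLAB sketch walks through these steps using a predefined environment and a default agent; the environment name, agent type, and training settings are placeholder choices rather than requirements of the workflow.

    % Steps 1-2: formulate the problem and create an environment.
    % A predefined cart-pole environment stands in for a custom one here.
    env = rlPredefinedEnv("CartPole-Discrete");
    obsInfo = getObservationInfo(env);
    actInfo = getActionInfo(env);

    % Step 3: for a predefined environment, the reward is already defined inside the model.

    % Step 4: create an agent with a default deep neural network policy representation.
    agent = rlDQNAgent(obsInfo,actInfo);

    % Step 5: train the agent (settings are placeholders).
    trainOpts = rlTrainingOptions( ...
        MaxEpisodes=500, ...
        MaxStepsPerEpisode=500, ...
        StopTrainingCriteria="AverageReward", ...
        StopTrainingValue=480);
    trainingStats = train(agent,env,trainOpts);

    % Step 6: validate the trained agent by simulating it against the environment.
    experience = sim(env,agent,rlSimulationOptions(MaxSteps=500));

    % Step 7: generate a deployable policy evaluation function.
    generatePolicyFunction(agent);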

Training an agent using reinforcement learning is an iterative process. Decisions and results in later stages can require you to return to an earlier stage in the learning workflow. For example, if the training process does not converge to an optimal policy within a reasonable amount of time, you might have to update any of the following before retraining the agent:

  • Training settings

  • Learning algorithm configuration

  • Policy representation

  • Reward signal definition

  • Action and observation signals

  • Environment dynamics

Reinforcement Learning Environments

In a reinforcement learning scenario, where you train an agent to complete a task, the environment models the dynamics with which the agent interacts. The environment:

  1. Receives actions from the agent.

  2. Outputs observations in response to the actions.

  3. Generates a reward measuring how well the action contributes to achieving the task.

Creating an environment model includes defining the following:

  • Action and observation signals that the agent uses to interact with the environment

  • The reward signal that the agent uses to measure its success

  • The environment dynamic behavior

You can create an environment in either MATLAB® or Simulink®. For more information, see Reinforcement Learning Environments (Reinforcement Learning Toolbox) and Create Custom Simulink Environments (Reinforcement Learning Toolbox).
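
As an example, the following sketch defines observation and action specifications and creates an environment from a Simulink model; the model name and agent block path are hypothetical placeholders, and the predefined MATLAB environment in the final comment is one alternative.

    % Define the observation and action channels of the agent-environment interface.
    obsInfo = rlNumericSpec([3 1]);        % example: a 3-element continuous observation
    actInfo = rlFiniteSetSpec([-1 0 1]);   % example: three discrete action values

    % Create the environment from a Simulink model (hypothetical model and block names).
    env = rlSimulinkEnv("myModel","myModel/RL Agent",obsInfo,actInfo);

    % Alternatively, start from one of the predefined MATLAB environments.
    % env = rlPredefinedEnv("CartPole-Discrete");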

Reinforcement Learning Agents

A reinforcement learning agent contains two components: a policy and a learning algorithm.

  • The policy is a mapping that selects actions based on observations from the environment. Typically, the policy is a function approximator with tunable parameters, such as a deep neural network.

  • The learning algorithm continuously updates the policy parameters based on the actions, observations, and reward. The goal of the learning algorithm is to find an optimal policy that maximizes the cumulative reward received during the task.

Agents are distinguished by their learning algorithms and policy representations. Agents can operate in discrete action spaces, continuous action spaces, or both. In a discrete action space, the agent selects actions from a finite set of possible actions. In a continuous action space, the agent selects an action from a continuous range of possible action values. Reinforcement Learning Toolbox software supports the following types of agents.

Agent | Action Space
Q-Learning Agents (Reinforcement Learning Toolbox) | Discrete
Deep Q-Network (DQN) Agents (Reinforcement Learning Toolbox) | Discrete
SARSA Agents (Reinforcement Learning Toolbox) | Discrete
Policy Gradient (PG) Agents (Reinforcement Learning Toolbox) | Discrete or continuous
Actor-Critic (AC) Agents (Reinforcement Learning Toolbox) | Discrete or continuous
Proximal Policy Optimization (PPO) Agents (Reinforcement Learning Toolbox) | Discrete or continuous
Deep Deterministic Policy Gradient (DDPG) Agents (Reinforcement Learning Toolbox) | Continuous
Twin-Delayed Deep Deterministic (TD3) Policy Gradient Agents (Reinforcement Learning Toolbox) | Continuous
Soft Actor-Critic (SAC) Agents (Reinforcement Learning Toolbox) | Continuous

For more information, see Reinforcement Learning Agents (Reinforcement Learning Toolbox).

Create Deep Neural Network Policies and Value Functions

Depending on the type of agent you use, its policy and learning algorithm require one or more policy and value function representations, which you can implement using deep neural networks.

Reinforcement Learning Toolbox supports the following types of value function and policy representations.

  • V(S|θV) — Critics that estimate the expected cumulative long-term reward (value function) based on a given observation S.

  • Q(S,A|θQ) — Critics that estimate the value function for a given discrete action A and a given observation S.

  • Qi(S,Ai|θQ) — Multi-output critics that estimate the value function for all possible discrete actions Ai and a given observation S.

  • μ(S|θμ) — Actors that select an action based on a given observation S. Actors can select actions using either deterministic or stochastic methods.

During training, the agent updates the parameters of these representations (θV, θQ, and θμ).

You can create most Reinforcement Learning Toolbox agents with default policy and value function representations. The agents define the input and output layers of these deep neural networks based on the action and observation specifications from the environment.

Alternatively, you can create actor and critic representations for your agent using Deep Learning Toolbox functionality, such as the Deep Network Designer app. In this case, ensure that the input and output dimensions of the actor and critic representations match the corresponding action and observation specifications of the environment. For an example that creates a critic representation using Deep Network Designer, see Create DQN Agent Using Deep Network Designer and Train Using Image Observations.
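
For instance, instead of using the app, you can assemble a critic network programmatically from Deep Learning Toolbox layers. The following sketch assumes obsInfo and actInfo are the specifications returned by getObservationInfo and getActionInfo for an environment with a discrete action space; the layer sizes are arbitrary.

    % Multi-output critic network: input size matches the observation,
    % output size matches the number of possible discrete actions.
    criticNet = [
        featureInputLayer(prod(obsInfo.Dimension))
        fullyConnectedLayer(64)
        reluLayer
        fullyConnectedLayer(numel(actInfo.Elements))];

    % Wrap the network in a critic whose dimensions match the environment specifications.
    critic = rlVectorQValueFunction(criticNet,obsInfo,actInfo);

    % Use the critic when constructing the agent, for example a DQN agent.
    agent = rlDQNAgent(critic);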

Deep neural networks consist of a series of interconnected layers. For a full list of available layers, see List of Deep Learning Layers.

All agents, except Q-learning and SARSA agents, support recurrent neural networks (RNNs). These networks have a sequenceInputLayer as the input layer and at least one layer that carries hidden state information, such as an lstmLayer. Recurrent networks can be especially useful when the environment has states that are not included in the observation vector.
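
A minimal sketch of such a recurrent critic (with arbitrary layer sizes, and obsInfo and actInfo assumed as before) replaces the feature input layer with a sequence input layer and adds an LSTM layer:

    % Recurrent critic network: sequenceInputLayer accepts observation sequences,
    % and lstmLayer carries hidden state between time steps.
    rnnCriticNet = [
        sequenceInputLayer(prod(obsInfo.Dimension))
        lstmLayer(32)
        fullyConnectedLayer(numel(actInfo.Elements))];

    critic = rlVectorQValueFunction(rnnCriticNet,obsInfo,actInfo);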

For more information on creating agents and their associated value function and policy representations, see the corresponding agent pages in the previous table.

Reinforcement Learning Toolbox software provides additional layers that you can use when creating deep neural network representations.

Layer | Description
scalingLayer (Reinforcement Learning Toolbox) | Applies a linear scale and bias to an input array. This layer is useful for scaling and shifting the outputs of nonlinear layers, such as tanhLayer and sigmoidLayer.
quadraticLayer (Reinforcement Learning Toolbox) | Creates a vector of quadratic monomials constructed from the elements of the input array. This layer is useful when you need an output that is some quadratic function of its inputs, such as for an LQR controller.
softplusLayer (Reinforcement Learning Toolbox) | Implements the softplus activation Y = log(1 + e^X), which ensures that the output is always positive. This is a smoothed version of the rectified linear unit (ReLU).
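
For example, a scalingLayer placed after a tanhLayer can map an actor output from the range [-1, 1] to an action range such as [-2, 2]. This sketch assumes obsInfo is an observation specification and actInfo is a continuous (rlNumericSpec) action specification with a single element; the layer sizes are arbitrary.

    % Actor network whose final tanh output is rescaled from [-1, 1] to [-2, 2].
    actorNet = [
        featureInputLayer(prod(obsInfo.Dimension))
        fullyConnectedLayer(32)
        reluLayer
        fullyConnectedLayer(1)
        tanhLayer
        scalingLayer(Scale=2)];

    actor = rlContinuousDeterministicActor(actorNet,obsInfo,actInfo);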

For more information on creating policy and value function representations, see Create Policies and Value Functions (Reinforcement Learning Toolbox).

You can also import pretrained deep neural networks or deep neural network layer architectures using the Deep Learning Toolbox network import functionality. For more information, see Import Neural Network Models Using ONNX (Reinforcement Learning Toolbox).

Train Reinforcement Learning Agents

Once you create an environment and reinforcement learning agent, you can train the agent in the environment using the train (Reinforcement Learning Toolbox) function. To configure your training, use an rlTrainingOptions (Reinforcement Learning Toolbox) object. For more information, see Train Reinforcement Learning Agents (Reinforcement Learning Toolbox).
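
For example (the option values shown are arbitrary placeholders, and env and agent are assumed to exist from the earlier steps):

    % Configure training and train the agent.
    trainOpts = rlTrainingOptions( ...
        MaxEpisodes=1000, ...
        MaxStepsPerEpisode=500, ...
        ScoreAveragingWindowLength=20, ...
        StopTrainingCriteria="AverageReward", ...
        StopTrainingValue=500);

    trainingStats = train(agent,env,trainOpts);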

If you have Parallel Computing Toolbox™ software, you can accelerate training and simulation by using multicore processors or GPUs. For more information, see Train Agents Using Parallel Computing and GPUs (Reinforcement Learning Toolbox).
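
For example, you might enable parallel training through the same options object; this is a sketch, and the asynchronous mode shown is one of several possible configurations.

    % Enable parallel training across workers (requires Parallel Computing Toolbox).
    trainOpts.UseParallel = true;
    trainOpts.ParallelizationOptions.Mode = "async";
    trainingStats = train(agent,env,trainOpts);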

Deploy Trained Policies

Once you train a reinforcement learning agent, you can generate code to deploy the optimal policy. You can generate:

  • CUDA® code using GPU Coder™

  • C/C++ code using MATLAB Coder™

To create a policy evaluation function that selects an action based on a given observation, use the generatePolicyFunction (Reinforcement Learning Toolbox) command. This command generates a MATLAB script, which contains the policy evaluation function, and a MAT-file, which contains the optimal policy data.

You can generate code to deploy this policy function using GPU Coder or MATLAB Coder.
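
For example, the following sketch generates the policy function and then compiles it as a static library with MATLAB Coder. By default, generatePolicyFunction creates evaluatePolicy.m and agentData.mat; the observation size passed to codegen is a placeholder that must match your environment.

    % Generate the policy evaluation function and policy data from the trained agent.
    generatePolicyFunction(agent);

    % Generate C code for the policy function using MATLAB Coder.
    % ones(4,1) is a placeholder example observation with the correct dimensions.
    cfg = coder.config("lib");
    codegen("-config",cfg,"evaluatePolicy","-args",{ones(4,1)});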

For more information, see Deploy Trained Reinforcement Learning Policies (Reinforcement Learning Toolbox).
