Main Content

Train Agents Using Parallel Computing and GPUs

If you have Parallel Computing Toolbox™ software, you can run parallel simulations on multicore processors or GPUs. If you additionally have MATLAB® Parallel Server™ software, you can run parallel simulations on computer clusters or cloud resources.

Regardless of which devices you use to simulate or train the agent, once the agent has been trained, you can generate code to deploy the optimal policy on a CPU or GPU. For more information, see Deploy Trained Reinforcement Learning Policies.

Using Multiple Processes

When you train agents using parallel computing, the parallel pool client (the MATLAB process that starts the training) sends copies of both its agent and environment to each parallel worker. Each worker simulates the agent within the environment and sends its simulation data back to the client. The client agent learns from the data sent by the workers and sends the updated policy parameters back to the workers.

To create a parallel pool of N workers, use the following syntax.

pool = parpool(N);

If you do not create a parallel pool using parpool (Parallel Computing Toolbox), the train function automatically creates one using your default parallel pool preferences. For more information on specifying these preferences, see Specify Your Parallel Preferences (Parallel Computing Toolbox). Note that using a parallel pool of thread workers, such as pool = parpool("threads"), is not supported.

To train an agent using multiple processes, pass to the train function an rlTrainingOptions object in which the UseParallel option is set to true.

For more information on configuring your training to use parallel computing, see the UseParallel and ParallelizationOptions options in rlTrainingOptions.
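For example, a minimal parallel training setup can be sketched as follows. Here, agent and env are placeholders for a previously created agent and environment, and the stopping criteria are arbitrary illustrative values.

```matlab
% Sketch: enable parallel training (agent and env are assumed to exist).
trainOpts = rlTrainingOptions( ...
    'UseParallel',true, ...
    'MaxEpisodes',1000, ...
    'StopTrainingCriteria',"AverageReward");

% train starts the default parallel pool if one is not already running.
trainingStats = train(agent,env,trainOpts);
```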

Note that parallel simulation and training of environments containing multiple agents is not supported.

For an example that trains an agent using parallel computing in MATLAB, see Train AC Agent to Balance Cart-Pole System Using Parallel Computing. For examples that train agents using parallel computing in Simulink®, see Train DQN Agent for Lane Keeping Assist Using Parallel Computing and Train Biped Robot to Walk Using Reinforcement Learning Agents.

Agent-Specific Parallel Training Considerations

For off-policy agents, such as DDPG and DQN agents, do not use all of your cores for parallel training. For example, if your CPU has six cores, train with four workers. Doing so provides more resources for the parallel pool client to compute gradients based on the experiences sent back from the workers. Limiting the number of workers is not necessary for on-policy agents, such as AC and PG agents, when the gradients are computed on the workers.

Gradient-Based Parallelization (AC and PG Agents)

To train AC and PG agents in parallel, the DataToSendFromWorkers property of the ParallelizationOptions object (contained in the training options object) must be set to "gradients".

This configures the training so that both the environment simulation and the gradient computations are done by the workers. Specifically, the workers simulate the agent against the environment, compute the gradients from experiences, and send the gradients to the client. The client averages the gradients, updates the network parameters, and sends the updated parameters back to the workers so they can continue simulating the agent with the new parameters.

With gradient-based parallelization, you can in principle achieve a speed improvement that is nearly linear in the number of workers. However, this option requires synchronous training (that is, the Mode property of the rlTrainingOptions object that you pass to the train function must be set to "sync"). This means that each worker must pause execution until all workers have finished, so the training advances only as fast as the slowest worker allows.

When AC agents are trained in parallel, a warning is generated if the NumStepsToLookAhead property of the AC agent options object and the StepsUntilDataIsSent property of the ParallelizationOptions object are set to different values.
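For example, for an AC agent, gradient-based parallelization can be configured as in the following sketch. Setting NumStepsToLookAhead and StepsUntilDataIsSent to the same value avoids the warning mentioned above; the value 32 is an arbitrary choice for illustration.

```matlab
% Sketch: gradient-based parallel training configuration for an AC agent.
agentOpts = rlACAgentOptions('NumStepsToLookAhead',32);

trainOpts = rlTrainingOptions('UseParallel',true);
trainOpts.ParallelizationOptions.Mode = "sync";             % required for "gradients"
trainOpts.ParallelizationOptions.DataToSendFromWorkers = "gradients";
trainOpts.ParallelizationOptions.StepsUntilDataIsSent = 32; % match NumStepsToLookAhead
```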

Experience-Based Parallelization (DQN, DDPG, PPO, TD3, and SAC agents)

To train DQN, DDPG, PPO, TD3, and SAC agents in parallel, the DataToSendFromWorkers property of the ParallelizationOptions object (contained in the training options object) must be set to "experiences". This option does not require synchronous training (that is, the Mode property of the rlTrainingOptions object that you pass to the train function can be set to "async").
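For example, experience-based parallelization with asynchronous training can be configured as in the following sketch.

```matlab
% Sketch: experience-based parallel training (for example, for a DQN agent).
trainOpts = rlTrainingOptions('UseParallel',true);
trainOpts.ParallelizationOptions.Mode = "async";            % asynchronous training allowed
trainOpts.ParallelizationOptions.DataToSendFromWorkers = "experiences";
```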

This configures the training so that the environment simulation is done by the workers and the learning is done by the client. Specifically, the workers simulate the agent against the environment and send experience data (observation, action, reward, next observation, and a termination signal) to the client. The client then computes the gradients from the experiences, updates the network parameters, and sends the updated parameters back to the workers, which continue to simulate the agent with the new parameters.

Experience-based parallelization can reduce training time only when the computational cost of simulating the environment is high compared to the cost of optimizing network parameters. Otherwise, when the environment simulation is fast enough, the workers lie idle waiting for the client to learn and send back the updated parameters.

In other words, experience-based parallelization can improve sample efficiency (that is, the number of samples an agent can process within a given time) only when the ratio R between the environment step complexity and the learning complexity is large. If environment simulation and learning are similarly computationally expensive, experience-based parallelization is unlikely to improve sample efficiency. However, in this case, for off-policy agents you can reduce the mini-batch size to make R larger, thereby improving sample efficiency.

To enforce contiguity in the experience buffer when training DQN, DDPG, TD3, or SAC agents in parallel, set the NumStepsToLookAhead property of the corresponding agent options object to 1. Any other value causes an error when parallel training is attempted.
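For example, for a DDPG agent, the requirement above can be satisfied as follows.

```matlab
% Sketch: DDPG agent options compatible with experience-based parallel training.
agentOpts = rlDDPGAgentOptions('NumStepsToLookAhead',1);  % must be 1 for parallel training
```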

Using GPUs

When using deep neural network function approximators for your actor or critic representation, you can speed up training by performing representation operations (such as gradient computation and prediction) on a local GPU rather than a CPU. To do so, when creating a critic or actor representation, use an rlRepresentationOptions object in which the UseDevice option is set to "gpu" instead of "cpu".

opt = rlRepresentationOptions('UseDevice',"gpu");

The "gpu" option requires both Parallel Computing Toolbox software and a CUDA® enabled NVIDIA® GPU. For more information on supported GPUs, see GPU Support by Release (Parallel Computing Toolbox).

You can use gpuDevice (Parallel Computing Toolbox) to query or select a local GPU device to be used with MATLAB.
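For example, you can query and select a GPU as follows; the device index 1 is an arbitrary choice for illustration.

```matlab
% Query the currently selected GPU device and its properties.
gpuDevice

% Select a specific local GPU device (here, the first one).
gpuDevice(1)
```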

Using GPUs is likely to be beneficial when the deep neural network in the actor or critic representation uses operations such as multiple convolutional layers on input images or has large batch sizes.

For an example on how to train an agent using the GPU, see Train DDPG Agent to Swing Up and Balance Pendulum with Image Observation.

Using Both Multiple Processes and GPUs

You can also train agents using both multiple processes and a local GPU (previously selected using gpuDevice (Parallel Computing Toolbox)) at the same time. Specifically, you can create a critic or actor using an rlRepresentationOptions object in which the UseDevice option is set to "gpu". You can then use the critic and actor to create an agent, and then train the agent using multiple processes. This is done by creating an rlTrainingOptions object in which UseParallel is set to true and passing it to the train function.
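The combined workflow above can be sketched as follows. Here, agent and env are placeholders for an agent and environment built from the GPU-enabled critic (and actor), and the device index 1 is an arbitrary choice.

```matlab
% Sketch: combine GPU-based representations with parallel training.
gpuDevice(1);                                             % select a local GPU
criticOpts = rlRepresentationOptions('UseDevice',"gpu");  % critic computations on the GPU

% ... create the critic (and, if needed, the actor) using criticOpts,
% then create the agent from them ...

trainOpts = rlTrainingOptions('UseParallel',true);
trainingStats = train(agent,env,trainOpts);
```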

For gradient-based parallelization (which must run in synchronous mode), the environment simulation is done by the workers, which use their local GPU to compute the gradients and perform prediction steps. The gradients are then sent back to the parallel pool client process, which averages them, updates the network parameters, and sends them back to the workers so they can continue to simulate the agent, with the new parameters, against the environment.

For experience-based parallelization (which can run in asynchronous mode), the workers simulate the agent against the environment and send experience data back to the parallel pool client. The client then uses its local GPU to compute the gradients from the experiences, updates the network parameters, and sends the updated parameters back to the workers, which continue to simulate the agent, with the new parameters, against the environment.

Note that when using both parallel processing and GPUs to train PPO agents, the workers use their local GPU to compute the advantages, and then send processed experience trajectories (which include advantages, targets, and action probabilities) back to the client.

See Also


Related Topics