Train DDPG, TD3 or SAC agent using an evolutionary strategy within a specified environment
trainStats = trainWithEvolutionStrategy(agent,env,estOpts) trains agent within the environment env, using the evolution strategy training options object estOpts. Note that agent is a handle object and is updated during training, despite being an input argument. For more information on the training algorithm, see Train agent with evolution strategy.
Train Agent Using an Evolutionary Strategy
This example shows how to train a DDPG agent using an evolutionary strategy.
Load the predefined environment object representing a cart-pole system with a continuous action space. For more information on this environment, see Load Predefined Control System Environments.
env = rlPredefinedEnv("CartPole-Continuous");
The agent networks are initialized randomly. Ensure reproducibility by fixing the seed of the random generator.
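For example, you can reset the generator to its default seed:
rng(0)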
Create a DDPG agent with default networks.
agent = rlDDPGAgent(getObservationInfo(env),getActionInfo(env));
To create an evolution strategy training options object, use rlEvolutionStrategyTrainingOptions. For this example, specify a population of 10 individuals, return the best policy after training, and stop training after 100 episodes:
estOpts = rlEvolutionStrategyTrainingOptions(...
    PopulationSize=10, ...
    ReturnedPolicy="BestPolicy", ...
    StopTrainingCriteria="EpisodeCount", ...
    StopTrainingValue=100);
To train the agent, use the trainWithEvolutionStrategy function:
trainStats = trainWithEvolutionStrategy(agent,env,estOpts);
Display the reward accumulated during the last episode.
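One way to do this is to index the last entry of the EpisodeReward property of the returned result object (described under the output arguments below):
trainStats.EpisodeReward(end)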
ans = 496.2431
This value means that the agent is able to balance the cart-pole system for the whole episode.
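As an additional check, you can simulate the trained agent in the environment. A minimal sketch using the toolbox sim function (not part of the original example):
experience = sim(env,agent);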
agent — DDPG, TD3 or SAC agent
rlDDPGAgent object | rlTD3Agent object | rlSACAgent object
Agent to train, specified as an rlDDPGAgent, rlTD3Agent, or rlSACAgent object.
trainWithEvolutionStrategy updates the agent as training progresses. For more information on how to preserve the original agent, how to save an agent during training, and on the state of agent after training, see the notes and the tips section in train. For more information about handle objects, see Handle Object Behavior.
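For example, because agent is a handle object that trainWithEvolutionStrategy modifies, a minimal way to preserve the initial agent is to save a copy to disk before training, using the standard MATLAB save command (the file name here is arbitrary):
save("initialAgent.mat","agent")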
For more information about how to create and configure agents for reinforcement learning, see Reinforcement Learning Agents.
env — Environment
reinforcement learning environment object
Environment in which the agent acts, specified as a reinforcement learning environment object.
Multiagent environments do not support training agents with an evolution strategy.
For more information about creating and configuring environments, see Create MATLAB Reinforcement Learning Environments and Create Simulink Reinforcement Learning Environments. When env is a Simulink environment, calling trainWithEvolutionStrategy compiles and simulates the model associated with the environment.
estOpts — Parameters and options for training using an evolution strategy
Parameters and options for training using an evolution strategy, specified as an
rlEvolutionStrategyTrainingOptions object. Use this argument to specify
parameters and options such as:
Population update method
Number of training epochs
Criteria for saving candidate agents
How to display training progress
trainWithEvolutionStrategy does not support parallel training. For details, see rlEvolutionStrategyTrainingOptions.
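Like other options objects, rlEvolutionStrategyTrainingOptions supports dot notation for setting properties after creation; a minimal sketch using the PopulationSize property shown in the example above:
estOpts = rlEvolutionStrategyTrainingOptions;
estOpts.PopulationSize = 20;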
trainStats — Training episode data
Training episode data, returned as an rlTrainingResult object. The following properties pertain to the rlTrainingResult object.
EpisodeIndex — Episode numbers
Episode numbers, returned as the column vector [1;2;…;N], where N is the number of episodes in the training run. This vector is useful if you want to plot the evolution of other quantities from episode to episode.
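For instance, this short sketch plots the reward for each episode (the EpisodeReward property, described next) against the episode number:
plot(trainStats.EpisodeIndex,trainStats.EpisodeReward)
xlabel("Episode")
ylabel("Reward")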
EpisodeReward — Reward for each episode
Reward for each episode, returned as a column vector of length N. Each entry contains the reward for the corresponding episode.
EpisodeSteps — Number of steps in each episode
Number of steps in each episode, returned as a column vector of length N. Each entry contains the number of steps in the corresponding episode.
AverageReward — Average reward over the averaging window
Average reward over the averaging window specified in estOpts, returned as a column vector of length N. Each entry contains the average reward computed at the end of the corresponding episode.
TotalAgentSteps — Total number of steps
Total number of agent steps in training, returned as a column vector of length
N. Each entry contains the cumulative sum of the entries in
EpisodeSteps up to that point.
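As a consequence, the following check (a sketch based on the property descriptions above) returns true for any training result:
isequal(trainStats.TotalAgentSteps,cumsum(trainStats.EpisodeSteps))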
EpisodeQ0 — Critic estimate of expected discounted cumulative long-term reward at the beginning of each episode
Critic estimate of the expected discounted cumulative long-term reward using the current agent and the environment initial conditions, returned as a column vector of length N. Each entry is the critic estimate (Q0) for the agent at the beginning of the corresponding episode. This field is present only for agents that have critics, such as the rlDDPGAgent, rlTD3Agent, and rlSACAgent agents that this function supports.
SimulationInfo — Information collected during simulation
structure | vector of Simulink.SimulationOutput objects
Information collected during the simulations performed for training, returned as:
For training in MATLAB environments, a structure containing the field SimulationError. This field is a column vector with one entry per episode. When the StopOnError training option is "off", each entry contains any errors that occurred during the corresponding episode. Otherwise, the field contains an empty array (see the example after this list).
For training in Simulink environments, a vector of
Simulink.SimulationOutput objects containing simulation data recorded during the corresponding episode. Recorded data for an episode includes any signals and states that the model is configured to log, simulation metadata, and any errors that occurred during the corresponding episode.
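For example, for training in a MATLAB environment, you might inspect the error entry for the first episode as follows (a hypothetical sketch based on the structure described above):
trainStats.SimulationInfo.SimulationError(1)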
EvaluationStatistic — Evaluation statistic for each episode
Evaluation statistic for each episode, returned as a column vector of length N.
TrainingOptions — Training options set
Training options set, returned as an rlEvolutionStrategyTrainingOptions object.
Introduced in R2023b