rlPrioritizedReplayMemory
Replay memory experience buffer with prioritized sampling
Description
An off-policy reinforcement learning agent stores experiences in a circular experience buffer. During training, the agent samples mini-batches of experiences from the buffer and uses these mini-batches to update its actor and critic function approximators.
By default, built-in off-policy agents (DQN, DDPG, TD3, SAC, MBPO) use an rlReplayMemory object as their experience buffer. Agents uniformly sample data from this buffer. To perform nonuniform prioritized sampling [1], which can improve sample efficiency when training your agent, use an rlPrioritizedReplayMemory object.
For more information on prioritized sampling, see Algorithms.
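For example, you can replace the default experience buffer of an existing off-policy agent with a prioritized buffer. The following sketch assumes a default DQN agent for the predefined discrete cart-pole environment, the rlPrioritizedReplayMemory(obsInfo,actInfo,maxLength) creation syntax, and that the PriorityExponent and NumAnnealingSteps properties are settable; verify these details against the Creation and Properties sections of this page.

% Environment and its data specifications
env = rlPredefinedEnv("CartPole-Discrete");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

% DQN agent with default networks; by default it uses an rlReplayMemory buffer
agent = rlDQNAgent(obsInfo,actInfo);

% Replace the default buffer with a prioritized replay memory
% (three-argument syntax with a maximum buffer length is assumed)
agent.ExperienceBuffer = rlPrioritizedReplayMemory(obsInfo,actInfo,1e5);

% Optionally adjust the prioritization hyperparameters (assumed settable)
agent.ExperienceBuffer.PriorityExponent = 0.6;
agent.ExperienceBuffer.NumAnnealingSteps = 1e5;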
Creation
Syntax
Description
Input Arguments
Properties
Object Functions
append | Append experiences to replay memory buffer
sample | Sample experiences from replay memory buffer
resize | Resize replay memory experience buffer
allExperiences | Return all experiences in replay memory buffer
getActionInfo | Obtain action data specifications from reinforcement learning environment, agent, or experience buffer
getObservationInfo | Obtain observation data specifications from reinforcement learning environment, agent, or experience buffer
reset | Reset environment, agent, experience buffer, or policy object
Examples
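The following sketch shows how you might use the object functions listed above to append and sample experiences directly, outside of an agent. The experience structure fields (Observation, Action, Reward, NextObservation, IsDone) and the exact append and sample signatures are assumptions; confirm them on the append and sample reference pages.

% Data specifications for a hypothetical environment with a
% 4-dimensional observation and a two-valued discrete action
obsInfo = rlNumericSpec([4 1]);
actInfo = rlFiniteSetSpec([-1 1]);

% Prioritized replay memory with an assumed maximum length argument
buffer = rlPrioritizedReplayMemory(obsInfo,actInfo,10000);

% Append one experience (struct field names are assumptions)
experience.Observation = {rand(4,1)};
experience.Action = {1};
experience.Reward = 1;
experience.NextObservation = {rand(4,1)};
experience.IsDone = 0;
append(buffer,experience);

% Sample a mini-batch of one experience using prioritized sampling
miniBatch = sample(buffer,1);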
Limitations
Prioritized experience replay does not support agents that use recurrent neural networks.
Algorithms
Prioritized replay memory samples experiences according to experience priorities. For a given experience, the priority is defined as the absolute value of the associated temporal difference (TD) error. A larger TD error indicates that the critic network is not well-trained for the corresponding experience. Therefore, sampling such experiences during critic updates can help efficiently improve the critic performance, which often improves the sample efficiency of agent training.
When using prioritized replay memory, agents use the following process when sampling a mini-batch of experiences and updating a critic. A sketch of the corresponding equations, based on [1], follows these steps.
1. Compute the sampling probability P for each experience in the buffer based on the experience priority. Here:
   N is the number of experiences in the replay memory buffer.
   p is the experience priority.
   α is a priority exponent. To set α, use the PriorityExponent parameter.
2. Sample a mini-batch of experiences according to the computed probabilities.
3. Compute the importance sampling weights (w) for the sampled experiences.
   Here, β is the importance sampling exponent. The ImportanceSamplingExponent parameter contains the current value of β. To control β, set the ImportanceSamplingExponent and NumAnnealingSteps parameters.
4. Compute the weighted loss using the importance sampling weights w and the TD error δ to update a critic.
5. Update the priorities of the sampled experiences based on the TD error.
6. Update the importance sampling exponent β by linearly annealing the exponent value until it reaches 1.
   Here:
   β0 is the initial importance sampling exponent. To specify β0, use the InitialImportanceSamplingExponent parameter.
   NS is the number of annealing steps. To specify NS, use the NumAnnealingSteps parameter.
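In the notation of [1], these steps correspond approximately to the following equations, where M is the mini-batch size and k is the current annealing step; the squared-error form of the critic loss, the cap of β at 1, and the use of k as the step index are assumptions based on that reference and on the step descriptions above.

\[
P(i) = \frac{p_i^{\alpha}}{\sum_{j=1}^{N} p_j^{\alpha}}, \qquad
w_i = \bigl(N \, P(i)\bigr)^{-\beta}, \qquad
L \approx \frac{1}{M}\sum_{i=1}^{M} w_i \, \delta_i^{2}, \qquad
\beta = \min\!\left(1,\ \beta_0 + \frac{(1-\beta_0)\,k}{N_S}\right)
\]

In [1], the importance sampling weights are additionally normalized by their maximum value over the mini-batch for stability; whether this object applies that normalization is not stated on this page.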
References
[1] Schaul, Tom, John Quan, Ioannis Antonoglou, and David Silver. "Prioritized Experience Replay." arXiv:1511.05952 [cs], February 25, 2016. https://arxiv.org/abs/1511.05952.
Version History
Introduced in R2022b