rlHindsightReplayMemory

Hindsight replay memory experience buffer

Since R2023a

    Description

    An off-policy reinforcement learning agent stores experiences in a circular experience buffer.

    During training the agent stores each of its experiences (S,A,R,S',D) in the buffer. Here:

    • S is the current observation of the environment.

    • A is the action taken by the agent.

    • R is the reward for taking action A.

    • S' is the next observation after taking action A.

    • D is the is-done signal after taking action A.

    The agent then samples mini-batches of experiences from the buffer and uses these mini-batches to update its actor and critic function approximators.
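
    For example, given an rlHindsightReplayMemory object named buffer that already contains experiences, you can draw a mini-batch in a custom training loop using the sample object function. This is a minimal sketch; the mini-batch size of 128 is only illustrative.

    % Sample a mini-batch of experiences from the buffer.
    % The buffer must contain at least miniBatchSize experiences.
    miniBatchSize = 128;
    miniBatch = sample(buffer,miniBatchSize);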

    By default, built-in off-policy agents (DQN, DDPG, TD3, SAC, MBPO) use an rlReplayMemory object as their experience buffer. For goal-conditioned tasks, where the observation includes both the goal and a goal measurement, you can use an rlHindsightReplayMemory object.

    A hindsight replay memory experience buffer:

    • Generates additional experiences by replacing goals with goal measurements

    • Improves sample efficiency for tasks with sparse rewards

    • Requires a ground-truth reward function and is-done function

    • Is not necessary when you have a well-shaped reward function

    rlHindsightReplayMemory objects uniformly sample experiences from the buffer. To use prioritized nonuniform sampling, which can improve sample efficiency, use an rlHindsightPrioritizedReplayMemory object.
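
    For example, the following sketch creates a prioritized hindsight buffer, assuming that rlHindsightPrioritizedReplayMemory accepts the same constructor arguments as rlHindsightReplayMemory.

    % Sketch: prioritized nonuniform sampling variant of the buffer
    % (assumes the same constructor arguments as rlHindsightReplayMemory).
    prioritizedBuffer = rlHindsightPrioritizedReplayMemory( ...
        obsInfo,actInfo,rewardFcn,isDoneFcn,goalConditionInfo);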

    For more information on hindsight experience replay, see Algorithms.

    Creation

    Description

    buffer = rlHindsightReplayMemory(obsInfo,actInfo,rewardFcn,isDoneFcn,goalConditionInfo) creates a hindsight replay memory experience buffer that is compatible with the observation and action specifications in obsInfo and actInfo, respectively. This syntax sets the RewardFcn, IsDoneFcn, and GoalConditionInfo properties.

    buffer = rlHindsightReplayMemory(obsInfo,actInfo,rewardFcn,isDoneFcn,goalConditionInfo,maxLength) sets the maximum length of the buffer by setting the MaxLength property.

    Input Arguments

    Observation specifications, specified as a reinforcement learning specification object or an array of specification objects defining properties such as dimensions, data types, and names of the observation signals.

    You can extract the observation specifications from an existing environment or agent using getObservationInfo. You can also construct the specifications manually using rlFiniteSetSpec or rlNumericSpec.

    Action specifications, specified as a reinforcement learning specification object defining properties such as dimensions, data types, and names of the action signals.

    You can extract the action specifications from an existing environment or agent using getActionInfo. You can also construct the specification manually using rlFiniteSetSpec or rlNumericSpec.

    Properties

    This property is read-only.

    Maximum buffer length, specified as a nonnegative integer.

    To change the maximum buffer length, use the resize function.
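
    For example, given an existing buffer object named buffer, the following sketch increases the maximum length; the value 100000 is only illustrative.

    % Resize the buffer to hold up to 100000 experiences.
    resize(buffer,100000);
    buffer.MaxLength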

    This property is read-only.

    Number of experiences in buffer, specified as a nonnegative integer.

    Goal condition information, specified as a 1-by-N cell array, where N is the number of goal conditions. For the ith goal condition, the corresponding cell of GoalConditionInfo contains a 1-by-4 cell array with the following elements.

    • GoalConditionInfo{i}{1} — Goal measurement channel index.

    • GoalConditionInfo{i}{2} — Goal measurement element indices.

    • GoalConditionInfo{i}{3} — Goal channel index.

    • GoalConditionInfo{i}{4} — Goal element indices.

    The goal measurements in GoalConditionInfo{i}{2} correspond to the goals in GoalConditionInfo{i}{4}.

    As an example, suppose that obsInfo contains specifications for two observation channels. Further, suppose that there is one goal condition where the goal measurements correspond to elements 2 and 3 of the first observation channel, and the goals correspond to elements 4 and 5 of the second observation channel. In this case, the goal condition information is:

    GoalConditionInfo = {{1,[2 3],2,[4 5]}};

    Reward function, specified as a handle to a function with the following signature.

    function reward = myRewardFcn(obs,action,nextObs)

    Here:

    • reward is a scalar reward value.

    • obs is the current observation.

    • action is the action taken from the current observation.

    • nextObs is the next observation after taking the specified action.

    Is-done function, specified as a handle to a function with the following signature.

    function isdone = myIsDoneFcn(obs,action,nextObs)

    Here:

    • isdone is true when the next observation is a terminal state and false otherwise.

    • obs is the current observation.

    • action is the action taken from the current observation.

    • nextObs is the next observation after taking the specified action.

    Goal measurement sampling strategy, specified as one of the following values.

    • "final" — Use the goal measurement from the end of the trajectory.

    • "episode" — Randomly sample M goal measurements from the trajectory, where M is equal to NumGoalSamples.

    • "future" — Randomly sample M goal measurements from the trajectory, but create hindsight experiences for measurements that were observed at time t+1 or later.

    Number of goal measurements to sample when generating experiences, specified as a positive integer. This parameter is ignored when Strategy is "final".
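
    For example, the following sketch selects the "future" strategy and samples four goal measurements per experience. This assumes that you can set Strategy and NumGoalSamples after creating the buffer; the value 4 is only illustrative.

    % Use future goal measurements instead of the final one.
    buffer.Strategy = "future";
    buffer.NumGoalSamples = 4;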

    Object Functions

    append — Append experiences to replay memory buffer
    sample — Sample experiences from replay memory buffer
    resize — Resize replay memory experience buffer
    reset — Reset environment, agent, experience buffer, or policy object
    allExperiences — Return all experiences in replay memory buffer
    validateExperience — Validate experiences for replay memory
    generateHindsightExperiences — Generate hindsight experiences from hindsight experience replay buffer
    getActionInfo — Obtain action data specifications from reinforcement learning environment, agent, or experience buffer
    getObservationInfo — Obtain observation data specifications from reinforcement learning environment, agent, or experience buffer

    Examples

    For this example, create observation and action specifications directly. You can also extract such specifications from your environment.

    Create an observation specification for an environment with a single observation channel with six observations. For this example, assume that the observation channel contains the signals [a, xm, ym, xg, yg, c], where:

    • xg and yg are the goal observations.

    • xm and ym are the goal measurements.

    • a and c are additional observations.

    obsInfo = rlNumericSpec([6 1]);

    Create a specification for a single action.

    actInfo = rlNumericSpec([1 1]);

    Create a DDPG agent from the environment specifications. By default, the agent uses a replay memory experience buffer with uniform sampling.

    agent = rlDDPGAgent(obsInfo,actInfo);

    To create a hindsight replay memory buffer, first define the goal condition information. Both the goals and goal measurements are in the single observation channel. The goal measurements are in elements 2 and 3 of the observation channel and the goals are in elements 4 and 5 of the observation channel.

    goalConditionInfo = {{1,[2 3],1,[4 5]}};

    Define an is-done function. For this example, the is-done signal is true when the next observation satisfies the goal condition sqrt((xm - xg)^2 + (ym - yg)^2) < 0.1.

    function isdone = hindsightIsDoneFcn1(Observation,Action,NextObservation)
        NextObservation = NextObservation{1};
        xm = NextObservation(2);
        ym = NextObservation(3);
        xg = NextObservation(4);
        yg = NextObservation(5);
        isdone = sqrt((xm-xg)^2 + (ym-yg)^2) < 0.1;
    end
    

    Define a reward function. For this example, the reward is 1 when the is-done signal is true and –0.01 otherwise.

    function reward = hindsightRewardFcn1(Observation,Action,NextObservation)
        isdone = hindsightIsDoneFcn1(Observation,Action,NextObservation);
        if isdone
            reward = 1;
        else
            reward = -0.01;
        end
    end
    

    Create a hindsight replay memory buffer with a default maximum length.

    buffer = rlHindsightReplayMemory(obsInfo,actInfo, ...
        @hindsightRewardFcn1,@hindsightIsDoneFcn1,goalConditionInfo);

    Replace the default experience buffer with the hindsight replay memory buffer.

    agent.ExperienceBuffer = buffer;

    For this example, create observation and action specifications directly. You can also extract such specifications from your environment.

    Create observation specifications for an environment with two observation channels. For this example, assume that the first observation channel contains the signals [a, xm, ym] and the second observation channel contains the signals [xg, yg, c], where:

    • xg and yg are the goal observations.

    • xm and ym are the goal measurements.

    • a and c are additional observations.

    obsInfo = [rlNumericSpec([3 1]), rlNumericSpec([3 1])];

    Create a specification for a single action.

    actInfo = rlNumericSpec([1 1]);

    To create a hindsight replay memory buffer, first define the goal condition information. The goal measurements are in elements 2 and 3 of the first observation channel and the goals are in elements 1 and 2 of the second observation channel.

    goalConditionInfo = {{1,[2 3],2,[1 2]}};

    Define an is-done function. For this example, the is-done signal is true when the next observation satisfies the goal condition sqrt((xm - xg)^2 + (ym - yg)^2) < 0.1.

    function isdone = hindsightIsDoneFcn2(Observation,Action,NextObservation)
        NextObsCh1 = NextObservation{1};
        NextObsCh2 = NextObservation{2};
        xm = NextObsCh1(2);
        ym = NextObsCh1(3);
        xg = NextObsCh2(1);
        yg = NextObsCh2(2);
        isdone = sqrt((xm-xg)^2 + (ym-yg)^2) < 0.1;
    end
    

    Define a reward function. For this example, the reward is 1 when the is-done signal is true and 0 otherwise.

    function reward = hindsightRewardFcn2(Observation,Action,NextObservation)
        isdone = hindsightIsDoneFcn2(Observation,Action,NextObservation);
        if isdone
            reward = 1;
        else
            reward = 0;
        end
    end
    

    Create a hindsight replay memory buffer with a maximum length of 20000.

    buffer = rlHindsightReplayMemory(obsInfo,actInfo, ...
        @hindsightRewardFcn2,@hindsightIsDoneFcn2,...
        goalConditionInfo,20000);

    For this example, create observation and action specifications directly. You can also extract such specifications from your environment.

    Create an observation specification for an environment with a single observation channel with eight observations. For this example, assume that the observation channel contains the signals [a, xm, ym, θ, xg, yg, θm, c], where:

    • xg, yg, and θ are the goal observations.

    • xm, ym, and θm are the goal measurements.

    • a and c are additional observations.

    obsInfo = rlNumericSpec([8 1]);

    Create a specification for a single action.

    actInfo = rlNumericSpec([1 1]);

    To create a hindsight replay memory buffer, first define the goal condition information. For this example, define two goal conditions.

    The first goal condition depends on xm, ym, xg, and yg as shown in the following equation.

    sqrt((xm - xg)^2 + (ym - yg)^2) < 0.1

    Specify the information for this goal condition.

    goalConditionInfo1 = {1,[2 3], 1, [5 6]};

    The second goal condition depends on θm and θ as shown in the following equation.

    (θm - θ)^2 < 0.01

    Specify the information for this goal condition.

    goalConditionInfo2 = {1,7,1,4};

    Combine the goal condition information into a cell array.

    goalConditionInfo = {goalConditionInfo1, goalConditionInfo2};

    Define an is-done function that returns true when the next observation satisfies both goal conditions.

    function isdone = hindsightIsDoneFcn3(Observation,Action,NextObservation) 
        NextObservation = NextObservation{1};
        xm = NextObservation(2);
        ym = NextObservation(3);
        xg = NextObservation(5);
        yg = NextObservation(6);
        thetam = NextObservation(7);
        theta = NextObservation(4);
        isdone = sqrt((xm-xg)^2 + (ym-yg)^2) < 0.1 ...
          && (thetam-theta)^2 < 0.01;
    end
    

    Define a reward function. For this example, the reward is 1 when the is-done signal is true and –0.01 otherwise.

    function reward = hindsightRewardFcn3(Observation,Action,NextObservation)
        isdone = hindsightIsDoneFcn3(Observation,Action,NextObservation);
        if isdone
            reward = 1;
        else
            reward = -0.01;
        end
    end
    

    Create a hindsight replay memory buffer with a default maximum length.

    buffer = rlHindsightReplayMemory(obsInfo,actInfo, ...
        @hindsightRewardFcn3,@hindsightIsDoneFcn3, ...
        goalConditionInfo);

    When you use a hindsight replay memory buffer within your custom agent training loop, you generate experiences at the end of each training episode.

    Create an observation specification for an environment with a single observation channel with six observations. For this example, assume that the observation channel contains the signals [a, xm, ym, xg, yg, c], where:

    • xg and yg are the goal observations.

    • xm and ym are the goal measurements.

    • a and c are additional observations.

    obsInfo = rlNumericSpec([6 1],...
        LowerLimit=0,UpperLimit=[1;5;5;5;5;1]);

    Create a specification for a single action.

    actInfo = rlNumericSpec([1 1],...
        LowerLimit=0,UpperLimit=10);

    To create a hindsight replay memory buffer, first define the goal condition information. Both the goals and goal measurements are in the single observation channel. The goal measurements are in elements 2 and 3 of the observation channel and the goals are in elements 4 and 5 of the observation channel.

    goalConditionInfo = {{1,[2 3],1,[4 5]}};

    For this example, use hindsightRewardFcn1 as the ground-truth reward function and hindsightIsDoneFcn1 as the termination condition function.

    Create the hindsight replay memory buffer.

    buffer = rlHindsightReplayMemory(obsInfo,actInfo, ...
        @hindsightRewardFcn1,@hindsightIsDoneFcn1,goalConditionInfo);

    As you train your agent, you add experience trajectories to the experience buffer. For this example, add a random experience trajectory of length 10.

    for i = 1:10
        exp(i).Observation = {obsInfo.UpperLimit.*rand(6,1)};
        exp(i).Action = {actInfo.UpperLimit.*rand(1)};
        exp(i).NextObservation = {obsInfo.UpperLimit.*rand(6,1)};
        exp(i).Reward = 10*rand(1);
        exp(i).IsDone = 0;
    end
    exp(10).IsDone = 1;
    
    append(buffer,exp);

    At the end of the training episode, you generate hindsight experiences from the last trajectory added to the buffer. Generate experiences specifying the length of the last trajectory added to the buffer.

    newExp = generateHindsightExperiences(buffer,10);

    For each experience in the final trajectory, the default "final" sampling strategy generates a new experience in which the goals in Observation and NextObservation are replaced with the goal measurements from the final experience in the trajectory.

    To validate this behavior, first view the final goal measurements from exp.

    exp(10).NextObservation{1}(2:3)
    ans = 2×1
    
        0.7277
        0.6803
    
    

    Next, view the goal values for one of the generated experiences. These values should match the final goal measurements.

    newExp(6).Observation{1}(4:5)
    ans = 2×1
    
        0.7277
        0.6803
    
    

    After generating the new experiences, append them to the buffer.

    append(buffer,newExp);
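
    To check that the buffer now contains both the original trajectory and the generated hindsight experiences, you can retrieve its contents. This is a minimal sketch using the allExperiences object function.

    % Retrieve every experience currently stored in the buffer.
    allExp = allExperiences(buffer);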

    Limitations

    • Hindsight experience replay does not support agents that use recurrent neural networks.

    Algorithms

    In hindsight experience replay, the buffer generates additional experiences from the most recently completed trajectory at the end of each training episode. For each experience in the trajectory, the buffer selects goal measurements according to the Strategy property, replaces the goals in the observation and next observation with those measurements, and recomputes the reward and is-done signals using RewardFcn and IsDoneFcn. The resulting hindsight experiences are appended to the buffer alongside the original experiences. Because the replaced goals correspond to outcomes the agent actually achieved, the generated experiences provide an informative reward signal even when the original task rewards are sparse. For more information, see [1].

    References

    [1] Andrychowicz, Marcin, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, Pieter Abbeel, and Wojciech Zaremba. "Hindsight Experience Replay." 31st Conference on Neural Information Processing Systems. Long Beach, CA, USA: 2017.

    Version History

    Introduced in R2023a