generateHindsightExperiences
Generate hindsight experiences from hindsight experience replay buffer
Since R2023a
Description
experience = generateHindsightExperiences(buffer,trajectoryLength) generates hindsight experiences from the last trajectory added to the specified hindsight experience replay memory buffer.
Examples
When you use a hindsight replay memory buffer within your custom agent training loop, you generate experiences at the end of each training episode.
Create an observation specification for an environment with a single observation channel with six observations. For this example, assume that the observation channel contains the signals [a, m1, m2, g1, g2, b], where:

g1 and g2 are the goal observations.
m1 and m2 are the goal measurements.
a and b are additional observations.
obsInfo = rlNumericSpec([6 1],...
    LowerLimit=0,UpperLimit=[1;5;5;5;5;1]);

Create a specification for a single action.
actInfo = rlNumericSpec([1 1],...
    LowerLimit=0,UpperLimit=10);

To create a hindsight replay memory buffer, first define the goal condition information. Both the goals and the goal measurements are in the single observation channel. The goal measurements are in elements 2 and 3 of the observation channel, and the goals are in elements 4 and 5.
goalConditionInfo = {{1,[2 3],1,[4 5]}};

For this example, use hindsightRewardFcn1 as the ground-truth reward function and hindsightIsDoneFcn1 as the termination condition function.
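This page does not show the bodies of hindsightRewardFcn1 and hindsightIsDoneFcn1. As a rough, hypothetical sketch only, a goal-reaching reward function and termination function for this observation layout might look like the following; the argument list is an assumption here, so check the rlHindsightReplayMemory reference page for the signature that the buffer actually requires.

% Hypothetical sketch: the argument list assumed here may differ from
% the signature that rlHindsightReplayMemory requires.
function reward = hindsightRewardFcn1(obs,action,nextObs)
    % Reward 1 when the goal measurements (elements 2 and 3) are close
    % to the goals (elements 4 and 5), and 0 otherwise.
    goalMeasurement = nextObs{1}(2:3);
    goal = nextObs{1}(4:5);
    reward = double(norm(goalMeasurement - goal) < 0.1);
end

function isDone = hindsightIsDoneFcn1(obs,action,nextObs)
    % Terminate the episode once the goal is reached.
    goalMeasurement = nextObs{1}(2:3);
    goal = nextObs{1}(4:5);
    isDone = norm(goalMeasurement - goal) < 0.1;
end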
Create the hindsight replay memory buffer.
buffer = rlHindsightReplayMemory(obsInfo,actInfo, ...
    @hindsightRewardFcn1,@hindsightIsDoneFcn1,goalConditionInfo);

As you train your agent, you add experience trajectories to the experience buffer. For this example, add a random experience trajectory of length 10.
for i = 1:10
    exp(i).Observation = {obsInfo.UpperLimit.*rand(6,1)};
    exp(i).Action = {actInfo.UpperLimit.*rand(1)};
    exp(i).NextObservation = {obsInfo.UpperLimit.*rand(6,1)};
    exp(i).Reward = 10*rand(1);
    exp(i).IsDone = 0;
end
exp(10).IsDone = 1;

append(buffer,exp);
At the end of the training episode, you generate hindsight experiences from the last trajectory added to the buffer. Generate experiences, specifying the length of that trajectory.
newExp = generateHindsightExperiences(buffer,10);
For each experience in the final trajectory, the default "final" sampling strategy generates a new experience in which the goals in Observation and NextObservation are replaced with the goal measurements from the final experience in the trajectory.
To validate this behavior, first view the final goal measurements from exp.
exp(10).NextObservation{1}(2:3)

ans = 2×1

    0.7277
    0.6803
Next, view the goal values for one of the generated experiences. These values should match the final goal measurements.
newExp(6).Observation{1}(4:5)

ans = 2×1

    0.7277
    0.6803
After generating the new experiences, append them to the buffer.
append(buffer,newExp);
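In a complete custom training loop, this generate-and-append step typically runs once per episode, after which you can draw minibatches for learning. The outline below is only a sketch: collectEpisodeTrajectory and learnFromBatch are hypothetical placeholders for your own environment-interaction and agent-update code, while append, generateHindsightExperiences, and sample are the replay-memory functions used above.

maxEpisodes = 100;
for episode = 1:maxEpisodes
    % Hypothetical helper: run the current policy in the environment and
    % return a structure array of experiences.
    traj = collectEpisodeTrajectory();
    append(buffer,traj);                 % store the raw trajectory
    % Relabel the episode goals and store the hindsight experiences too.
    newExp = generateHindsightExperiences(buffer,numel(traj));
    append(buffer,newExp);
    % Draw a random minibatch and update the agent (hypothetical helper).
    miniBatch = sample(buffer,64);
    learnFromBatch(miniBatch);
end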
Input Arguments
Hindsight experience buffer, specified as an rlHindsightReplayMemory or rlHindsightPrioritizedReplayMemory object.
Length of last trajectory in buffer, specified as a positive integer.
Output Arguments
Generated hindsight experiences, returned as a structure array with the following fields.
Observation, returned as a cell array with length equal to the number of observation specifications defined when you create the buffer. Each element of Observation contains a DO-by-batchSize-by-SequenceLength array, where DO is the dimension of the corresponding observation specification.
Agent action, returned as a cell array with length equal to the number of action specifications defined when you create the buffer. Each element of Action contains a DA-by-batchSize-by-SequenceLength array, where DA is the dimension of the corresponding action specification.
Reward value obtained by taking the specified action from the observation, returned as a 1-by-1-by-SequenceLength array.
Next observation reached by taking the specified action from the observation, returned as a cell array with the same format as Observation.
Termination signal, returned as a 1-by-1-by-SequenceLength array of integers. Each element of IsDone has one of the following values.

0: This experience is not the end of an episode.
1: The episode terminated because the environment generated a termination signal.
2: The episode terminated by reaching the maximum episode length.
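To see this layout concretely, you can inspect the fields of one of the experiences generated in the example above. Here the batch and sequence dimensions are assumed to be 1, because each generated experience holds a single step.

% Inspect one generated experience from the earlier example.
e = newExp(1);
size(e.Observation{1})   % 6-by-1 observation array for this channel
e.Action{1}              % scalar action
e.Reward                 % reward recomputed by the ground-truth reward function
e.IsDone                 % 0, 1, or 2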
Version History
Introduced in R2023a