rlTrainingFromDataOptions

Options to train reinforcement learning agents using existing data

Since R2023a

    Description

    Use an rlTrainingFromDataOptions object to specify options to train an off-policy agent from existing data. Training options include the maximum number of epochs to train, criteria for stopping training and criteria for saving agents. To train the agent using the specified options, pass this object to trainFromData.

    For more information on training agents, see Train Reinforcement Learning Agents.

    Creation

    Description

    tfdOpts = rlTrainingFromDataOptions returns a default options set to train an off-policy agent offline, from existing data.

    tfdOpts = rlTrainingFromDataOptions(Name=Value) creates the training option set tfdOpts and sets its properties using one or more name-value arguments.

    Properties

    MaxEpochs: Maximum number of epochs to train the agent, specified as a positive integer. Each epoch has a fixed number of learning steps specified by NumStepsPerEpoch. Regardless of other criteria for termination, training terminates after MaxEpochs epochs.

    Example: MaxEpochs=500

    NumStepsPerEpoch: Number of steps to run per epoch, specified as a positive integer.

    Example: NumStepsPerEpoch=1000
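
    As a rough illustration of how the two options above combine, the following sketch builds an options object and computes the largest number of learning steps that training could run. The variable name maxLearningSteps is introduced here for illustration only, and the product is an upper bound that assumes no stopping criterion ends training early.

    ```matlab
    % Sketch: relate MaxEpochs and NumStepsPerEpoch to the total learning steps.
    tfdOpts = rlTrainingFromDataOptions(MaxEpochs=500,NumStepsPerEpoch=1000);

    % Upper bound on the number of learning steps (hypothetical helper variable).
    % Training can end sooner if a stopping criterion is met.
    maxLearningSteps = tfdOpts.MaxEpochs * tfdOpts.NumStepsPerEpoch;   % 500000
    ```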

    ExperienceBufferUpdateFrequency: Buffer update period, specified as a positive integer. For example, if the value of this option is 1 (default), the buffer updates every epoch; if it is 2, the buffer updates every other epoch; and so on. The experience buffer is not updated if it already contains all the available data.

    Example: ExperienceBufferUpdateFrequency=2

    NumExperiencesPerExperienceBufferUpdate: Number of experiences appended per buffer update, specified as a positive integer or empty matrix. If the value of this option is left empty (default), then, at training time, it is automatically set to half the length of the experience buffer used by the agent.

    Example: NumExperiencesPerExperienceBufferUpdate=5e5
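
    The two buffer options above are typically set together. The following is a minimal sketch that uses only the property names documented on this page; the variable names tfdOpts and defaultOpts are illustrative.

    ```matlab
    % Sketch: update the experience buffer every other epoch,
    % appending 5e5 experiences at each update.
    tfdOpts = rlTrainingFromDataOptions( ...
        ExperienceBufferUpdateFrequency=2, ...
        NumExperiencesPerExperienceBufferUpdate=5e5);

    % In a default options set the number of appended experiences is empty,
    % so it is resolved automatically at training time.
    defaultOpts = rlTrainingFromDataOptions;
    isAutoSized = isempty(defaultOpts.NumExperiencesPerExperienceBufferUpdate);
    ```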

    QValueObservations: Batch of observations used to compute Q-values, specified as a 1-by-N cell array, where N is the number of observation channels. Each cell must contain a batch of observations, along the batch dimension, for the corresponding observation channel. For example, if you have two observation channels carrying a 3-by-1 vector and a scalar, a batch of 10 random observations is {rand(3,1,10),rand(1,1,10)}.

    If the value of this option is left empty (default), then, at training time, it is automatically set to a cell array in which each element corresponding to an observation channel is an array of zeros having the same dimensions as the observation, without any batch dimension.

    Example: QValueObservations={rand(3,1,10),rand(1,1,10)}
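
    To make the batch layout concrete, the sketch below builds the two-channel batch described above and checks that the last dimension of each cell is the batch dimension. The names batchSize and obsBatch are introduced here for illustration.

    ```matlab
    % Two observation channels: a 3-by-1 vector and a scalar.
    batchSize = 10;
    obsBatch = {rand(3,1,batchSize), rand(1,1,batchSize)};

    % The last dimension of each cell is the batch dimension.
    sizeChannel1 = size(obsBatch{1});   % [3 1 10]
    sizeChannel2 = size(obsBatch{2});   % [1 1 10]

    % Use the batch as the observations on which Q-values are evaluated.
    tfdOpts = rlTrainingFromDataOptions(QValueObservations=obsBatch);
    ```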

    ScoreAveragingWindowLength: Window length for averaging Q-values, specified as a scalar. One termination option and one saving option are expressed in terms of average Q-values. For these options, the average is calculated over the last ScoreAveragingWindowLength epochs.

    Example: ScoreAveragingWindowLength=10

    StopTrainingCriteria: Training termination condition, specified as one of the following strings:

    • "none" — Stop training after the agent is trained for the number of epochs specified in MaxEpochs.

    • "QValue" — Stop training when the average Q-value (computed using the current critic and the observations specified in QValueObservations) over the last ScoreAveragingWindowLength epochs equals or exceeds the value specified in the StopTrainingValue option.

    Example: StopTrainingCriteria="QValue"

    StopTrainingValue: Critical value of the training termination condition, specified as a scalar. Training ends when the quantity monitored by the StopTrainingCriteria option equals or exceeds this value.

    For instance, if StopTrainingCriteria is "QValue" and StopTrainingValue is 50, then training terminates when the moving average Q-value (computed using the current critic and the observations specified in QValueObservations) over the number of epochs specified in ScoreAveragingWindowLength equals or exceeds 50.

    Example: StopTrainingValue=50
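
    The interaction between StopTrainingCriteria, StopTrainingValue, and ScoreAveragingWindowLength can be sketched as follows. The per-epoch Q-values in qPerEpoch are made up for illustration and are not produced by the toolbox. How the software averages during the first epochs, before a full window is available, is not stated here; this sketch assumes the window simply shrinks, which is what movmean does by default.

    ```matlab
    % Stop when the average Q-value over the last 10 epochs reaches 50.
    tfdOpts = rlTrainingFromDataOptions( ...
        StopTrainingCriteria="QValue", ...
        StopTrainingValue=50, ...
        ScoreAveragingWindowLength=10);

    % Illustration only: hypothetical Q-values for 40 epochs (0, 2, ..., 78).
    qPerEpoch = 0:2:78;

    % Trailing average over the current epoch and the (window-1) previous ones.
    window = tfdOpts.ScoreAveragingWindowLength;
    avgQ = movmean(qPerEpoch,[window-1 0]);

    % First epoch at which the averaged value equals or exceeds the threshold.
    stopEpoch = find(avgQ >= tfdOpts.StopTrainingValue,1);   % 31
    ```

    With these made-up values, the trailing average is 49 at epoch 30 and 51 at epoch 31, so the condition is first met at epoch 31.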

    SaveAgentCriteria: Condition for saving the agent during training, specified as one of the following strings:

    • "none" — Do not save any agents during training.

    • "EpochFrequency" — Save the agent when the number of epochs is an integer multiple of the value specified in the SaveAgentValue option.

    • "QValue" — Save the agent when the average Q-value (computed using the current critic and the observations specified in QValueObservations) over the last ScoreAveragingWindowLength epochs equals or exceeds the value specified in SaveAgentValue.

    Set this option to store candidate agents that perform well in terms of Q-value, or just to save the agent at a fixed rate. For instance, if SaveAgentCriteria is "EpochFrequency" and SaveAgentValue is 5, then the agent is saved every five epochs.

    Example: SaveAgentCriteria="EpochFrequency"

    SaveAgentValue: Critical value of the condition for saving the agent, specified as a scalar.

    Example: SaveAgentValue=10

    SaveAgentDirectory: Folder name for saved agents, specified as a string or character vector. The folder name can contain a full or relative path. When an epoch occurs in which the conditions specified by the SaveAgentCriteria and SaveAgentValue options are satisfied, the software saves the current agent in a MAT-file in this folder. If the folder does not exist, the training function creates it. When SaveAgentCriteria is "none", this option is ignored and no folder is created.

    Example: SaveAgentDirectory = pwd + "\run1\Agents"
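
    The three saving options above work together. The sketch below saves the agent every five epochs into a subfolder of the current folder, then lists which of the first 20 epochs would satisfy the "EpochFrequency" rule as described above. The folder names and the variables epochs and saveEpochs are illustrative, and fullfile is used here instead of a hard-coded separator so that the path is valid on any platform.

    ```matlab
    % Save the agent every 5 epochs in <current folder>/run1/Agents.
    tfdOpts = rlTrainingFromDataOptions( ...
        SaveAgentCriteria="EpochFrequency", ...
        SaveAgentValue=5, ...
        SaveAgentDirectory=fullfile(pwd,"run1","Agents"));

    % Illustration only: epochs that are integer multiples of SaveAgentValue.
    epochs = 1:20;
    saveEpochs = epochs(mod(epochs,tfdOpts.SaveAgentValue) == 0);   % [5 10 15 20]
    ```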

    Verbose: Option to display training progress at the command line, specified as the logical value false (0) or true (1). Set to true to write information from each training epoch to the MATLAB® command line during training.

    Example: Verbose=true

    Plots: Option to display training progress with Reinforcement Learning Training Monitor, specified as "training-progress" or "none". By default, calling train opens Reinforcement Learning Training Monitor, which graphically and numerically displays information about the training progress, such as the reward for each episode, average reward, number of episodes, and total number of steps. For more information, see train. To turn off this display, set this option to "none".

    Example: Plots="none"

    Object Functions

    trainFromData: Train off-policy reinforcement learning agent using existing data

    Examples

    Create an options set to train a reinforcement learning agent offline, from an existing dataset.

    Set the maximum number of epochs to 2000 and the number of steps per epoch to 1000. Do not set any criteria to stop the training before 2000 epochs. Also, display training progress on the command line instead of using Reinforcement Learning Training Monitor.

    tfdOpts = rlTrainingFromDataOptions(...
        MaxEpochs=2000,...
        NumStepsPerEpoch=1000,...
        Verbose=true,...
        Plots="none")
    tfdOpts = 
      rlTrainingFromDataOptions with properties:
    
                                      MaxEpochs: 2000
                               NumStepsPerEpoch: 1000
                ExperienceBufferUpdateFrequency: 1
        NumExperiencesPerExperienceBufferUpdate: []
                             QValueObservations: []
                     ScoreAveragingWindowLength: 5
                           StopTrainingCriteria: "none"
                              StopTrainingValue: "none"
                              SaveAgentCriteria: "none"
                                 SaveAgentValue: "none"
                             SaveAgentDirectory: "savedAgents"
                                        Verbose: 1
                                          Plots: "none"
    
    

    Alternatively, create a default options set and use dot notation to change some of the values.

    trainOpts = rlTrainingFromDataOptions;
    trainOpts.MaxEpochs = 2000;
    trainOpts.NumStepsPerEpoch = 1000;
    trainOpts.Verbose = true;
    trainOpts.Plots = "training-progress";
    
    trainOpts
    trainOpts = 
      rlTrainingFromDataOptions with properties:
    
                                      MaxEpochs: 2000
                               NumStepsPerEpoch: 1000
                ExperienceBufferUpdateFrequency: 1
        NumExperiencesPerExperienceBufferUpdate: []
                             QValueObservations: []
                     ScoreAveragingWindowLength: 5
                           StopTrainingCriteria: "none"
                              StopTrainingValue: "none"
                              SaveAgentCriteria: "none"
                                 SaveAgentValue: "none"
                             SaveAgentDirectory: "savedAgents"
                                        Verbose: 1
                                          Plots: "training-progress"
    
    

    You can now use trainOpts as an input argument to the trainFromData command.

    Version History

    Introduced in R2023a