Train PPO Agent for Automatic Parking Valet
This example demonstrates the design of a hybrid controller for an automatic search and parking task. The hybrid controller uses model predictive control (MPC) to follow a reference path in a parking lot and a trained reinforcement learning (RL) agent to perform the parking maneuver.
The automatic parking algorithm in this example executes a series of maneuvers while simultaneously sensing and avoiding obstacles in tight spaces. It switches between an adaptive MPC controller and an RL agent to complete the parking maneuver. The MPC controller moves the vehicle at a constant speed along a reference path while an algorithm searches for an empty parking spot. When a spot is found, the RL Agent takes over and executes a pretrained parking maneuver. Prior knowledge of the environment (the parking lot) including the locations of the empty spots and parked vehicles is available to the controllers.
The parking lot is represented by the
ParkingLot class, which stores information about the ego vehicle, empty parking spots, and static obstacles (parked cars). Each parking spot has a unique index number and an indicator light that is either green (free) or red (occupied). Parked vehicles are represented in black.
ParkingLot object with a free spot at location 7.
freeSpotIdx = 7; map = ParkingLot(freeSpotIdx);
Specify an initial pose for the ego vehicle. The target pose is determined based on the first available free spot as the vehicle navigates the parking lot.
egoInitialPose = [20, 15, 0];
Compute the target pose for the vehicle using the
createTargetPose function. The target pose corresponds to the location in
egoTargetPose = createTargetPose(map,freeSpotIdx)
egoTargetPose = 1×3 47.7500 4.9000 -1.5708
The parking algorithm uses camera and lidar sensors to gather information from the environment.
The field of view of a camera mounted on the ego vehicle is represented by the area shaded in green in the following figure. The camera has a field of view bounded by and a maximum measurement depth of 10 m.
As the ego vehicle moves forward, the camera module senses the parking spots that fall within the field of view and determines whether a spot is free or occupied. For simplicity, this action is implemented using geometrical relationships between the spot locations and the current vehicle pose. A parking spot is within the camera range if and , where is the distance to the parking spot and is the angle to the parking spot.
The reinforcement learning agent uses lidar sensor readings to determine the proximity of the ego vehicle to other vehicles in the environment. The lidar sensor in this example is also modeled using geometrical relationships. Lidar distances are measured along 12 line segments that radially emerge from the center of the ego vehicle. When a lidar line intersects an obstacle, it returns the distance of the obstacle from the vehicle. The maximum measurable lidar distance along any line segment is 6 m.
Auto Parking Valet Model
The parking valet model, including the controllers, ego vehicle, sensors, and parking lot, is implemented in Simulink®.
Load the auto parking valet parameters.
Open the Simulink model.
mdl = "rlAutoParkingValet"; open_system(mdl)
The ego vehicle dynamics in this model are represented by a single-track bicycle model with two inputs: vehicle speed (m/s) and steering angle (radians). The MPC and RL controllers are placed within Enabled Subsystem blocks that are activated by signals representing whether the vehicle has to search for an empty spot or execute a parking maneuver. The enable signals are determined by the Camera algorithm within the Vehicle Mode subsystem. Initially, the vehicle is in search mode and the MPC controller tracks the reference path. When a free spot is found, park mode is activated and the RL agent executes the parking maneuver.
Adaptive Model Predictive Controller
Create the adaptive MPC controller object for reference trajectory tracking using the
createMPCForParking script. For more information on adaptive MPC, see Adaptive MPC (Model Predictive Control Toolbox).
Reinforcement Learning Environment
The environment for training the RL agent is the region shaded in red in the following figure. Due to symmetry in the parking lot, training within this region is sufficient for the policy to adjust to other regions after applying appropriate coordinate transformations to the observations (these transformations are performed in the Observation and Reward group inside the RL Controller block). Using this smaller training region significantly reduces training time compared to training over the entire parking lot.
For this environment:
The training region is a 22.5 m x 20 m space with the target spot at its horizontal center.
The observations are the position errors and of the ego vehicle with respect to the target pose, the sine and cosine of the true heading angle , and the lidar sensor readings.
The vehicle speed during parking is a constant 2 m/s.
The action signals are discrete steering angles that range between +/- 45 degrees in steps of 15 degrees.
The vehicle is considered parked if the errors with respect to target pose are within specified tolerances of ±0.75 m (position) and ±10 degrees (orientation).
The episode terminates if the ego vehicle goes out of the bounds of the training region, collides with an obstacle, or parks successfully.
The reward provided at time t, is:
Here, , , and are the position and heading angle errors of the ego vehicle from the target pose, and is the steering angle. (0 or 1) indicates whether the vehicle has parked and (0 or 1) indicates if the vehicle has collided with an obstacle at time .
The coordinate transformations on vehicle pose observations for different parking spot locations are as follows:
1-14: no transformation
Create the observation and action specifications for the environment.
numObservations = 16; obsInfo = rlNumericSpec([numObservations 1]); obsInfo.Name = "observations"; steerMax = pi/4; discreteSteerAngles = -steerMax : deg2rad(15) : steerMax; actInfo = rlFiniteSetSpec(num2cell(discreteSteerAngles)); actInfo.Name = "actions"; numActions = numel(actInfo.Elements);
Create the Simulink environment interface, specifying the path to the RL Agent block.
blk = mdl + "/RL Controller/RL Agent"; env = rlSimulinkEnv(mdl,blk,obsInfo,actInfo);
Specify a reset function for training. The
autoParkingValetResetFcn function resets the initial pose of the ego vehicle to random values at the start of each episode.
env.ResetFcn = @autoParkingValetResetFcn;
For more information on creating Simulink environments, see
rlSimulinkEnv (Reinforcement Learning Toolbox).
The RL agent in this example is a proximal policy optimization (PPO) agent with a discrete action space. To learn more about PPO agents, see Proximal Policy Optimization (PPO) Agents (Reinforcement Learning Toolbox).
Set the random seed generator for reproducibility.
PPO agents use a parametrized value function approximator to estimate the value of the policy. A value-function critic takes the current observation as input and returns a single scalar as output (the estimated discounted cumulative long-term reward for following the policy from the state corresponding to the current observation).
To model the parametrized value function within the critic, use a neural network with one input layer (which receives the content of the observation channel, as specified by
obsInfo) and one output layer (which returns the scalar value).
Define the network as an array of layer objects.
criticNet = [ featureInputLayer(numObservations) fullyConnectedLayer(128) reluLayer fullyConnectedLayer(128) reluLayer fullyConnectedLayer(128) reluLayer fullyConnectedLayer(1) ];
Convert the network to a
dlnetwork object, and display the number of parameters.
criticNet = dlnetwork(criticNet); summary(criticNet)
Initialized: true Number of learnables: 35.3k Inputs: 1 'input' 16 features
Create the critic for the PPO agent. For more information, see
rlValueFunction (Reinforcement Learning Toolbox) and
critic = rlValueFunction(criticNet,obsInfo);
Policy gradient agents use a parametrized stochastic policy, which for discrete action spaces is implemented by a discrete categorical actor. This actor takes an observation as input and returns as output a random action sampled (among the finite number of possible actions) from a categorical probability distribution.
To model the parametrized policy within the actor, use a neural network with one input layer (which receives the content of the environment observation channel, as specified by
obsInfo) and one output layer. The output layer must return a vector of probabilities for each possible action, as specified by
Define the network as an array of layer objects
actorNet = [ featureInputLayer(numObservations) fullyConnectedLayer(128) reluLayer fullyConnectedLayer(128) reluLayer fullyConnectedLayer(numActions) softmaxLayer ];
Convert the network to a
dlnetwork object, and display the number of parameters.
actorNet = dlnetwork(actorNet); summary(actorNet)
Initialized: true Number of learnables: 19.5k Inputs: 1 'input' 16 features
Create a discrete categorical actor for the PPO agent. For more information, see
rlDiscreteCategoricalActor (Reinforcement Learning Toolbox).
actor = rlDiscreteCategoricalActor(actorNet,obsInfo,actInfo);
Specify training options for the critic and the actor using
rlOptimizerOptions (Reinforcement Learning Toolbox).
criticOptions = rlOptimizerOptions( ... LearnRate=1e-3, ... GradientThreshold=1); actorOptions = rlOptimizerOptions( ... LearnRate=2e-4, ... GradientThreshold=1);
Specify the agent options using
rlPPOAgentOptions (Reinforcement Learning Toolbox), include the options for the actor and the critic.
agentOpts = rlPPOAgentOptions(... SampleTime=Ts,... ActorOptimizerOptions=actorOptions,... CriticOptimizerOptions=criticOptions,... ExperienceHorizon=200,... ClipFactor=0.2,... EntropyLossWeight=0.01,... MiniBatchSize=64,... NumEpoch=3,... AdvantageEstimateMethod="gae",... GAEFactor=0.95,... DiscountFactor=0.998);
According to these training options, the agent collects experiences until it reaches experience horizon of 200 steps or the episode terminates and then trains from mini-batches of 64 experiences for three epochs. An objective function clip factor of 0.2 improves training stability and a discount factor value of 0.998 encourages long term rewards. Variance in critic the output is reduced by the generalized advantage estimate method with a GAE factor of 0.95.
Create the agent using the actor, the critic, and the agent options object.
agent = rlPPOAgent(actor,critic,agentOpts);
For this example, you train the agent for a maximum of 10000 episodes, with each episode lasting a maximum of 200 time steps. The training terminates when the maximum number of episodes is reached or the average reward over 100 episodes exceeds 100.
Specify the options for training using an
trainOpts = rlTrainingOptions(... MaxEpisodes=10000,... MaxStepsPerEpisode=200,... ScoreAveragingWindowLength=200,... Plots="training-progress",... StopTrainingCriteria="AverageReward",... StopTrainingValue=80);
Train the agent using the
train (Reinforcement Learning Toolbox) function. Training this agent is a computationally intensive process that takes several minutes to complete. To save time while running this example, load a pretrained agent by setting
false. To train the agent yourself, set
doTraining = false; if doTraining trainingStats = train(agent,env,trainOpts); else load('rlAutoParkingValetAgent.mat','agent'); end
Simulate the model to park the vehicle in the free parking spot. To simulate the vehicle parking in different locations, change the free spot location in the following code.
freeSpotIdx = 7; % free spot location sim(mdl);
The vehicle reaches the target pose within the specified error tolerances of +/- 0.75 m (position) and +/-10 degrees (orientation).
To view the ego vehicle position and orientation, open the Ego Vehicle Pose scope.
train (Reinforcement Learning Toolbox)