Suddenly the reward value goes wrong

Question

ryunosuke tazawa on 25 Mar 2022


We are using reinforcement learning to train a two-joint controller.

The reinforcement learning method is soft actor-critic (SAC).

The reward is the negative of the error between the result of the agent's action and the target value. In other words, the maximum reward the agent can get is 0.

At the beginning of training it seems to work well, but toward the end of training an abnormal reward value is suddenly output.

What is the reason? In the simulation, the joint controller throws a ball, and the reward is based on the error between the target point and the flight distance of the ball, which is computed by the function below.

The reward takes a value that should be impossible. Is this a learning problem, or is it some other issue?

function s = Ayameperformance_func(Angle1, Angle2, V2)
    X1 = sin(Angle1);                       % X coordinate of joint 1
    Y1 = -cos(Angle1);                      % Y coordinate of joint 1
    X2 = X1 + sin(Angle1 + Angle2);         % X coordinate of joint 2
    Y2 = Y1 - cos(Angle1 + Angle2);         % Y coordinate of joint 2
    Ball_V = abs(V2 * cos(Angle2));         % ball's initial velocity (non-negative)
    Ball_Time = sqrt(2 * abs(Y2) / 9.8);    % time for ball flight
    s = X2 + Ball_V * Ball_Time;            % ball distance
end
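
For concreteness, here is a minimal sketch of how the reward described above could be computed from this distance function. The names `throwReward` and `ballDistance`, the target value, and the example inputs are hypothetical and are not taken from the actual model; `ballDistance` just repeats the formula above so that the script runs on its own. The sketch also warns when any input or output is NaN or Inf, which is one way to check whether an impossible reward comes from the reward calculation itself.

```matlab
% Sketch only: target, the example inputs, and the function names are hypothetical.
target = 1.5;                                  % assumed target landing point
[r, s] = throwReward(pi/4, pi/6, 3, target);   % assumed joint angles and velocity
fprintf('distance = %.4f, reward = %.4f\n', s, r);

function [reward, s] = throwReward(Angle1, Angle2, V2, target)
    % Reward is the negative absolute distance error, so its maximum is 0.
    s = ballDistance(Angle1, Angle2, V2);
    reward = -abs(s - target);
    % Warn if any input or output is NaN or Inf.
    if ~all(isfinite([Angle1, Angle2, V2, s, reward]))
        warning('throwReward:nonFinite', ...
            'Non-finite value: s = %g, reward = %g', s, reward);
    end
end

function s = ballDistance(Angle1, Angle2, V2)
    % Same formula as Ayameperformance_func.
    X1 = sin(Angle1);
    Y1 = -cos(Angle1);
    X2 = X1 + sin(Angle1 + Angle2);
    Y2 = Y1 - cos(Angle1 + Angle2);
    Ball_V = abs(V2 * cos(Angle2));
    Ball_Time = sqrt(2 * abs(Y2) / 9.8);
    s = X2 + Ball_V * Ball_Time;
end
```

Because the distance grows linearly with the magnitude of V2 in this formula, a very large V2 gives a finite but very large negative reward, so the warning above only catches NaN and Inf, not large finite values.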

%% agent options
agentOptions = rlSACAgentOptions;
agentOptions.SampleTime = Ts;
agentOptions.DiscountFactor = 0.90;
agentOptions.TargetSmoothFactor = 1e-3;
agentOptions.ExperienceBufferLength = 1e6;
agentOptions.MiniBatchSize = 64;
agentOptions.EntropyWeightOptions.TargetEntropy = -3;
agentOptions.NumStepsToLookAhead = 1;
agentOptions.ResetExperienceBufferBeforeTraining = false;
agentOptions.TargetUpdateFrequency = 1;
agentOptions.NumGradientStepsPerUpdate = 1;
agentOptions.NumWarmStartSteps = 64;
agent = rlSACAgent(actor,[critic1 critic2],agentOptions);
 
 
%% training options
maxepisodes = 15000;
maxsteps = 1e6;
trainingOptions = rlTrainingOptions(...
    'MaxEpisodes',maxepisodes,...
    'MaxStepsPerEpisode',maxsteps,...
    'StopOnError','on',...
    'Verbose',true,...
    'Plots','training-progress',...
    'StopTrainingCriteria','AverageReward',...
    'StopTrainingValue',Inf,...
    'ScoreAveragingWindowLength',10);
trainingOptions.UseParallel = true;
trainingOptions.ParallelizationOptions.Mode = 'sync';
trainingOptions.ParallelizationOptions.StepsUntilDataIsSent = 50;
trainingOptions.ParallelizationOptions.DataToSendFromWorkers = 'Experiences';