Reward in Reinforcement Learning Designer App not matching actual reward

Views: 3 (last 30 days)
Julian Hoffmann on 8 October 2021
Answered: Meet on 18 November 2024 at 6:05
Hi,
I use the MATLAB RL Toolbox to solve a graph-theory problem. In short, it's about finding the best order in which to let cars pass a crossing, assuming they drive autonomously and we know all the relevant parameters (where they come from, where they want to go, their speed). From that we built an adjacency matrix, and now we want to find the order with the lowest total time.
I made an environment using rlFunctionEnv, including working step and reset functions. If a restriction was broken, i.e. if a car that is currently not first in its line was chosen, or if a car was chosen more than once, I gave penalties (negative rewards). Otherwise, I gave a positive reward depending on the value in the adjacency matrix. And if all cars were put into an order without any car being taken more than once, there was a big reward for success at the end.
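For reference, an rlFunctionEnv step function in R2021a has the signature [NextObs,Reward,IsDone,LoggedSignals] = stepFcn(Action,LoggedSignals). Below is a minimal sketch of the kind of reward logic described above; every name (carStepFcn, Picked, Adj, isFirstInLine) and every reward value is hypothetical, not the actual project code.

function [nextObs, reward, isDone, loggedSignals] = carStepFcn(action, loggedSignals)
% Hypothetical state: loggedSignals.Picked = cars chosen so far,
% loggedSignals.Adj = adjacency matrix of transition times.
picked = loggedSignals.Picked;
A      = loggedSignals.Adj;
nCars  = size(A, 1);

if ismember(action, picked) || ~isFirstInLine(action, picked)
    reward = -10;                              % penalty: a restriction was broken
else
    if isempty(picked)
        reward = 0;                            % first car: no transition time yet
    else
        reward = 10 / A(picked(end), action);  % positive reward from the adjacency value
    end
    picked(end+1) = action;
end

isDone = numel(picked) == nCars;               % episode ends once every car is ordered
if isDone
    reward = reward + 100;                     % big success reward at the end
end

loggedSignals.Picked = picked;
nextObs = double(ismember(1:nCars, picked))';  % hypothetical observation encoding
end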
Then I used the Reinforcement Learning Designer App to create an agent (using DQN).
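The environment itself can be constructed from the step and reset functions and then imported into the app; a sketch, where obsInfo and actInfo stand in for the project's actual observation and action specifications:

% Build the custom environment, then open the app and use
% Import > Environment to load it. 'carResetFcn' is a hypothetical
% reset-function name matching the sketch above.
env = rlFunctionEnv(obsInfo, actInfo, 'carStepFcn', 'carResetFcn');
reinforcementLearningDesigner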
  1. My first problem is that the rewards apparently are not fully transmitted. When I check the dashboard in the app, it always shows one step less per episode than it should (e.g. if there are 6 cars, it shows only 5 steps per episode), and the rewards are not the ones I told the code to calculate. When I save the reward manually to a CSV file at the end of an episode inside the environment code, it is shown correctly there. So the code seems to work; it just doesn't behave the same in the app. The reward of one step is always missing. I think that's a big problem, because the reward shown there is presumably what the agent's learning is ultimately based on. So my question is: does anybody know why this happens and how to solve it? (A diagnostic sketch follows after this list.)
  2. Another problem is that training converges after some time, though not to the highest values but to almost the worst ones, e.g. to a very negative overall reward because the same car is chosen every time. I tried all kinds of variations of epsilon (decay) and learning rate. How can that be solved?
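One way to cross-check what the toolbox actually records, independent of the app, is to simulate a single episode and inspect the returned experience; a sketch using the env and nCars names from above, with agent standing in for whatever agent was trained:

% Hypothetical diagnostic: run one episode outside the app and compare the
% recorded per-step rewards with the values logged to the CSV file.
simOpts    = rlSimulationOptions('MaxSteps', nCars);
experience = sim(env, agent, simOpts);
experience.Reward.Data       % per-step rewards as the toolbox sees them
sum(experience.Reward.Data)  % episode total, to compare against the CSV log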
I hope my problems are clear. I'm especially interested in ideas for the first one, since it seems to be a problem specific to the Reinforcement Learning Designer App for which I can't find hints anywhere else.
Thanks a lot in advance! If something wasn't clear, just ask :)

Answers (1)

Meet on 18 November 2024 at 6:05
Hi Julian,
For the issue with the dashboard showing one step less, you might try the following:
  1. Ensure that the environment object you have created is correctly registered and used within the RL Designer App.
  2. Check that the termination condition in your step function is set correctly, so that all cars can be processed before an episode ends (see the sketch below).
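For instance, if the done flag fires one pick too early, the trainer never executes the final step, so its reward never shows up; a hypothetical illustration using the picked/nCars names from the sketch above:

% Suspect pattern: the episode is flagged done before the last car is picked,
% so the final step and its reward never reach the training dashboard.
isDone = numel(picked) >= nCars - 1;   % ends after 5 of 6 cars

% Intended pattern: flag done only once every car has been ordered.
isDone = numel(picked) == nCars;       % ends after all 6 cars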
Regarding the convergence to suboptimal values, consider these suggestions:
  1. Check that your reward scheme is encouraging the desired behavior. If the penalties for incorrect actions are too severe or if the rewards for correct actions are not sufficiently encouraging, the agent might learn a suboptimal policy. You might need to adjust the reward values to better guide the learning process.
  2. Ensure that your training duration is long enough for the agent to explore the state space adequately; a sketch of typical exploration and training settings follows below.
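As a starting point, exploration and training settings along these lines often help a DQN agent get out of a repeated-action policy; the values are illustrative, not tuned for this problem, and the same parameters should also be adjustable in the app's agent and training options panes:

% Illustrative DQN exploration settings: start fully exploratory, decay
% slowly, and never let exploration drop to zero.
agentOpts = rlDQNAgentOptions('UseDoubleDQN', true, 'DiscountFactor', 0.99);
agentOpts.EpsilonGreedyExploration.Epsilon      = 1.0;
agentOpts.EpsilonGreedyExploration.EpsilonMin   = 0.05;
agentOpts.EpsilonGreedyExploration.EpsilonDecay = 0.001;

% Illustrative training budget: enough episodes to sample many orderings,
% with one step per car pick (nCars as above).
trainOpts = rlTrainingOptions( ...
    'MaxEpisodes', 5000, ...
    'MaxStepsPerEpisode', nCars, ...
    'StopTrainingCriteria', 'AverageReward', ...
    'StopTrainingValue', 500);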
Hope this helps!

Release

R2021a
