Reinforcement Learning -- Rocket Lander

Question

0 개 추천

Th "Rocket Lander" example does not converge with the stated hyperparameters. Someone was helpful enough to give me the following values:

learning rate = 1e-4

clip factor = 0.1

mini-batch size = 128

Although these values work better, the algorithm still does not converge. After about 14,000 episodes there are many successful landings, but they are interspersed with violent crash landings. Anybody at MathWorks or otherwise have any suggestions? Thank you.

Averill M. LAW

댓글 수: 7
이전 댓글 5개 표시 이전 댓글 5개 숨기기

Averill Law 2020년 5월 26일

Hi Emmanouil,

Presumably, the value of epsilon is decayed during an episode according to the formula in the documentation. Is it decayed after each state transition? What happens to epsilon at the beginning of a new episode? Thank you very much for your assistance.

Best regards,

Averill M. Law

Emmanouil Tzorakoleftherakis 2020년 5월 27일

Hi Averill,

I am not sure why you do not get convergence, but comparing the screenshot you sent with the one that is in the live script I sent, you can clearly see that the Episode rewards are on a different scale (~7000 vs ~300-400). I would suggest to start fresh by deleting temp files and download and run the example I sent from below. You shouldn't need to change the clip factor or any other hyperparameter in the example below.

The reason we made changes to the example is that some of the latest underlying optimizations changed the numerical behavior of training (which is why the example was not converging), so we made these changes to get a more robust result. The reward is typically the most important thing you need to get right to be able to get the desired behavior. This is most of the time the first thing that needs to be retuned if you don't get the right behavior.

In terms of epsilon, I think you may be confusing epsilon greedy exploration, which is used e.g. in DQN and Q learning with the clip rate epsilon in PPO (please correct me if I am wrong). The former does indeed change with time in the current implementation but the latter is fixed. They share the same letter so it may be confusing but the two hyperparameters serve very different purposes. PPO does not use the "exploration epsilon" because it handles exploration through the stochastic nature of the actor as well as through an additional entropy term in the objective. PPO uses 'clip factor epsilon' to control the amount of change that happens to the neural network weights.

Hope that helps.

댓글을 달려면 로그인하십시오.

이 질문에 답변하려면 로그인하십시오.

Follow Question

Answer 1

Emmanouil Tzorakoleftherakis 2020년 5월 20일

0 개 추천

RocketLanderV2.zip

Hi Averill,

Here is a version that converges in ~18-20k episodes - thank you for pointing out that this example was not converging properly. This version will also be included in a few weeks in the next R2020a update. We changed some of the hyperparameters as well as the reward signal.

For the stochastic grid world I don't think we have a published example (if I recall correctly). If you used the basic grid world example as reference, you will likely need to make some changes over there as well.

Hope that helps

댓글 수: 0
이전 댓글 -2개 표시 이전 댓글 -2개 숨기기

댓글을 달려면 로그인하십시오.

Answer 2

Averill Law 2020년 5월 22일

0 개 추천

Hi Emmanouil,

You are absolutely right about the "Waterfall Grid World" examples. There is a basic discussion but not complete programs. I got the Deterministic version to converge, but not the Stochastic version.

I look forward to hearing from you further on the "Rocket Lander" example, as per my comments earlier today. Using clip rate = 0.125 I got convergence in 13,065 episodes. Thank you.

Averill M. Law

댓글 수: 0
이전 댓글 -2개 표시 이전 댓글 -2개 숨기기

댓글을 달려면 로그인하십시오.

Answer 3

Averill Law 2020년 6월 1일

0 개 추천

Hi Emmanouil,

Your new version of the "Rocket Lander" example does not work on my computer. At line 25 I get the error message:

Unrecognized function or variable 'RLValueRepresentation'

I do not know how to delete Temp files.

I was, in fact, interested in the epsilon for an epsilon-greedy policy when using Q-learning. As episode 1 ends and episode 2 begins, does the value of epsilon continue to be decayed? Is it correct that the state S is NOT reinitialized at this time?

Thank you very much for your assistance.

Averill M. Law

댓글 수: 7
이전 댓글 5개 표시 이전 댓글 5개 숨기기

Averill Law 2020년 6월 3일

Hi Emmanouil,

If I understand correctly, you will call me at 520-795-6265 on Thursday at 11AM Arizona time.

I'm totally amazed that you want your whole team to discuss the convergence of the Rocket Lander example with me. Thank you.

Averill M. Law

520-795-6265

Emmanouil Tzorakoleftherakis 2020년 6월 3일

Of course. Talk to you tomorrow,

Emmanouil

댓글을 달려면 로그인하십시오.

Reinforcement Learning -- Rocket Lander

댓글 수: 7
이전 댓글 5개 표시 이전 댓글 5개 숨기기

채택된 답변

댓글 수: 0
이전 댓글 -2개 표시 이전 댓글 -2개 숨기기

추가 답변 (2개)

댓글 수: 0
이전 댓글 -2개 표시 이전 댓글 -2개 숨기기

댓글 수: 7
이전 댓글 5개 표시 이전 댓글 5개 숨기기

카테고리

제품

릴리스

태그

Community Treasure Hunt

Reinforcement Learning -- Rocket Lander

댓글 수: 7 이전 댓글 5개 표시 이전 댓글 5개 숨기기

채택된 답변

댓글 수: 0 이전 댓글 -2개 표시 이전 댓글 -2개 숨기기

추가 답변 (2개)

댓글 수: 0 이전 댓글 -2개 표시 이전 댓글 -2개 숨기기

댓글 수: 7 이전 댓글 5개 표시 이전 댓글 5개 숨기기

카테고리

제품

릴리스

태그

참고 항목

Community Treasure Hunt

댓글 수: 7
이전 댓글 5개 표시 이전 댓글 5개 숨기기

댓글 수: 0
이전 댓글 -2개 표시 이전 댓글 -2개 숨기기

댓글 수: 0
이전 댓글 -2개 표시 이전 댓글 -2개 숨기기

댓글 수: 7
이전 댓글 5개 표시 이전 댓글 5개 숨기기