Epsilon-greedy Algorithm in RL DQN

ches on 30 Nov 2020
Commented: Cecilia S. on 9 Jun 2021
Hello,
I'm currently training a DQN agent for my RL problem. As the training progresses, I can see that the episode reward, running average, and Q0 converge to (approximately) the same value, which is a good sign. However, I am uncertain whether it has actually found the optimal policy or has just gotten stuck in a local minimum.
With this in mind, I have the following questions about exploration using the epsilon-greedy algorithm (whose parameters are configurable in rlDQNAgentOptions; a snippet of the relevant options follows my questions for reference).
1. Does the epsilon decay every time step and continuously over all episodes (meaning, it does not reset to epsilon-max at the start of every new episode)?
2. Do the number of time steps per episode and the total number of episodes have a direct impact on the exploration process? Or are there other parameters which affect exploration besides the epsilon parameters?
3. How is the Q0 estimate calculated? Is it solely based on the output of my DNN policy representation?
4. How is the episode reward calculated? My understanding is that it is simply the sum of the actual rewards over all time steps within an episode.
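For reference, these are the exploration options I am referring to (the numbers below are placeholders, not my actual settings):

agentOpts = rlDQNAgentOptions;
agentOpts.EpsilonGreedyExploration.Epsilon      = 1.0;   % initial epsilon (epsilon-max)
agentOpts.EpsilonGreedyExploration.EpsilonMin   = 0.01;  % lower bound on epsilon
agentOpts.EpsilonGreedyExploration.EpsilonDecay = 0.005; % per-step decay rate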
Thank you in advance for your help! :)

Accepted Answer

Emmanouil Tzorakoleftherakis on 30 Nov 2020
Hello,
First off, RL typically solves a complex nonlinear optimization problem, so at the end of the day you will almost certainly not get a global solution, but a local one. The question then becomes how good that local solution is compared to the alternatives.
Some comments to your questions:
1. Correct, I believe epsilon does not reset at the start of a new episode; it keeps decaying across episodes (see the first sketch below this list).
2. Exploration for DQN in Reinforcement Learning Toolbox is primarily determined by the epsilon parameters. Of course, given that this is still a trial-and-error method, the number of steps per episode and the total number of episodes can affect how well the agent learns, but you don't have much direct control over that. For example, if during an episode the agent is exploring a critical part of the state space, you don't want to hit the maximum number of steps and have the episode terminate at that point. What you can do is reset and randomize the environment state between episodes so that the agent gets to explore different parts of the state space.
3. I believe so. It uses the observation values at the beginning of the episode to estimate how much long-term reward that initial state can yield (see the second sketch at the end of this answer).
4. Correct. There is a distinction between the reward you see in the Episode Manager and the return used by the DQN algorithm, in that the latter also considers the discount factor.
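To illustrate points 1 and 2, here is a rough, self-contained sketch of how I understand the epsilon schedule to behave: epsilon shrinks by a factor of (1 - EpsilonDecay) at every time step until it reaches EpsilonMin, and it carries over from one episode to the next. The numbers are made up and this is not code from the toolbox itself:

% Sketch of the per-step epsilon schedule (made-up values)
epsilon      = 1.0;    % EpsilonGreedyExploration.Epsilon
epsilonMin   = 0.01;   % EpsilonGreedyExploration.EpsilonMin
epsilonDecay = 0.005;  % EpsilonGreedyExploration.EpsilonDecay

numEpisodes     = 10;
stepsPerEpisode = 100;
epsHistory = zeros(1, numEpisodes*stepsPerEpisode);

k = 0;
for ep = 1:numEpisodes
    % note: epsilon is deliberately NOT reset here at the episode boundary
    for t = 1:stepsPerEpisode
        k = k + 1;
        epsHistory(k) = epsilon;
        if epsilon > epsilonMin
            epsilon = epsilon*(1 - epsilonDecay);   % decays every time step
        end
    end
end
plot(epsHistory), xlabel('global time step'), ylabel('epsilon')

The plot should show one smooth decay curve with no jumps at the episode boundaries, which is the behavior described in point 1.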
Hope that helps
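To make points 3 and 4 a bit more concrete, here is a minimal sketch of the two reward quantities; the numbers and variable names are made up for illustration and do not come from your setup:

% Toy numbers only -- replace with your own logged data
gamma   = 0.99;                 % agent's DiscountFactor
rewards = [1 0.5 -0.2 2 1.5];   % per-step rewards collected over one episode

episodeReward    = sum(rewards);                                  % what the Episode Manager plots
discountedReturn = sum(gamma.^(0:numel(rewards)-1) .* rewards);   % what the DQN update works toward

% For Q0, the toolbox evaluates the critic at the first observation of the
% episode; conceptually something like (requires a trained critic and obs0):
%   q  = getValue(critic, {obs0});   % Q-value estimates at the initial state
%   Q0 = max(q);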
4 Comments
ches on 3 Dec 2020
Hi,
Thanks again for your explanation. It definitely cleared my doubts. I will follow your advice with the exploration settings. Thank you! :)
Cecilia S. on 9 Jun 2021
Hello, I have a question concerning the episode reward:
Why does the episode reward not consider the discount factor? How is it calculated exactly? I have a system that has produced some VERY negative episode rewards that I cannot account for, and I would like more information on this.


More Answers (0)
