Exploration in Deep Reinforcement Learning
I am trying to reimplement the REINFORCE algorithm with a custom training loop for a specific problem. To the best of my knowledge, I have not come across any exploration technique in the given example. How do I implement exploration/exploitation in the case of neural networks?
Answers (1)
Ayush Aniket
6 May 2025
Edited: Ayush Aniket on 6 May 2025
In the provided example, explicit exploration techniques are not directly implemented. However, exploration is inherently handled by the policy architecture and its training process, as explained below:
1. Stochastic Policy (Softmax Output) - The actor network ends with a softmaxLayer, producing a categorical probability distribution over actions for each state, which is wrapped into a policy using the rlStochasticActorPolicy function. At each step, the action is sampled from this probability distribution. This means:
- Actions with higher probabilities are more likely to be chosen (exploitation).
- Less probable actions can still be chosen (exploration).
This sampling mechanism is the main source of exploration in REINFORCE and policy gradient methods.
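For illustration, here is a minimal, hypothetical sketch of such a sampling-based policy. The observation/action specifications, network sizes, and variable names below are made up for the sketch and are not taken from the example:
% Hypothetical setup: 4-dimensional observation, 2 discrete actions.
obsInfo = rlNumericSpec([4 1]);
actInfo = rlFiniteSetSpec([1 2]);
% Small actor network ending in a softmaxLayer (categorical distribution over actions).
net = dlnetwork([
    featureInputLayer(4)
    fullyConnectedLayer(16)
    reluLayer
    fullyConnectedLayer(2)
    softmaxLayer]);
actor  = rlDiscreteCategoricalActor(net,obsInfo,actInfo);
policy = rlStochasticActorPolicy(actor);   % UseMaxLikelihoodAction is false by default
% Sampling the same observation several times can return different actions
% (exploration), with frequencies that follow the softmax probabilities.
obs = {rand(4,1)};
for i = 1:5
    act = getAction(policy,obs);
    disp(act{1})
end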
2. Entropy Regularization - In the custom loss function (actorLossFunction), an entropy loss term is added:
entropyLoss = -sum(actionProbabilities.*actionLogProbabilities.*mask, "all");
loss = (pgLoss + 1e-4*entropyLoss)/(sum(mask));
This term encourages the policy to maintain uncertainty (higher entropy) in its action distribution, which naturally promotes exploration, especially early in training. As training progresses, the policy becomes more confident (lower entropy) as it learns which actions yield higher rewards (exploitation).
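To see why this term matters, here is a small standalone illustration (not the example's loss function): the entropy of the action distribution is largest when the policy is uniform over actions and shrinks as the policy becomes confident.
% Entropy H(p) = -sum(p .* log(p)) of a categorical action distribution.
pUniform   = [0.25 0.25 0.25 0.25];   % maximally exploratory policy
pConfident = [0.97 0.01 0.01 0.01];   % nearly deterministic policy
H = @(p) -sum(p .* log(p));
fprintf("Uniform policy entropy:   %.3f\n", H(pUniform))     % ~1.386
fprintf("Confident policy entropy: %.3f\n", H(pConfident))   % ~0.168
The 1e-4 coefficient scales how strongly this term influences the overall loss, which is the knob referred to in the options further below.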
3. Switch to Exploitation After Training - During simulation, the policy parameter is set:
policy.UseMaxLikelihoodAction = true;
This means the policy always selects the most likely (greedy) action—pure exploitation for evaluation.
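Continuing the hypothetical sketch from point 1 (this snippet assumes the policy variable from that sketch, or the trained policy from the example):
% Greedy evaluation: always pick the highest-probability action.
policy.UseMaxLikelihoodAction = true;
obs = {rand(4,1)};
actGreedy = getAction(policy,obs);     % same action every time for the same observation
policy.UseMaxLikelihoodAction = false; % switch back to sampling (exploration)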
If you want to further control exploration/exploitation, here are some options:
- Increase entropy regularization (increase coefficient from 1e-4 to a higher value) to promote more exploration.
- Decrease entropy regularization for more exploitation.
- Implement ε-greedy - With probability ε, select a random action; with probability 1-ε, sample from the policy.
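For the last option, a rough sketch of an epsilon-greedy wrapper around the sampled policy could look like the following. The epsilon value and the policy/actInfo variables are placeholders for your own setup (see the sketch in point 1), not part of the original example:
% Hypothetical epsilon-greedy action selection on top of a stochastic policy.
epsilon = 0.1;                          % exploration probability
obs = {rand(4,1)};
if rand < epsilon
    % Explore: pick uniformly at random from the discrete action set.
    actions = actInfo.Elements;
    act = actions(randi(numel(actions)));
else
    % Exploit/sample: let the policy choose the action.
    actCell = getAction(policy,obs);
    act = actCell{1};
end
In practice, epsilon is often decayed over the course of training so that behavior shifts from exploration toward exploitation.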