To choose an action, is it correct to compute the value of the successor state, or do we need to compute the values of the states along the entire path to the end state?

While selecting an action, the action whose Q(s,a) is maximum is chosen. Q(s,a) is the sum of the immediate reward and the discounted value of the next state.
When computing the best action from a state, do I need to keep computing (iterating) the values of successor states along the path until the end state, or is it enough to compute the value of the immediate successor state alone and choose the action that leads to the state with the maximum value?
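For concreteness, this is the relation I have in mind (just a sketch of my understanding, written assuming the same policy is followed afterwards):

Q(s, a) = r(s, a) + \gamma V(s'), \qquad V(s') = \max_{a'} Q(s', a')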

Accepted Answer

Emmanouil Tzorakoleftherakis on 6 Jul 2020
Hi Gowri,
The Q value of a state-action pair encodes all the information up to 'the end of the path', weighted by the discount factor (assuming you keep following the same policy).
So, assuming you have a critic that approximates the Q function reasonably well, you shouldn't need to check the Q values of successor states.
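For example, here is a minimal sketch of extracting the greedy action from a learned Q function stored as a table (the variable names Qtable, s, and bestAction are hypothetical and not tied to any particular toolbox):

% Qtable is a (numStates x numActions) array of learned Q values (assumed given).
% Greedy policy extraction: pick the action with the largest Q value for the
% current state s only -- no lookahead over successor states is needed.
[~, bestAction] = max(Qtable(s, :));   % argmax over the actions available in state s

The values of all later states are already folded into Q(s,a) through the discount factor, so a single argmax over the current row is enough.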
3 Comments
Emmanouil Tzorakoleftherakis on 6 Jul 2020
If the approximation of the Q function is relatively accurate (whether that's through a table, a neural network, a polynomial, or something else), then yes, looking at the Q value of the current state/action pair should be sufficient when you are trying to 'extract' the policy.
In fact, if you look at vanilla DQN, even during training the Bellman equation only looks one step ahead. I am not saying that n-step learning is not an option, but you certainly don't need all subsequent Q values.
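As a reference point, here is a minimal sketch of that one-step target in its tabular Q-learning form (the same one-step bootstrap structure that DQN uses; the variable names Qtable, sNext, gamma, alpha, etc. are hypothetical):

% One-step Bellman target: bootstrap from the best next action only.
% Assumed variables: Qtable (numStates x numActions), current state s, action a,
% reward r, next state sNext, flag isTerminal, discount gamma, learning rate alpha.
if isTerminal
    target = r;                                  % no bootstrapping at terminal states
else
    target = r + gamma * max(Qtable(sNext, :));  % look one step ahead, nothing further
end
Qtable(s, a) = Qtable(s, a) + alpha * (target - Qtable(s, a));   % move the estimate toward the target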


More Answers (0)
