For the following action-selection method, indicate which option describes it best.
With probability p, select argmax_a Q(s,a). With probability 1 − p, select a random action. Let p = 0.99.
Select action a with probability
P(a | s) = exp(Q(s,a)/τ) / Σ_a' exp(Q(s,a')/τ)
where τ is a temperature parameter that is decreased over time.
Always select a random action.
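To compare the three methods side by side, here is a minimal Python sketch. It takes the temperature-based option to be the standard Boltzmann (softmax) rule over Q-values. The function names and the dictionary representation of Q(s, ·) are assumptions made for illustration and are not part of the exercise.

```python
import math
import random

def greedy_with_prob(q_values, p, rng=random):
    """With probability p pick argmax_a Q(s,a); otherwise pick uniformly at random."""
    actions = list(q_values)
    if rng.random() < p:
        return max(actions, key=lambda a: q_values[a])
    return rng.choice(actions)

def softmax_probs(q_values, tau):
    """Boltzmann probabilities exp(Q/tau) / sum_a' exp(Q(a')/tau).

    Subtracting max Q before exponentiating leaves the probabilities
    unchanged and avoids overflow when tau is small.
    """
    q_max = max(q_values.values())
    weights = {a: math.exp((q - q_max) / tau) for a, q in q_values.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}

def uniform_random(q_values, rng=random):
    """Ignore the Q-values entirely and pick any action."""
    return rng.choice(list(q_values))

# Hypothetical Q-values for a single state (illustrative only).
q = {"north": 1.0, "south": 2.0}
for tau in (1000.0, 1.0, 0.01):
    print(tau, softmax_probs(q, tau))
```

With a large τ the softmax probabilities are close to uniform, and as τ shrinks they concentrate on the highest-valued action, so the amount of exploration falls over time. The first method with p = 0.99 explores only 1% of the time from the start, and the last method never uses what it has learned.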
What model would be learned from the above observed episodes?
Note: T(s,a,s’) represents the transition probability from state s to state s’ under action a.
Use a period as a decimal separator.
T(A, south, C) =
T(B, east, C) =
T(C, south, E) =
T(C, south, D) =
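One common way to obtain such a model, assuming the intended answer is the empirical (count-based) estimate, is to divide the number of times each (s, a, s') transition was observed by the number of times (s, a) was tried. The sketch below uses made-up episodes with placeholder states X, Y, Z, W. They are not the episodes from this exercise, and the tuple layout (s, a, s', reward) is an assumption.

```python
from collections import Counter, defaultdict

def estimate_transitions(episodes):
    """Return {(s, a, s_next): count(s, a, s_next) / count(s, a)}."""
    counts = defaultdict(Counter)
    for episode in episodes:
        for s, a, s_next, _reward in episode:
            counts[(s, a)][s_next] += 1
    model = {}
    for (s, a), next_counts in counts.items():
        total = sum(next_counts.values())
        for s_next, n in next_counts.items():
            model[(s, a, s_next)] = n / total
    return model

# Hypothetical episodes (NOT the exercise's data): lists of (s, a, s', reward).
toy_episodes = [
    [("X", "east", "Y", 0), ("Y", "east", "Z", 1)],
    [("X", "east", "Y", 0), ("Y", "east", "Z", 1)],
    [("X", "east", "Y", 0), ("Y", "east", "W", -1)],
    [("X", "east", "Y", 0), ("Y", "east", "Z", 1)],
]

model = estimate_transitions(toy_episodes)
for (s, a, s_next), prob in sorted(model.items()):
    print(f"T({s}, {a}, {s_next}) = {prob}")
```

In this toy data, (Y, east) was tried four times and led to Z three times, so the estimate for T(Y, east, Z) is 0.75 and for T(Y, east, W) is 0.25. Transitions that were never observed get no entry, which corresponds to an estimated probability of 0.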