Let's say you are given a '+'-shaped MDP with five states (A through E) and a discount factor gamma of 1:
Given MDP
_ A _
B C D
_ E _
The input policy π is as follows:
A -> Terminal
B -> C
C -> D
D -> Terminal
E -> C
Let's say, however, that you observe the following training episodes:
Episode 1:
B, east, C, -1
C, east, D, -1
D, exit, x, +10
Episode 2:
B, east, C, -1
C, east, D, -1
D, exit, x, +10
Episode 3:
E, north, C, -1
C, east, D, -1
D, exit, x, +10
Episode 4:
E, north, C, -1
C, north, A, -1
A, exit, x, -10
What are the output values based on these episodes?
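With direct evaluation (first-visit-style Monte Carlo averaging), the value of each state is the average of the total discounted returns observed from every visit to that state across all episodes. The sketch below transcribes the four episodes above and computes those averages; the tuple layout `(state, action, next_state, reward)` is just one convenient encoding, not part of the original problem statement.

```python
from collections import defaultdict

# Episodes transcribed from the text above, as (state, action, next_state, reward).
episodes = [
    [("B", "east", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("B", "east", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("E", "north", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("E", "north", "C", -1), ("C", "north", "A", -1), ("A", "exit", "x", -10)],
]

gamma = 1.0  # discount factor given in the problem

returns = defaultdict(list)
for episode in episodes:
    # Walk backwards so each step's return is reward + gamma * (return from next step).
    g = 0.0
    for state, action, next_state, reward in reversed(episode):
        g = reward + gamma * g
        returns[state].append(g)

# Direct evaluation: average the sampled returns per state.
values = {s: sum(rs) / len(rs) for s, rs in returns.items()}
print(values)
```

Running this gives V(A) = -10, V(B) = +8, V(C) = +4, V(D) = +10, V(E) = -2. For example, C is visited four times with returns 9, 9, 9, and -11, which average to (9 + 9 + 9 - 11) / 4 = 4. Note how direct evaluation can give adjacent states inconsistent values (E is negative even though it transitions into C, which is positive), because each state's estimate is computed independently from its own samples.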