Hello Professor,
Since S5 has a terminal value of 100, the optimal strategy is to move toward S5. So for the states on the left side, S2, S3, and S4, the best action is R. For the states on the right side, S6, S7, and S8, the best action is L.
The terminal state values are:
V(S1) = 1.732
V(S5) = 100
V(S9) = 1.732
Because the MDP is symmetric around S5:
V(S2) = V(S8)
V(S3) = V(S7)
V(S4) = V(S6)
Let:
x = V(S2) = V(S8)
y = V(S3) = V(S7)
z = V(S4) = V(S6)
The discount factor is 0.9. There is no living reward, so the value comes only from discounted future values.
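For S4, moving right reaches S5 with probability 0.4, jumps two spaces to S6 (value z by symmetry) with probability 0.5, and stays in S4 with probability 0.1. This gives: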
z = 0.9(0.4(100) + 0.5z + 0.1z)
z = 0.9(40 + 0.6z)
z = 36 + 0.54z
0.46z = 36
z = 78.26
So:
V(S4) = V(S6) = 78.26
For S3, moving right gives:
y = 0.9(0.4z + 0.5(100) + 0.1y)
Substituting z = 78.26:
y = 0.9(0.4(78.26) + 50 + 0.1y)
y = 73.17 + 0.09y
0.91y = 73.17
y = 80.41
So:
V(S3) = V(S7) = 80.41
For S2, moving right gives:
x = 0.9(0.4y + 0.5z + 0.1x)
Substituting y = 80.41 and z = 78.26:
x = 0.9(0.4(80.41) + 0.5(78.26) + 0.1x)
x = 64.16 + 0.09x
0.91x = 64.16
x = 70.51
So:
V(S2) = V(S8) = 70.51
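As a quick check of the arithmetic, the three equations above can be solved in order in a few lines of Python. This is only a sketch: the names GAMMA, V_S5, and solve_symmetric_values are my own, and the coefficients are copied directly from the equations for z, y, and x.

```python
GAMMA = 0.9    # discount factor used in the equations above
V_S5 = 100.0   # terminal value of S5

def solve_symmetric_values():
    """Solve the z, y, x equations in order by isolating each unknown."""
    # z = GAMMA * (0.4*V_S5 + 0.5*z + 0.1*z)
    z = GAMMA * 0.4 * V_S5 / (1 - GAMMA * 0.6)
    # y = GAMMA * (0.4*z + 0.5*V_S5 + 0.1*y)
    y = GAMMA * (0.4 * z + 0.5 * V_S5) / (1 - GAMMA * 0.1)
    # x = GAMMA * (0.4*y + 0.5*z + 0.1*x)
    x = GAMMA * (0.4 * y + 0.5 * z) / (1 - GAMMA * 0.1)
    return x, y, z

x, y, z = solve_symmetric_values()
print(f"x = {x:.2f}, y = {y:.2f}, z = {z:.2f}")
# prints: x = 70.51, y = 80.41, z = 78.26
```

Solving without rounding the intermediate values gives the same two-decimal results as the hand calculation.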
Therefore, the final values are:
V(S1) = 1.732
V(S2) = 70.51
V(S3) = 80.41
V(S4) = 78.26
V(S5) = 100
V(S6) = 78.26
V(S7) = 80.41
V(S8) = 70.51
V(S9) = 1.732
Final answer:
[1.732, 70.51, 80.41, 78.26, 100, 78.26, 80.41, 70.51, 1.732]
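The full vector can also be cross-checked numerically by iterating the same backups over all nine states without using the symmetry shortcut. The sketch below assumes the outcome model implied by the equations (0.4 for one step, 0.5 for two steps, 0.1 for staying put) and the policy stated at the start (R on the left of S5, L on the right). The names OUTCOMES, TERMINAL, and evaluate_policy are hypothetical.

```python
GAMMA = 0.9
# Terminal values for S1, S5, S9, keyed by 0-based index.
TERMINAL = {0: 1.732, 4: 100.0, 8: 1.732}
# Assumed outcomes: (probability, steps taken in the chosen direction).
OUTCOMES = [(0.4, 1), (0.5, 2), (0.1, 0)]

def evaluate_policy(sweeps=500):
    """Iteratively evaluate the 'move toward S5' policy on states S1..S9."""
    values = [TERMINAL.get(i, 0.0) for i in range(9)]
    for _ in range(sweeps):
        new_values = list(values)
        for s in range(9):
            if s in TERMINAL:
                continue  # terminal values stay fixed
            direction = 1 if s < 4 else -1  # R left of S5, L right of S5
            new_values[s] = GAMMA * sum(
                p * values[s + direction * k] for p, k in OUTCOMES
            )
        values = new_values
    return values

print([round(v, 2) for v in evaluate_policy()])
```

After enough sweeps the non-terminal entries agree with the values listed above to two decimal places.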
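The values do not increase steadily toward S5 because there is a chance of jumping two spaces: V(S3) = 80.41 is higher than V(S4) = 78.26. From S3, the two-step jump (probability 0.5) lands directly on S5, while from S4 only the one-step move (probability 0.4) reaches S5 and the two-step jump overshoots to S6. This means that being one step away from S5 is not always better than being two steps away; it depends on the transition probabilities.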