By symmetry, the optimal policy is to go left at S6 and to go right at S4 (in both cases, toward the reward state S5). Call this policy p.
1. Policy evaluation:
Then, by the same symmetry, we can write Vp(S2) = Vp(S8) = a, Vp(S3) = Vp(S7) = b, and Vp(S4) = Vp(S6) = c.
In terms of equations, with discount factor 0.9 and the transition model these equations encode (the chosen action moves one step with probability 0.4, two steps with probability 0.5, and stays put with probability 0.1):
From S4, one step reaches S5 (reward 100) and two steps reaches S6, whose value is also c:
c = 0.9 * (0.4 * 100 + 0.5 * c + 0.1 * c). That is, c = 36 + 0.54c, so c = 36/0.46 = 78.26.
From S3, one step reaches S4 and two steps reaches S5:
b = 0.9 * (0.4 * c + 0.5 * 100 + 0.1 * b). That is, 0.91b = 45 + 0.36 * 78.26 = 73.17, so b = 80.41.
Similarly, from S2, one step reaches S3 and two steps reaches S4:
a = 0.9 * (0.4 * b + 0.5 * c + 0.1 * a). That is, 0.91a = 0.36b + 0.45c = 64.16, so a = 70.51.
This is policy evaluation.
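The three evaluation equations form a triangular linear system (c depends only on itself, b on c, and a on b and c), so they can be solved by back-substitution. A minimal sketch in Python, assuming the dynamics read off the equations above (one step with probability 0.4, two steps with probability 0.5, stay with probability 0.1; discount 0.9):

```python
# Policy evaluation for policy p by back-substitution.
# Assumed dynamics (read off the Bellman equations above): the chosen
# action moves one step w.p. 0.4, two steps w.p. 0.5, stays w.p. 0.1;
# discount factor 0.9; reaching S5 yields 100.
gamma = 0.9

# c = gamma * (0.4*100 + 0.5*c + 0.1*c)  =>  c * (1 - 0.54) = 36
c = 36 / (1 - gamma * 0.6)

# b = gamma * (0.4*c + 0.5*100 + 0.1*b)  =>  b * (1 - 0.09) = 45 + 0.36*c
b = (45 + gamma * 0.4 * c) / (1 - gamma * 0.1)

# a = gamma * (0.4*b + 0.5*c + 0.1*a)  =>  a * (1 - 0.09) = 0.36*b + 0.45*c
a = (gamma * 0.4 * b + gamma * 0.5 * c) / (1 - gamma * 0.1)

print(round(a, 2), round(b, 2), round(c, 2))  # -> 70.51 80.41 78.26
```

This reproduces the hand-computed values a = 70.51, b = 80.41, c = 78.26.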
2. Policy Improvement: Now check whether the policy can be improved by switching the action in some state: for each state, compare the expected value under the current action with that under the alternative action. By symmetry it suffices to check S2, S3, and S4; we show S2 (the checks at S3 and S4 go the same way, since the current action there also leads toward S5).
S2:
Expected value of going Left (the current action), using the same transition terms as the evaluation equation for a:
0.4×V(S3)+0.5×V(S4)+0.1×V(S2)=0.4×80.41+0.5×78.26+0.1×70.51≈78.35
Expected value of going Right (the alternative action), with V(S1) = 1.732 as given:
0.4×V(S1)+0.5×V(S4)+0.1×V(S4)=0.4×1.732+0.5×78.26+0.1×78.26≈47.65
(The discount factor 0.9 multiplies both quantities, so it is omitted; it does not affect the comparison.) Since 78.35 > 47.65, the optimal action at S2 remains Left.
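The comparison at S2 can be checked numerically. A sketch using the evaluated values, where the Left expression mirrors the evaluation equation for a and V(S1) = 1.732 is taken as given in the problem:

```python
# Pre-discount expected next-state values at S2 for both actions.
# V(S1) = 1.732 is given; the other values come from policy evaluation.
V = {"S1": 1.732, "S2": 70.51, "S3": 80.41, "S4": 78.26}

# Current action (Left): same transition terms as the equation for a.
left = 0.4 * V["S3"] + 0.5 * V["S4"] + 0.1 * V["S2"]
# Alternative action (Right).
right = 0.4 * V["S1"] + 0.5 * V["S4"] + 0.1 * V["S4"]

print(round(left, 2), round(right, 2))
assert left > right  # the policy's action at S2 is not changed
```

Since the discount factor 0.9 would multiply both quantities, it is left out of the comparison.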
3. Update Policy:
Since no action changed in step 2, the policy remains the same.
4. Repeat:
Policy iteration terminates once an improvement pass changes nothing, so no further iterations are needed.
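Step 1 could also have been done by iterative policy evaluation: repeatedly applying the Bellman backup for p converges to the same fixed point, since the backup is a contraction with factor 0.9. A sketch under the same assumed dynamics:

```python
# Iterative policy evaluation for policy p: apply the Bellman backup
# until the values stop changing (0.9-contraction, so 500 sweeps is
# far more than enough for convergence to print precision).
gamma = 0.9
a = b = c = 0.0
for _ in range(500):
    a, b, c = (
        gamma * (0.4 * b + 0.5 * c + 0.1 * a),    # S2
        gamma * (0.4 * c + 0.5 * 100 + 0.1 * b),  # S3
        gamma * (0.4 * 100 + 0.5 * c + 0.1 * c),  # S4
    )
print(round(a, 2), round(b, 2), round(c, 2))  # -> 70.51 80.41 78.26
```

This agrees with the closed-form solution of the linear system, confirming that a, b, c above are the values of policy p.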
Thus, policy p survives the improvement step and is therefore optimal, and we can claim that V*(S2) = Vp(S2) = 70.51.