By symmetry, we can guess that the optimal policy is to go left at S6 and to go right at S4. Call this policy p. Then we can write Vp(S2) = a = Vp(S8), Vp(S3) = b = Vp(S7), and Vp(S4) = c = Vp(S6).
In terms of equations, we can write:
c = 0.9 * (0.4 * 100 + 0.5 * c + 0.1 * c). That is, c = 36 + 0.54c. That is, c = 36/0.46 = 78.26.
Similarly, we can write:
b = 0.9 * (0.4 * c + 0.5 * 100 + 0.1 * b). That is, b = 45 + 0.36 c + 0.09 b. That is, b * 0.91 = 45 + 0.36 * c.
Using c = 78.26, we get b = (45 + 28.17)/0.91 = 80.41.
Similarly, we can solve for a:
a = 0.9 * (0.4 * b + 0.5 * c + 0.1 * a). That is, a * 0.91 = 0.36 b + 0.45 c, so a = (0.36 b + 0.45 c)/0.91. Using b = 80.41 and c = 78.26, we get a = 70.51.
This is only policy evaluation. We still need to check that policy p is optimal; if it is, we can claim that V*(S2) = Vp(S2) = a.
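As a sanity check on the arithmetic, the three equations for a, b and c can also be solved by iterative policy evaluation, i.e. by applying them repeatedly until the values stop changing. This is only a sketch: the coefficients are copied from the equations above, while the names GAMMA, GOAL_VALUE and evaluate_policy are made up here, and reading the constant 100 as a goal value is an assumption.

```python
# Iterative policy evaluation for the three symmetric values a, b, c.
# Coefficients come straight from the hand-written equations; the names
# below are illustrative only.

GAMMA = 0.9         # discount factor (the 0.9 in every equation)
GOAL_VALUE = 100.0  # the constant 100 in the equations (assumed goal value)


def evaluate_policy(tol=1e-12, max_iters=10_000):
    """Iterate the three equations from zero until they reach a fixed point."""
    a = b = c = 0.0
    for _ in range(max_iters):
        new_c = GAMMA * (0.4 * GOAL_VALUE + 0.5 * c + 0.1 * c)
        new_b = GAMMA * (0.4 * c + 0.5 * GOAL_VALUE + 0.1 * b)
        new_a = GAMMA * (0.4 * b + 0.5 * c + 0.1 * a)
        # Largest change across the three values in this sweep.
        delta = max(abs(new_a - a), abs(new_b - b), abs(new_c - c))
        a, b, c = new_a, new_b, new_c
        if delta < tol:
            break
    return a, b, c


a, b, c = evaluate_policy()
print(f"c = {c:.2f}, b = {b:.2f}, a = {a:.2f}")
# prints: c = 78.26, b = 80.41, a = 70.51
```

The iteration converges because every equation is scaled by the discount factor 0.9, so each sweep shrinks the error. Like the hand calculation, it only evaluates policy p and says nothing about whether p is optimal.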