At first glace, with its 17 empty spaces, this problem looks long. However, we can take advantage of the symmetry to reduce the number of variables to solve for to four.
01 x1 00 x1 01
x1 x2 x3 x2 x1
00 x3 x4 x3 00
x1 x2 x3 x2 x1
01 x1 00 x1 01
We can also see the optimal policies for each cell and avoid the need to do multiple calculations and calculate a maximum.
- Spaces with x1 should aim toward the terminal states with 1
- Spaces with x2 should move toward the x1 spaces
- Spaces with x3 should move toward x2 (since x4 in the middle is the farthest away from the terminals)
- Spaces with x4 should move toward x3 (no real choice)
Now, the value given a policy pi and the state s is
V(s) = sum (over all states s) of T(s,pi(s),s')*[R(s,pi(s),s') + gamma*V(s')]
x1 = 0.8(0 + 0.9*1) + 0.1(0 + 0.9*x1) + 0.1(0+ 0.9*x2)
x1 = 0.72 + 0.09*x1 + 0.09*x2
x2 = 0.8(0 + 0.9*x1) + 0.1(0 + 0.9*x1) + 0.1(0 + 0.9*x3)
x2 = 0.81*x1 + 0.09*x3
x3 = 0.8(0 + 0.9*x2) + 0.1(0 + 0.9*0) + 0.1(0 + 0.9*x4)
x3 = 0.72*x2 + 0.09*x4
x4 = 0.8(0 + 0.9*x3) + 0.1(0 + 0.9*x3) + 0.1(0 + 0.9*x3)
x4 = 0.90*x3