Here I have formed solution by using policy iteration approach. I am assuming it will move to the right always. Only one action per state. We are given a discount factor (γ) of 0.9 and noise of 0.8, 0.1, 0.1. The first column has terminal states with a value of -1, and the last column has terminal states with a value of 1. There are no living rewards.
a - go to the right always, fixed policy.
We have the following grid world:
It can be shown as :
Policy Iteration:
A = 0.9* (0.8 * B + 0.1 * A + 0.1 * C)
B = 0.9* (0.8 * 1 + 0.1 * B + 0.1 * D )
C = 0.9* (0.8 * D + 0.1 * A + 0.1 * E )
D = 0.9* (0.8 * 1 + 0.1 * B + 0.1 * F )
E = 0.9* (0.8 * F + 0.1 * C + 0.1 * E )
F = 0.9* (0.8 * 1 + 0.1 * D + 0.1 * F )
--------------------------------------------------
A = 0.72B + 0.09A + 0.09C
B = 0.72 + 0.09B + 0.09D
C = 0.72D + 0.09A + 0.09E
D = 0.72 + 0.09B + 0.09F
E = 0.72F + 0.09C + 0.09E
F = 0.72 + 0.09D + 0.09F
-------------------------------------------------
0.72B + 0.09A + 0.09C - A = 0
0.72 + 0.09B + 0.09D - B = 0
0.72D + 0.09A + 0.09E - C = 0
0.72 + 0.09B + 0.09F - D = 0
0.72F + 0.09C + 0.09E - E = 0
0.72 + 0.09D + 0.09F - F = 0
------------------------------------------------
Now we have :
0.72B - 0.91A + 0.09C = 0
0.72 - 0.91B + 0.09D = 0
0.72D + 0.09A + 0.09E - C = 0
0.72 + 0.09B + 0.09F - D = 0
0.72F + 0.09C - 0.91E = 0
0.72 + 0.09D - 0.91F = 0
We can see that :
0.72 - 0.91F + 0.09D = 0
0.72 - 0.91B + 0.09D = 0
It means that B = F
---------------------------------
Now we have :
0.72B - 0.91A + 0.09C = 0
0.72 - 0.91B + 0.09D = 0
0.72D + 0.09A + 0.09E - C = 0
0.72 + 0.09B + 0.09B - D = 0
0.72B + 0.09C - 0.91E = 0
We can see that :
- 0.09C + 0.91E = 0.91A - 0.09C
It means that : A = E
----------------------------------
Now we have :
0.72B - 0.91A + 0.09C = 0
0.72 - 0.91B + 0.09D = 0
0.72D + 0.18A - C = 0
0.72 + 0.18B - D = 0
0.72B + 0.09C - 0.91A = 0
We can see that
0.91B - 0.09D = -0.18B + D
1.09B = 1.09D
B = D = F
----------------------------------
Now we have :
0.72B - 0.91A + 0.09C = 0
0.72 - 0.91B + 0.09D = 0
0.72B + 0.18A - C = 0
0.72 + 0.18B - B = 0
We can see that:
- 0.91A + 0.09C = 0.18A - C
A = C = E
So it can be shown as :
0.72B - 0.82A = 0
0.72 - 0.82B = 0
--------------------------------------------------------
B = D = F = 0.72/0.82 ≈ 0.88
A = C = E = ( 0.72 * B ) / 0.82 = (0.72 * 0.878) / 0.82 ≈ 0.77
So it can be shown as :
-1 | 0.77 | 0.88 | 1 |
-1 | 0.77 | 0.88 | 1 |
-1 | 0.77 | 0.88 | 1 |