0 votes

Estimate the final V* values for a given grid world. 

  • Discount (gamma) is 0.9.
  • Noise (probability is 0.8, 0.1, 0.1.  That is, when moving in a direction, there is 80% probablity of going in that direction and 10% probability of going in one of two perpendicular directions.  
  • Terminal states are given in the table.  All non terminal states are left blank and need to be solved.
  • Living reward (R(s,a,s')) is 0.

Instead of iteration, use the solving approach.

1

0

1

0

0

1

0

1

in MDP by AlgoMeister (1.6k points)

1 Answer

+1 vote
 
Best answer
At first glace, with its 17 empty spaces, this problem looks long. However, we can take advantage of the symmetry to reduce the number of variables to solve for to four.

01 x1 00 x1 01
x1 x2 x3 x2 x1
00 x3 x4 x3 00
x1 x2 x3 x2 x1
01 x1 00 x1 01

We can also see the optimal policies for each cell and avoid the need to do multiple calculations and calculate a maximum.

- Spaces with x1 should aim toward the terminal states with 1
- Spaces with x2 should move toward the x1 spaces
- Spaces with x3 should move toward x2 (since x4 in the middle is the farthest away from the terminals)
- Spaces with x4 should move toward x3 (no real choice)

Now, the value given a policy pi and the state s is

V(s) = sum (over all states s) of T(s,pi(s),s')*[R(s,pi(s),s') + gamma*V(s')]

x1 = 0.8(0 + 0.9*1) + 0.1(0 + 0.9*x1) + 0.1(0+ 0.9*x2)
x1 = 0.72 + 0.09*x1 + 0.09*x2

x2 = 0.8(0 + 0.9*x1) + 0.1(0 + 0.9*x1) + 0.1(0 + 0.9*x3)
x2 = 0.81*x1 + 0.09*x3

x3 = 0.8(0 + 0.9*x2) + 0.1(0 + 0.9*0) + 0.1(0 + 0.9*x4)
x3 = 0.72*x2 + 0.09*x4

x4 = 0.8(0 + 0.9*x3) + 0.1(0 + 0.9*x3) + 0.1(0 + 0.9*x3)
x4 = 0.90*x3
by AlgoMeister (568 points)
selected by

Related questions

0 votes
1 answer
asked Feb 23, 2021 in MDP by Amrinder Arora AlgoMeister (1.6k points)
0 votes
2 answers
0 votes
1 answer
0 votes
1 answer
0 votes
1 answer
The Book: Analysis and Design of Algorithms | Presentations on Slideshare | Lecture Notes, etc
...