0 votes

Estimate the final V* values for a given grid world.  Discount, noise, terminal states are all given.  You are given a 3x4 grid world and you want to find V* values.  There is a discount rate of 0.9 and noise of 0.8, 0.1, 0.1.  The first column has all terminal states with value -1 and the last column has terminal states with value 1.  There is no living reward.

-1

 

 

1

-1

 

 

1

-1

 

 

1

in Informed Search by AlgoMeister (1.6k points)
The noise of 0.8, 0.1, 0.1 means that there is a 0.8 probability of going in the attempted direction and 0.1 in each of the perpendicular directions. Of course, if there is a wall in any direction, the movement does not happen.

1 Answer

+1 vote
 
Best answer

Here I have formed solution by using policy iteration approach. I am assuming it will move to the right always. Only one action per state. We are given a discount factor (γ) of 0.9 and noise of 0.8, 0.1, 0.1. The first column has terminal states with a value of -1, and the last column has terminal states with a value of 1. There are no living rewards.

a - go to the right always, fixed policy.


We have the following grid world:
 

-1001
-1001
-1001

It can be shown as :

-1AB1
-1CD1
-1EF1

Policy Iteration:

A = 0.9* (0.8 * B + 0.1 * A + 0.1 * C)

B = 0.9* (0.8 * 1 + 0.1 * B + 0.1 * D )

C = 0.9* (0.8 * D + 0.1 * A + 0.1 * E )

D = 0.9* (0.8 * 1 + 0.1 * B + 0.1 * F )

E = 0.9* (0.8 * F + 0.1 * C + 0.1 * E )

F = 0.9* (0.8 * 1 + 0.1 * D + 0.1 * F )

--------------------------------------------------

A = 0.72B + 0.09A  + 0.09C

B = 0.72 + 0.09B + 0.09D

C = 0.72D + 0.09A  + 0.09E

D = 0.72 + 0.09B + 0.09F

E = 0.72F + 0.09C  + 0.09E

F = 0.72 + 0.09D + 0.09F

-------------------------------------------------

0.72B + 0.09A  + 0.09C - A = 0

0.72 + 0.09B + 0.09D - B   = 0

0.72D + 0.09A  + 0.09E - C = 0

0.72 + 0.09B + 0.09F - D = 0

0.72F + 0.09C  + 0.09E - E = 0

0.72 + 0.09D + 0.09F - F = 0

------------------------------------------------

Now we have : 

0.72B - 0.91A  + 0.09C  = 0

0.72  - 0.91B  + 0.09D  = 0

0.72D + 0.09A  + 0.09E - C = 0

0.72 + 0.09B + 0.09F - D = 0

0.72F + 0.09C  - 0.91E  = 0

0.72 + 0.09D  - 0.91F   =  0

We can see that : 

0.72 -  0.91F  + 0.09D   =  0

0.72  - 0.91B  + 0.09D   =  0

It means that B = F

---------------------------------

Now we have : 

0.72B - 0.91A  + 0.09C  = 0

0.72  - 0.91B  + 0.09D  = 0

0.72D + 0.09A  + 0.09E - C = 0

0.72 + 0.09B + 0.09B - D = 0

0.72B + 0.09C  - 0.91E  = 0

We can see that : 

- 0.09C + 0.91E  = 0.91A  - 0.09C

It means that : A = E

----------------------------------

Now we have : 

0.72B - 0.91A  + 0.09C  = 0

0.72  - 0.91B  + 0.09D  = 0

0.72D + 0.18A   - C = 0

0.72 +  0.18B - D = 0

0.72B + 0.09C  - 0.91A  = 0

We can see that

0.91B  - 0.09D  = -0.18B + D

1.09B = 1.09D

B = D = F

----------------------------------

Now we have : 

0.72B - 0.91A  + 0.09C  = 0

0.72  - 0.91B  + 0.09D  = 0

0.72B + 0.18A  - C = 0

0.72 +  0.18B - B = 0

We can see that:

- 0.91A  + 0.09C  = 0.18A  - C 

A = C = E


So it can be shown as :

0.72B - 0.82A  = 0

0.72  - 0.82B  = 0

--------------------------------------------------------

B = D = F = 0.72/0.82  0.88

A =  C =  E  =  ( 0.72 * B ) / 0.82 = (0.72 * 0.878) /  0.82   0.77

So it can be shown as :

-10.770.881
-10.770.881
-10.770.881
 

by (248 points)
selected by
We need to use policy iteration/policy evaluation in this question.  That is the reason this question was in the exam.  The answers also do not look correct to me.

Related questions

0 votes
1 answer
asked Mar 30, 2021 in MDP by Amrinder Arora AlgoMeister (1.6k points)
0 votes
1 answer
asked Feb 23, 2021 in MDP by Amrinder Arora AlgoMeister (1.6k points)
0 votes
1 answer
0 votes
1 answer
asked May 13, 2019 in Informed Search by Amrinder Arora AlgoMeister (1.6k points)
The Book: Analysis and Design of Algorithms | Presentations on Slideshare | Lecture Notes, etc
...