0 votes

You are traveling on a straight road, but have a jumpy car.  The car sometimes "jumps" (moves double).  At other times, it doesn't move at all.  The following MDP has been created to model this behavior and the landscape.

Estimate the V* values (optimal values for the states) for this MDP:

S1        S2    S3    S4    S5     S6    S7    S8    S9
sqrt(3)   -     -     -     100    -     -     -     sqrt(3)

The MDP is defined as follows: There are two actions, L (Left) and R (Right).  When moving left, there is a 40% chance of moving left one spot, a 50% chance of moving DOUBLE (2 spots left), and a 10% chance of not moving at all.  Similarly, when moving right, there is a 40% chance of moving right one spot, a 50% chance of moving DOUBLE (2 spots right), and a 10% chance of not moving at all.

S1, S5 and S9 are terminal states with values sqrt(3), 100 and sqrt(3) respectively.  

Discount factor is 0.9 (so, gamma = 0.9).

There is no living reward, that is R(s,a,s’) = 0.
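For anyone who wants to sanity-check their estimates numerically, here is a minimal value-iteration sketch (not part of the original question). It assumes that a move or jump that would go past either end of the road is clamped to the nearest terminal state, since the problem statement does not specify that boundary case.

import math

GAMMA = 0.9
N = 9                                    # states S1..S9 stored at indices 0..8
terminal = {0: math.sqrt(3), 4: 100.0, 8: math.sqrt(3)}

def outcomes(s, direction):
    """Transition model: (probability, next state) pairs for -1 = Left, +1 = Right."""
    clamp = lambda i: max(0, min(N - 1, i))
    return [(0.4, clamp(s + direction)),      # normal move
            (0.5, clamp(s + 2 * direction)),  # "jump": double move (clamped at the ends -- an assumption)
            (0.1, s)]                         # car does not move at all

V = [0.0] * N
for i, v in terminal.items():
    V[i] = v                                  # terminal values are fixed

for _ in range(200):                          # enough sweeps to converge for this small chain
    for s in range(N):
        if s in terminal:
            continue
        V[s] = max(GAMMA * sum(p * V[t] for p, t in outcomes(s, d))
                   for d in (-1, +1))

print([round(v, 2) for v in V])
# Under these assumptions: V*(S2) ≈ 70.51, V*(S3) ≈ 80.41, V*(S4) ≈ 78.26, mirrored on the right half.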

in MDP by AlgoMeister (1.6k points)

2 Answers

0 votes

By using symmetry, we can intuit that at S6 the optimal policy is to go left, and at S4 the optimal policy is to go right.  Suppose we call this policy p.  Then we can write Vp(S2) = a = Vp(S8), Vp(S3) = b = Vp(S7), and Vp(S4) = c = Vp(S6).

In terms of equations, we can write as:

c = 0.9 * (0.4 * 100 + 0.5 * c + 0.1 * c).   That is, c = 36 + 0.54c.  That is, c = 36/0.46 = 78.26.

Similarly, we can write: 

b = 0.9 * (0.4 * c + 0.5 * 100 + 0.1 * b).   That is, b = 45 + 0.36 c + 0.09 b.  That is, b * 0.91 = 45 + 0.36 * c.

Using c = 78.26, we get b = 80.41

Similarly, we can solve for a:

a = 0.9 * (0.4 * b + 0.5 * c + 0.1 * a).   That is, a = (0.36 b + 0.45 c)/0.91.  That is, a = 70.51129.

This is policy evaluation.  We still need to check that this policy is optimal, and if that is correct, we can claim that V*(S2) = Vp(S2).
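To double-check the arithmetic, here is a small NumPy sketch (mine, not part of the original answer) that solves the same three equations as a linear system:

import numpy as np

# Unknowns ordered as [a, b, c] = [Vp(S2), Vp(S3), Vp(S4)], using the equations above:
#   0.91a - 0.36b - 0.45c = 0    (from a = 0.9*(0.4b + 0.5c + 0.1a))
#   0.91b - 0.36c         = 45   (from b = 0.9*(0.4c + 0.5*100 + 0.1b))
#   0.46c                 = 36   (from c = 0.9*(0.4*100 + 0.5c + 0.1c))
A = np.array([[0.91, -0.36, -0.45],
              [0.00,  0.91, -0.36],
              [0.00,  0.00,  0.46]])
rhs = np.array([0.0, 45.0, 36.0])

a, b, c = np.linalg.solve(A, rhs)
print(round(a, 2), round(b, 2), round(c, 2))   # 70.51 80.41 78.26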

by AlgoMeister (1.6k points)
0 votes
By using symmetry, we can infer that at S6 the optimal policy is to go left, and at S4 the optimal policy is to go right.  Call this policy p.

1. Policy evaluation:

Then, we can write that Vp(S2) = a = Vp(S8).   Vp(S3) = b = Vp(S7).   Vp(S4) = c = Vp(S6)

In terms of equations, it can be written as:

c = 0.9 * (0.4 * 100 + 0.5 * c + 0.1 * c).   That is, c = 36 + 0.54c.  That is, c = 36/0.46 = 78.26.

Similarly, we can write:

b = 0.9 * (0.4 * c + 0.5 * 100 + 0.1 * b).   That is, b = 45 + 0.36 c + 0.09 b.  That is, b * 0.91 = 45 + 0.36 * 78.26 => b * 0.91 = 73.1736 => b=80.41

Similarly, we can solve for a:

a = 0.9 * (0.4 * b + 0.5 * c + 0.1 * a).   That is, a = (0.36 b + 0.45 c)/0.91 => a = 64.1646/0.91 => a = 70.51.

This is policy evaluation.

2. Policy Improvement: Now, let's check if we can improve the policy by considering alternative actions for each state. We'll compare the expected values under the current policy and under alternative actions.

S2 (current policy action: Right):

Expected value of going Right:

0.9 × (0.4×V(S3) + 0.5×V(S4) + 0.1×V(S2)) = 0.9 × (0.4×80.41 + 0.5×78.26 + 0.1×70.51) ≈ 70.51

Expected value of going Left (a double move past S1 is assumed to end at S1, since the problem does not specify this boundary case):

0.9 × (0.4×V(S1) + 0.5×V(S1) + 0.1×V(S2)) = 0.9 × (0.4×1.732 + 0.5×1.732 + 0.1×70.51) ≈ 7.75

Since 70.51 > 7.75, the optimal action at S2 remains Right.  The checks at S3 and S4 (and, by symmetry, at S6, S7 and S8) are analogous.

3. Update Policy:

Since no changes were made to the policy in step 2, the policy remains the same.

4. Repeat:

Since there were no changes in step 3, we don't need to repeat the process.

Thus, the policy p we evaluated is indeed optimal, and we can claim that V*(S2) = Vp(S2) ≈ 70.51 (and likewise for the other states).
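For completeness, a tiny sketch of the S2 check above (my own; it assumes, as in the check, that a double move past S1 ends at S1):

import math

gamma = 0.9
V = {"S1": math.sqrt(3), "S2": 70.51, "S3": 80.41, "S4": 78.26}

# Q-values for the two actions at S2 (double move past S1 assumed to land on S1)
q_right = gamma * (0.4 * V["S3"] + 0.5 * V["S4"] + 0.1 * V["S2"])
q_left  = gamma * (0.4 * V["S1"] + 0.5 * V["S1"] + 0.1 * V["S2"])

print(round(q_right, 2), round(q_left, 2))     # 70.51 7.75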
by AlgoMeister (900 points)
