
You are traveling on a straight road, but have a jumpy car.  The car sometimes "jumps" (moves double).  At other times, it doesn't move at all.  The following MDP has been created to model this behavior and the landscape.

Estimate the V* values (optimal values for the states) for this MDP:

S1: sqrt(3) | S2 | S3 | S4 | S5: 100 | S6 | S7 | S8 | S9: sqrt(3)

The MDP is defined as follows: there are two actions, L (Left) and R (Right).  When moving left, there is a 40% chance of moving left one spot, a 50% chance of moving DOUBLE (two spots left), and a 10% chance of not moving at all.  Similarly, when moving right, there is a 40% chance of moving right one spot, a 50% chance of moving DOUBLE (two spots right), and a 10% chance of not moving at all.

S1, S5 and S9 are terminal states with values sqrt(3), 100 and sqrt(3) respectively.  

Discount factor is 0.9 (so, gamma = 0.9).

There is no living reward, that is R(s,a,s’) = 0.
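
For a quick numerical check of the estimates below, here is a minimal value-iteration sketch in Python. It treats the terminal exit values as fixed, lets the double move from S4 jump over S5 and land on S6 (as in the answers below), and assumes that a move which would run off the road is clamped to the end square, since the problem statement does not specify that case.

    import math

    GAMMA = 0.9
    STATES = range(1, 10)
    # Terminal states and their exit values.
    TERMINAL = {1: math.sqrt(3), 5: 100.0, 9: math.sqrt(3)}

    def clamp(x):
        # Assumption: a move that would leave the road stays on the end square.
        return max(1, min(9, x))

    def transitions(s, a):
        """(probability, next state) pairs for action a ('L' or 'R') from s."""
        d = 1 if a == "R" else -1
        return [(0.4, clamp(s + d)), (0.5, clamp(s + 2 * d)), (0.1, s)]

    def value_iteration(tol=1e-10):
        V = {s: TERMINAL.get(s, 0.0) for s in STATES}
        while True:
            delta = 0.0
            for s in STATES:
                if s in TERMINAL:
                    continue  # terminal values stay fixed
                best = max(
                    sum(p * GAMMA * V[t] for p, t in transitions(s, a))
                    for a in ("L", "R")
                )
                delta = max(delta, abs(best - V[s]))
                V[s] = best
            if delta < tol:
                return V

    for s, v in value_iteration().items():
        print(f"V*(S{s}) = {v:.2f}")

Under these assumptions this converges to V*(S2) = V*(S8) ≈ 70.51, V*(S3) = V*(S7) ≈ 80.41, and V*(S4) = V*(S6) ≈ 78.26, which agrees with both answers below.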

asked in MDP by AlgoMeister (1.6k points)

2 Answers


By symmetry, we can intuit that the optimal policy moves toward S5: at S6 the optimal action is to go left, and at S4 it is to go right.  Call this policy p.  Then we can write Vp(S2) = a = Vp(S8), Vp(S3) = b = Vp(S7), and Vp(S4) = c = Vp(S6).

In terms of equations, we can write as:

From S4, going right reaches S5 (value 100) with probability 0.4, jumps over S5 to S6 (value c) with probability 0.5, and stays at S4 with probability 0.1.  So c = 0.9 * (0.4 * 100 + 0.5 * c + 0.1 * c).   That is, c = 36 + 0.54c.  That is, c = 36/0.46 = 78.26.

Similarly, we can write: 

b = 0.9 * (0.4 * c + 0.5 * 100 + 0.1 * b).   That is, b = 45 + 0.36 c + 0.09 b.  That is, b * 0.91 = 45 + 0.36 * c.

Using c = 78.26, we get b = 80.41

Similarly, we can solve for a:

a = 0.9 * (0.4 * b + 0.5 * c + 0.1 * a).   That is, a = (0.36 b + 0.45 c)/0.91.  That is, a = 70.51129.
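
A small sketch that solves the three simultaneous equations above as a linear system (using numpy here, which is a convenience of mine rather than anything in the original answer):

    import numpy as np

    # Bellman equations for the symmetric policy, rearranged into A x = d
    # with x = [a, b, c]:
    #   a - 0.09a - 0.36b - 0.45c = 0
    #   b - 0.09b - 0.36c        = 45
    #   c - 0.54c                = 36
    A = np.array([
        [0.91, -0.36, -0.45],
        [0.00,  0.91, -0.36],
        [0.00,  0.00,  0.46],
    ])
    d = np.array([0.0, 45.0, 36.0])
    a, b, c = np.linalg.solve(A, d)
    print(f"a = V(S2) = {a:.2f}, b = V(S3) = {b:.2f}, c = V(S4) = {c:.2f}")
    # Expected: a ≈ 70.51, b ≈ 80.41, c ≈ 78.26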

This is policy evaluation.  We still need to check that this policy is optimal, and if that is correct, we can claim that V*(S2) = Vp(S2).

by AlgoMeister (1.6k points)
By symmetry, we can infer that the optimal policy moves toward S5: at S6 the optimal policy is to go left, and at S4 it is to go right.  Call this policy p.

1. Policy evaluation:

Then, we can write that Vp(S2) = a = Vp(S8).   Vp(S3) = b = Vp(S7).   Vp(S4) = c = Vp(S6)

In terms of equations, it can be written as:

c = 0.9 * (0.4 * 100 + 0.5 * c + 0.1 * c).   That is, c = 36 + 0.54c.  That is, c = 36/0.46 = 78.26.

Similarly, we can write:

b = 0.9 * (0.4 * c + 0.5 * 100 + 0.1 * b).   That is, b = 45 + 0.36 c + 0.09 b.  That is, b * 0.91 = 45 + 0.36 * 78.26 => b * 0.91 = 73.1736 => b=80.41

Similarly, we can solve for a:

a = 0.9 * (0.4 * b + 0.5 * c + 0.1 * a).   That is, a = (0.36 b + 0.45 c)/0.91 => a = 64.1646/0.91 => a = 70.51.

This is policy evaluation.
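
As an alternative to solving the three equations in closed form, here is a short sketch of iterative policy evaluation (successive approximation) for the same fixed policy p; it simply repeats the Bellman backup for p until the values stop changing.

    import math

    GAMMA = 0.9
    # Fixed symmetric policy p: go Right at S2-S4, go Left at S6-S8.
    V = {s: 0.0 for s in range(1, 10)}
    V[1] = V[9] = math.sqrt(3)   # terminal exit values
    V[5] = 100.0

    def successors(s):
        """(probability, next state) pairs when following policy p from s."""
        d = 1 if s < 5 else -1   # always head toward S5
        return [(0.4, s + d), (0.5, s + 2 * d), (0.1, s)]

    # Successive approximation: sweep the non-terminal states until convergence.
    for _ in range(200):
        for s in (2, 3, 4, 6, 7, 8):
            V[s] = GAMMA * sum(p * V[t] for p, t in successors(s))

    print({f"S{s}": round(V[s], 2) for s in (2, 3, 4)})
    # Converges to a = V(S2) ≈ 70.51, b = V(S3) ≈ 80.41, c = V(S4) ≈ 78.26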

2. Policy Improvement: Now, let's check if we can improve the policy by considering alternative actions for each state. We'll compare the expected values under the current policy and under alternative actions.

S2 (the other non-terminal states follow by the same calculation and by symmetry):

Expected discounted value of going Right (the current policy action):

0.9 × (0.4×V(S3) + 0.5×V(S4) + 0.1×V(S2)) = 0.9 × (0.4×80.41 + 0.5×78.26 + 0.1×70.51) ≈ 70.51

Expected discounted value of going Left (assuming a double move past S1 simply lands on the terminal state S1, which the problem statement leaves unspecified):

0.9 × (0.4×V(S1) + 0.5×V(S1) + 0.1×V(S2)) = 0.9 × (0.4×1.732 + 0.5×1.732 + 0.1×70.51) ≈ 7.75

Since 70.51 > 7.75, the optimal action at S2 remains Right, consistent with the symmetric policy p.
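
The same S2 check as a short code sketch, reusing the values from step 1; the treatment of a double move past S1 as landing on S1 is an assumption, since the problem statement leaves it open.

    GAMMA = 0.9
    V = {"S1": 3 ** 0.5, "S2": 70.51, "S3": 80.41, "S4": 78.26}

    # Going Right: 0.4 -> S3, 0.5 -> S4 (double), 0.1 -> stay at S2.
    q_right = GAMMA * (0.4 * V["S3"] + 0.5 * V["S4"] + 0.1 * V["S2"])
    # Going Left: 0.4 -> S1, 0.5 -> S1 (double move clamped, an assumption), 0.1 -> stay.
    q_left = GAMMA * (0.4 * V["S1"] + 0.5 * V["S1"] + 0.1 * V["S2"])

    print(f"Q(S2, R) = {q_right:.2f}, Q(S2, L) = {q_left:.2f}")
    # Q(S2, R) ≈ 70.51 (= Vp(S2)), Q(S2, L) ≈ 7.75, so Right stays optimal.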

3. Update Policy:

Since no changes were made to the policy in step 2, the policy remains the same.

4. Repeat:

Since there were no changes in step 3, we don't need to repeat the process.

Thus, the symmetric policy p is indeed optimal, and we can claim that V*(S2) = Vp(S2).
by AlgoMeister (752 points)
