# load and inspect the example gridworld MDP
data(Maze)
Maze
# create several policies:
# 1. optimal policy using value iteration
maze_solved <- solve_MDP(Maze, method = "value_iteration")
maze_solved
pi_opt <- policy(maze_solved)
pi_opt
gridworld_plot_policy(add_policy(Maze, pi_opt), main = "Optimal Policy")
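# value iteration's result could be cross-checked with policy iteration;
# this sketch assumes solve_MDP() also accepts method = "policy_iteration"
maze_pi <- solve_MDP(Maze, method = "policy_iteration")
policy(maze_pi)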
# 2. a manual policy (go up everywhere; in some squares, go right)
acts <- rep("up", times = length(Maze$states))
names(acts) <- Maze$states
acts[c("s(1,1)", "s(1,2)", "s(1,3)")] <- "right"
pi_manual <- manual_MDP_policy(Maze, acts)
pi_manual
gridworld_plot_policy(add_policy(Maze, pi_manual), main = "Manual Policy")
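# actions can also be assigned programmatically; this sketch relies on
# the "s(row,col)" state naming used above to send the whole first row right
acts2 <- rep("up", times = length(Maze$states))
names(acts2) <- Maze$states
acts2[grep("^s\\(1,", Maze$states)] <- "right"
manual_MDP_policy(Maze, acts2)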
# 3. a random policy
set.seed(1234)
pi_random <- random_MDP_policy(Maze)
pi_random
gridworld_plot_policy(add_policy(Maze, pi_random), main = "Random Policy")
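# tabulating the chosen actions shows the mix produced by the random
# policy (this assumes the policy data.frame has an $action column)
table(pi_random$action)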
# 4. an improved policy obtained from one policy evaluation step
# followed by a policy improvement step (acting greedily on the Q-values)
u <- MDP_policy_evaluation(pi_random, Maze)
q <- q_values_MDP(Maze, U = u)
pi_greedy <- greedy_MDP_policy(q)
pi_greedy
gridworld_plot_policy(add_policy(Maze, pi_greedy), main = "Greedy Policy")
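# repeating evaluation and improvement until the policy no longer changes
# yields plain policy iteration. A minimal sketch, assuming the policy
# objects are data.frames with an $action column (as printed above):
pi_cur <- pi_random
repeat {
  u_cur <- MDP_policy_evaluation(pi_cur, Maze, k_backups = 100)
  pi_new <- greedy_MDP_policy(q_values_MDP(Maze, U = u_cur))
  if (identical(as.character(pi_new$action), as.character(pi_cur$action))) break
  pi_cur <- pi_new
}
gridworld_plot_policy(add_policy(Maze, pi_cur), main = "Policy Iteration")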
# compare the approximate value functions for the policies (we restrict
# the number of backups for the random policy since it may not converge)
rbind(
random = MDP_policy_evaluation(pi_random, Maze, k_backups = 100),
manual = MDP_policy_evaluation(pi_manual, Maze),
greedy = MDP_policy_evaluation(pi_greedy, Maze),
optimal = MDP_policy_evaluation(pi_opt, Maze)
)
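# a one-number summary per policy: the mean value over all states
# (a rough sketch; weighting by a start-state distribution would be finer)
mean(MDP_policy_evaluation(pi_greedy, Maze))
mean(MDP_policy_evaluation(pi_opt, Maze))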
# For many functions, we first add the policy to the problem description
# to create a "solved" MDP
maze_random <- add_policy(Maze, pi_random)
maze_random
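# the attached policy can be retrieved from the solved MDP with policy()
policy(maze_random)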
# plotting
plot_value_function(maze_random)
gridworld_plot_policy(maze_random)
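# the same plots for the optimal solution make a side-by-side comparison easy
plot_value_function(maze_solved)
gridworld_plot_policy(maze_solved)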
# compute the regret of the random policy relative to the optimal
# solution used as benchmark
regret(maze_random, benchmark = maze_solved)
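# the improved greedy policy should show a much smaller regret
regret(add_policy(Maze, pi_greedy), benchmark = maze_solved)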
# calculate the greedy action for state 1 (epsilon = 0 is pure greedy;
# prob = TRUE returns action probabilities instead of a single action)
q <- q_values_MDP(maze_random)
q
greedy_MDP_action(1, q, epsilon = 0, prob = FALSE)
greedy_MDP_action(1, q, epsilon = 0, prob = TRUE)
greedy_MDP_action(1, q, epsilon = .1, prob = TRUE)
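# sampling epsilon-greedy actions repeatedly illustrates the exploration
# mix: mostly the greedy action, a random action with probability epsilon
table(replicate(100, as.character(greedy_MDP_action(1, q, epsilon = .1))))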