library("pomdp")
data(Tiger)
# solve the POMDP for 5 epochs and no discounting
sol <- solve_POMDP(Tiger, horizon = 5, discount = 1, method = "enum")
sol
policy(sol)
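# uncomment the following line to visualize the policy as a graph
# (a sketch; assumes pomdp's plot_policy_graph() supports this solution)
# plot_policy_graph(sol)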
# uncomment the following line to register a parallel backend for simulation
# (needs the package doParallel installed)
# doParallel::registerDoParallel()
# foreach::getDoParWorkers()
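# the implicit cluster can be stopped again after the simulations
# doParallel::stopImplicitCluster()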
## Example 1: simulate 100 trajectories
sim <- simulate_POMDP(sol, n = 100, verbose = TRUE)
sim
# calculate the percentage that each action is used in the simulation
round_stochastic(sim$action_cnt / sum(sim$action_cnt), 2)
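# state and observation counts can be summarized the same way
# (a sketch; assumes simulate_POMDP() also returns state_cnt and obs_cnt)
# round_stochastic(sim$state_cnt / sum(sim$state_cnt), 2)
# round_stochastic(sim$obs_cnt / sum(sim$obs_cnt), 2)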
# reward distribution
hist(sim$reward)
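# the average simulated reward should be close to the expected reward of the
# policy (the second line is a sketch; it assumes pomdp's reward() accessor)
sim$avg_reward
# reward(sol)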
## Example 2: look at the visited belief states and the trajectories when
# starting from a given initial belief.
sim <- simulate_POMDP(sol, n = 100, belief = c(.5, .5),
return_beliefs = TRUE, return_trajectories = TRUE)
head(sim$belief_states)
head(sim$trajectories)
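# steps per simulated episode (a sketch; assumes the trajectories are returned
# as a data.frame with an episode column)
# table(sim$trajectories$episode)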
# plot with added density (the x-axis is the probability of the second state)
plot_belief_space(sol, sample = sim$belief_states, jitter = 2, ylim = c(0, 6))
lines(density(sim$belief_states[, 2], bw = .02)); axis(2); title(ylab = "Density")
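# a single belief update can also be computed directly (a sketch; assumes
# pomdp's update_belief() and the Tiger labels "listen" and "tiger-left")
# update_belief(sol, belief = c(.5, .5), action = "listen", observation = "tiger-left")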
## Example 3: simulate trajectories for the unsolved POMDP. Without a policy,
# an epsilon of 1 is used (i.e., all actions are chosen randomly). The
# simulation horizon for the infinite-horizon Tiger problem is determined
# using delta_horizon.
sim <- simulate_POMDP(Tiger, return_beliefs = TRUE, verbose = TRUE)
sim$avg_reward
hist(sim$reward, breaks = 20)
plot_belief_space(sol, sample = sim$belief_states, jitter = 2, ylim = c(0, 6))
lines(density(sim$belief_states[, 1], bw = .05)); axis(2); title(ylab = "Density")
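# the simulation horizon and the amount of random action selection can also be
# set explicitly (a sketch; assumes simulate_POMDP() accepts horizon and epsilon)
# sim <- simulate_POMDP(Tiger, n = 100, horizon = 100, epsilon = 1)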