pomdp (version 1.2.4)

regret: Calculate the Regret of a Policy

Description

Calculates the regret of a policy relative to a benchmark policy.

Usage

regret(policy, benchmark, start = NULL)

Value

The regret, i.e., the difference in expected long-term reward between the benchmark policy and the supplied policy.

Arguments

policy

a solved POMDP containing the policy to calculate the regret for.

benchmark

a solved POMDP with the (optimal) policy. Regret is calculated relative to this policy.

start

the start (belief) state used for the calculation. If NULL, then the start (belief) state of the benchmark is used.

Author

Michael Hahsler

Details

Regret is defined as \(V^{\pi^*}(s_0) - V^{\pi}(s_0)\) with \(V^\pi\) representing the expected long-term state value (represented by the value function) given the policy \(\pi\) and the start state \(s_0\). For POMDPs the start state is the start belief \(b_0\).

Note that regret is usually calculated relative to the optimal policy \(\pi^*\) as the benchmark. Since the optimal policy may not be known, the regret relative to the best known policy can be used instead.
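
The calculation can be sketched directly from the two value functions. The following is a minimal illustration only, assuming that the package's reward() accessor returns the expected long-term reward of a solved POMDP for its start belief:

# sketch: regret as the difference of the two expected long-term rewards
# (assumes reward() evaluates a solved POMDP at its start belief)
data(Tiger)
opt  <- solve_POMDP(Tiger)
appr <- solve_POMDP(Tiger, method = "enum", horizon = 10)
reward(opt) - reward(appr)   # should roughly match regret(appr, benchmark = opt)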

See Also

Other POMDP: MDP2POMDP, POMDP(), accessors, actions(), add_policy(), plot_belief_space(), projection(), reachable_and_absorbing, sample_belief_space(), simulate_POMDP(), solve_POMDP(), solve_SARSOP(), transition_graph(), update_belief(), value_function(), write_POMDP()

Other MDP: MDP(), MDP2POMDP, MDP_policy_functions, accessors, actions(), add_policy(), gridworld, reachable_and_absorbing, simulate_MDP(), solve_MDP(), transition_graph(), value_function()

Examples

data(Tiger)

sol_optimal <- solve_POMDP(Tiger)
sol_optimal

# perform exact value iteration for 10 epochs
sol_quick <- solve_POMDP(Tiger, method = "enum", horizon = 10)
sol_quick

regret(sol_quick, benchmark = sol_optimal)
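
# The start belief can also be given explicitly. This sketch assumes
# that a numeric belief vector over the Tiger problem's two states is
# accepted by the start argument (here a uniform belief).
regret(sol_quick, benchmark = sol_optimal, start = c(0.5, 0.5))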