
contextual (version 0.9.8.4)

OfflineBootstrappedReplayBandit: Bandit: Offline Bootstrapped Replay

Description

Bandit for the evaluation of policies with offline (logged) data, through replay with bootstrapping.

Usage

  bandit <- OfflineBootstrappedReplayBandit(formula,
                                            data, k = NULL, d = NULL,
                                            unique = NULL, shared = NULL,
                                            randomize = TRUE, replacement = TRUE,
                                            jitter = TRUE, arm_multiply = TRUE)
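
OfflineBootstrappedReplayBandit is an R6 class, so a concrete instance is created through its new() method (see Methods and Examples below). A minimal, hypothetical sketch with made-up column names:

  library(contextual)

  # Hypothetical logged data: reward y, chosen arm, and two context features.
  logged <- data.frame(y      = c(1, 0, 1, 0, 1, 0),
                       choice = c(1, 2, 1, 2, 1, 2),
                       x1     = c(0.1, 0.9, 0.3, 0.7, 0.2, 0.6),
                       x2     = c(0.5, 0.2, 0.8, 0.4, 0.9, 0.1))

  bandit <- OfflineBootstrappedReplayBandit$new(formula = y ~ choice | x1 + x2,
                                                data    = logged)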

Arguments

formula

formula (required). Format: y.context ~ z.choice | x1.context + x2.context + ... By default, an intercept is added to the context model. Exclude the intercept by adding "0" or "-1" to the list of contextual features, as in: y.context ~ z.choice | x1.context + x2.context - 1 (see the sketch following this argument list).

data

data.table or data.frame; offline data source (required)

k

integer; number of arms (optional). Optionally used to reformat the formula-defined x.context vector into a k x d matrix. When making use of such matrix-formatted contexts, define custom intercept(s), when and where needed, directly in the data.table or data.frame.

d

integer; number of contextual features (optional). Optionally used to reformat the formula-defined x.context vector into a k x d matrix. When making use of such matrix-formatted contexts, define custom intercept(s), when and where needed, directly in the data.table or data.frame.

randomize

logical; randomize rows of data stream per simulation (optional, default: TRUE)

replacement

logical; sample with replacement (optional, default: TRUE)

jitter

logical; add jitter to contextual features (optional, default: TRUE)

arm_multiply

logical; multiply the horizon by the number of arms (optional, default: TRUE)

multiplier

integer; the number of times the dataset is replicated before randomization. When arm_multiply has been set to TRUE, the number of replications is the number of arms times this integer. Useful when the Simulator's policy_time_loop has been set to TRUE, as a simulation might otherwise run out of pre-indexed data.

unique

integer vector; index of disjoint features (optional)

shared

integer vector; index of shared features (optional)
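
To make the formula and intercept options above concrete, here is a small, hedged sketch; the column names y, z, x1 and x2 are hypothetical:

library(contextual)
library(data.table)

dt <- data.table(y  = rbinom(100, 1, 0.5),              # logged reward
                 z  = sample(1:3, 100, replace = TRUE), # logged arm choice
                 x1 = runif(100),                       # context feature 1
                 x2 = runif(100))                       # context feature 2

# Default: an intercept is added to the context model.
b_with    <- OfflineBootstrappedReplayBandit$new(formula = y ~ z | x1 + x2, data = dt)

# Appending "-1" (or "0") to the contextual features drops the intercept.
b_without <- OfflineBootstrappedReplayBandit$new(formula = y ~ z | x1 + x2 - 1, data = dt)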

Methods

new(formula, data, k = NULL, d = NULL, unique = NULL, shared = NULL, randomize = TRUE, replacement = TRUE, jitter = TRUE, arm_multiply = TRUE)

generates and instantiates a new OfflineBootstrappedReplayBandit instance.

get_context(t)

argument:

  • t: integer, time step t.

returns a named list containing the current d x k dimensional matrix context$X, the number of arms context$k and the number of features context$d.

get_reward(t, context, action)

arguments:

  • t: integer, time step t.

  • context: list, containing the current context$X (d x k context matrix), context$k (number of arms) and context$d (number of context features) (as set by bandit).

  • action: list, containing action$choice (as set by policy).

returns a named list containing reward$reward and, where computable, reward$optimal (used by "oracle" policies and to calculate regret).

post_initialization()

Shuffles the offline data.table before the start of each individual simulation when self$randomize is TRUE (the default).
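
For illustration only, a hedged sketch of how these methods are called during replay; in practice the Simulator and Agent drive this loop, and the random "policy" below is just a stand-in:

library(contextual)
library(data.table)

dt     <- data.table(y  = rbinom(50, 1, 0.5),
                     z  = sample(1:2, 50, replace = TRUE),
                     x1 = runif(50))
bandit <- OfflineBootstrappedReplayBandit$new(formula = y ~ z | x1, data = dt)

bandit$post_initialization()                      # shuffle the logged rows
context <- bandit$get_context(t = 1)              # list: context$X, context$k, context$d
action  <- list(choice = sample(context$k, 1))    # stand-in for a policy decision
reward  <- bandit$get_reward(t = 1, context, action)
# In replay evaluation a reward is only credited when the chosen arm matches the
# arm that was actually logged at this row; non-matching samples are discarded.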

Details

The key assumption of the method is that the original logging policy chose i.i.d. arms uniformly at random.

Take care: if the original logging policy does not change over trials, data may be used more efficiently via propensity scoring (Langford et al., 2008; Strehl et al., 2011) and related techniques like doubly robust estimation (Dudik et al., 2011).
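
In essence, the method combines rejection-sampling replay (a logged interaction contributes only when the evaluated policy picks the same arm that was logged) with bootstrap resampling of the log, so that repeated resampled replays yield a distribution of estimates rather than a single number. A standalone, hypothetical sketch of this idea, assuming uniformly random logging and a trivial stand-in policy:

set.seed(42)

# Hypothetical log produced by a uniformly random policy over 3 arms.
log_data <- data.frame(arm    = sample(1:3, 1000, replace = TRUE),
                       reward = rbinom(1000, 1, 0.4))

# Policy under evaluation: a stand-in that always plays arm 2.
policy_choice <- function(row) 2

# Rejection-sampling replay: keep only the rows where the policy matches the log.
replay_estimate <- function(d) {
  keep <- vapply(seq_len(nrow(d)),
                 function(i) policy_choice(d[i, ]) == d$arm[i],
                 logical(1))
  mean(d$reward[keep])
}

# Bootstrap: resample the log with replacement and replay each resample.
estimates <- replicate(200, replay_estimate(log_data[sample(nrow(log_data), replace = TRUE), ]))
mean(estimates)                          # point estimate of the policy's reward
quantile(estimates, c(0.025, 0.975))     # bootstrap uncertainty interval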

References

Mary, J., Preux, P., & Nicol, O. (2014, January). Improving offline evaluation of contextual bandit algorithms via bootstrapping techniques. In International Conference on Machine Learning (pp. 172-180).

See Also

Core contextual classes: Bandit, Policy, Simulator, Agent, History, Plot

Bandit subclass examples: BasicBernoulliBandit, ContextualLogitBandit, OfflineBootstrappedReplayBandit

Policy subclass examples: EpsilonGreedyPolicy, ContextualLinTSPolicy

Examples

# NOT RUN {
library(contextual)
library(data.table)

# Import personalization data-set

url         <- "http://d1ie9wlkzugsxr.cloudfront.net/data_cmab_basic/dataset.txt"
datafile    <- fread(url)

simulations <- 1
horizon     <- nrow(datafile)

bandit      <- OfflineBootstrappedReplayBandit$new(formula = V2 ~ V1 | . - V1, data = datafile)

# Define agents.
agents      <- list(Agent$new(LinUCBDisjointOptimizedPolicy$new(0.01), bandit, "alpha = 0.01"),
                    Agent$new(LinUCBDisjointOptimizedPolicy$new(0.05), bandit, "alpha = 0.05"),
                    Agent$new(LinUCBDisjointOptimizedPolicy$new(0.1),  bandit, "alpha = 0.1"),
                    Agent$new(LinUCBDisjointOptimizedPolicy$new(1.0),  bandit, "alpha = 1.0"))

# Initialize the simulation.

simulation  <- Simulator$new(agents = agents, simulations = simulations, horizon = horizon,
                             do_parallel = FALSE, save_context = TRUE)

# Run the simulation.
sim  <- simulation$run()

# plot the results
plot(sim, type = "cumulative", regret = FALSE, rate = TRUE,
     legend_position = "bottomright", ylim = c(0,1))


# }
