Learn R Programming

contextual (version 0.9.8.4)

GittinsBrezziLaiPolicy: Policy: Gittins Approximation algorithm for choosing arms in a MAB problem.

Description

GittinsBrezziLaiPolicy Algorithm based on Brezzi and Lai (2002) "Optimal learning and experimentation in bandit problems."

Usage

  policy <- GittinsBrezziLaiPolicy$new(discount=0.95, prior=NULL)

Arguments

discount

numeric; discount factor

prior

numeric matrix; prior beliefs over Bernoulli parameters governing each arm. Beliefs are specified by Beta distribution with two parameters (alpha,beta) where alpha = number of success, beta = number of failures. Matrix is of arms times two (alpha / beta) dimensions

Methods

new(discount=0.95, prior=NULL)

Generates and initializes a new Policy object.

get_action(t, context)

arguments:

  • t: integer, time step t.

  • context: list, containing the current context$X (d x k context matrix), context$k (number of arms) and context$d (number of context features)

computes which arm to play based on the current values in named list theta and the current context. Returns a named list containing action$choice, which holds the index of the arm to play.

set_reward(t, context, action, reward)

arguments:

  • t: integer, time step t.

  • context: list, containing the current context$X (d x k context matrix), context$k (number of arms) and context$d (number of context features) (as set by bandit).

  • action: list, containing action$choice (as set by policy).

  • reward: list, containing reward$reward and, if available, reward$optimal (as set by bandit).

utilizes the above arguments to update and return the set of parameters in list theta.

set_parameters()

Helper function, called during a Policy's initialisation, assigns the values it finds in list self$theta_to_arms to each of the Policy's k arms. The parameters defined here can then be accessed by arm index in the following way: theta[[index_of_arm]]$parameter_name.

Details

The algorithm provides an approximation of the Gittins index, by specifying a closed-form expression, which is a function of the discount factor, and the number of successes and failures associated with each arm.

References

Brezzi, M., & Lai, T. L. (2002). Optimal learning and experimentation in bandit problems. Journal of Economic Dynamics and Control, 27(1), 87-108.

Implementation follows https://github.com/elarry/bandit-algorithms-simulated

See Also

Core contextual classes: Bandit, Policy, Simulator, Agent, History, Plot

Bandit subclass examples: BasicBernoulliBandit, ContextualLogitBandit, OfflineReplayEvaluatorBandit

Policy subclass examples: EpsilonGreedyPolicy, ContextualLinTSPolicy