gps: Empirical Bayes Gamma-Poisson Shrinker

Description

Test on device-events using William DuMouchel's Empirical Bayes Gamma-Poisson Shrinker. From the family of disproportionality analyses (DPA) used to generate signals of disproportionate reporting (SDRs).

Usage

gps(df, ...)
# S3 method for mds_ts
gps(df, ts_event = c(Count = "nA"), analysis_of = NA, ...)
# S3 method for default
gps(
  df,
  analysis_of = NA,
  eval_period = 1,
  null_ratio = 1,
  cred_interval = 0.9,
  init_prior = c(0.2, 0.02, 2, 4, 1/3),
  gamma_lower = 1e-05,
  gamma_upper = 20,
  quantiles = c(0.05, 0.95),
  cont_adj = 0,
  ...
)

Arguments

Required input data frame of class mds_ts or, for generic usage, any data frame with the following columns:

time: Unique times of class Date
nA: Cell A count (class numeric) of the 2x2 table: device/event of interest.
nB: Cell B count (class numeric) of the 2x2 table: device/non-event of interest.
nC: Cell C count (class numeric) of the 2x2 table: non-device/event of interest.
nD: Cell D count (class numeric) of the 2x2 table: non-device/non-event of interest.

...

Further arguments passed onto gps methods

ts_event

Required if df is of class mds_ts. Named string indicating the variable corresponding to the event count (cell A in the 2x2 contingency table). In most cases, the default is the appropriate setting. See details for alternative options.

Default: c("Count"="nA") corresponding to the event count column in mds_ts objects. Name is generated from mds_ts metadata.

analysis_of

Optional string indicating the English description of what was analyzed. If specified, this will override the name of the ts_event string parameter.

Default: NA indicates no English description for plain df data frames, or ts_event English description for df data frames of class mds_ts.

Example: "Count of bone cement leakages"

eval_period

Required positive integer indicating the number of unique times counting in reverse chronological order to sum over to create the 2x2 contingency table.

Default: 1 considers only the most recent time in df.

Example: 12 sums over the last 12 time periods to create the 2x2 contingency table.

null_ratio

Numeric value representing the null relative reporting ratio (RR), used with cred_interval to establish the signal status. This null_ratio is saved in the output as the signal threshold. See details for more.

Default: 1 indicates a null RR of 1 and tests if the lower bound of the cred_interval exceeds 1.

cred_interval

Numeric value between 0 and 1 representing the width of the Bayesian posterior credible interval, where the lower bound of the interval is assessed against the null_ratio. The interval bounds are returned as the lcl and ucl. See details for more.

Default: 0.90 indicates a 90% credible interval with bounds at 5% and 95%. The signal test is against the lower 5% bound, effectively creating the EB05 test.

init_prior

A numeric vector of length 5 representing the initialization parameters for the prior gamma mixture distribution in this order: alpha1, beta1, alpha2, beta2, p. See details for more.

Default: c(.2, .02, 2, 4, 1/3) as suggested in openEBGM package v0.7.0.

gamma_lower

Positive mumeric value representing the lower bound for the two alphas and betas of the prior during PORT optimization.

Default: 1e-5 is a value suggested in openEBGM package v0.7.0.

gamma_upper

Positive mumeric value representing the upper bound for the two alphas and betas of the prior during PORT optimization.

Default: 20 is a value suggested in openEBGM package v0.7.0.

quantiles

Vector of quantiles between 0 and 1. gps() will return an equal length vector of estimated empirical Bayes quantiles from the posterior distribution. Specify quantiles=NULL if no quantiles are desired.

Default: c(.05, .95) corresponds to the 5% (EB05) and 95% (EB95) quantiles.

cont_adj

Positive integer representing the continuity adjustment to be added to each cell of the 2x2 contingency table. A value greater than 0 allows for contingency tables with 0 cells to run the algorithm. Adding a continuity adjustment will adversely affect the algorithm estimates, user discretion is advised. See details for more.

Default: 0 adds zero to each cell, thus an unadjusted table.

Value

A named list of class mdsstat_test object, as follows:

test_name: Name of the test run
analysis_of: English description of what was analyzed
status: Named boolean of whether the test was run. The name contains the run status.
result: A standardized list of test run results: statistic for the test statistic, lcl and ucl for the set confidence bounds, p for the p-value, signal status, and signal_threshold.
params: The test parameters
data: The data on which the test was run

Methods (by class)

mds_ts: GPS on mds_ts data
default: GPS on general data

Details

null_ratio and cred_interval are used together to establish the signal criteria. The null_ratio is conceptually similar to the relative reporting ratio under a null hypothesis of no signal. Common values are 1 and, more conservatively (fewer false signals), 2. The cred_interval is the posterior credibility interval used to test for a signal. A value of 0.90 returns the 5 tests if the lower bound exceeds null_ratio. Effectively, cred_interval=0.90 conducts the well-known EB05 test.

init_prior specifies the initial guess for the 5 parameters of the prior gamma mixture distribution as described in DuMouchel (1999, Eqs. 4, 7) in the sequence: alpha1, beta1, alpha2, beta2, p. gamma_lower specifies the optimization lower bound for the two alphas and betas. gamma_upper specifies similarly the upper bound. The initial guess, upper and lower bounds are fed into PORT optimization using the stats::nlminb() routine.

cont_adj provides the option to allow gps() to proceed running, however this is done at the user's discretion because there are adverse effects of adding a positive integer to every cell of the contingency table. By default, gps() runs with 0 in the C cell only, but not in A, B, or D. It has been suggested that 0.5 may be an appropriate value. However, values <1 have been shown to be unstable using box-constrained PORT optimization, which is the only optimization considered in this release. Overall, posterior distribution estimates have been shown to be unstable with very low or 0 count cells.

For parameter ts_event, in the uncommon case where the device-event count (Cell A) variable is not "nA", the name of the variable may be specified here. Note that the remaining 3 cells of the 2x2 contingency table (Cells B, C, D) must be the variables "nB", "nC", and "nD" respectively in df. A named character vector may be used where the name is the English description of what was analyzed. Note that if the parameter analysis_of is specified, it will override this name. Example: ts_event=c("Count of Bone Cement Leakages"="event_count")

References

DuMouchel W. Bayesian data mining in large frequency tables, with an application to the FDA spontaneous reporting system. The American Statistician, 53(3):177-190, August 1999.

Ahmed I, Poncet A. PhViD: PharmacoVigilance Signal Detection, 2016. R package version 1.0.8.

Ihrie J, Canida T. openEBGM: EBGM Scores for Mining Large Contingency Tables, 2018. R package version 0.7.0.

Examples

Run this code

# NOT RUN {
# Basic Example
data <- data.frame(time=c(1:25),
                   nA=as.integer(stats::rnorm(25, 25, 5)),
                   nB=as.integer(stats::rnorm(25, 50, 5)),
                   nC=as.integer(stats::rnorm(25, 100, 25)),
                   nD=as.integer(stats::rnorm(25, 200, 25)))
a1 <- gps(data)
# Example using an mds_ts object
a2 <- gps(mds_ts[[3]])

# }

Run the code above in your browser using DataLab