ebnm: Solve the EBNM problem

Description

Solves the empirical Bayes normal means (EBNM) problem using a specified family of priors. For an article-length introduction to the package, see Willwerscheid and Stephens (2021), cited in References below.

Usage

ebnm(
  x,
  s = 1,
  prior_family = c("point_normal", "point_laplace", "point_exponential", "normal",
    "horseshoe", "normal_scale_mixture", "unimodal", "unimodal_symmetric",
    "unimodal_nonnegative", "unimodal_nonpositive", "generalized_binary", "npmle",
    "deconvolver", "flat", "point_mass", "ash"),
  mode = 0,
  scale = "estimate",
  g_init = NULL,
  fix_g = FALSE,
  output = ebnm_output_default(),
  optmethod = NULL,
  control = NULL,
  ...
)
ebnm_output_default()
ebnm_output_all()

Value

An ebnm object. Depending on the argument to output, the object is a list containing elements:

data: A data frame containing the observations x and standard errors s.
posterior: A data frame of summary results (posterior means, standard deviations, second moments, and local false sign rates).
fitted_g: The fitted prior $\hat{g}$ (an object of class normalmix, laplacemix, gammamix, unimix, tnormalmix, or horseshoe).
log_likelihood: The optimal log likelihood attained, $L(\hat{g})$.
posterior_sampler: A function that can be used to produce samples from the posterior. For all prior families other than the horseshoe, the sampler takes a single parameter nsamp, the number of posterior samples to return per observation. Since ebnm_horseshoe returns an MCMC sampler, it additionally takes parameter burn, the number of burn-in samples to discard.

S3 methods coef, confint, fitted, logLik,

nobs, plot, predict, print, quantile,

residuals, simulate, summary, and vcov

have been implemented for ebnm objects. For details, see the respective help pages, linked below under See Also.

Arguments

x

A vector of observations. Missing observations (NAs) are not allowed.

s

A vector of standard errors (or a scalar if all are equal). Standard errors may not be exactly zero, and missing standard errors are not allowed. Two prior families have additional restrictions: when horseshoe priors are used, errors must be homoskedastic; and since function deconv in package deconvolveR takes $z$-scores, the "deconvolver" family requires that all standard errors be equal to 1.

prior_family

A character string that specifies the prior family $G$. See Details below.

mode

A scalar specifying the mode of the prior $g$ or "estimate" if the mode is to be estimated from the data. This parameter is ignored by the NPMLE, the deconvolveR family, and the improper uniform (or "flat") prior. For generalized binary priors, which are bimodal, the mode parameter specifies the mode of the truncated normal component (the location of the point mass is fixed at zero).

scale

A scalar or vector specifying the scale parameter(s) of the prior or "estimate" if the scale parameters are to be estimated from the data. This parameter is ignored by the flat prior and the family of point mass priors.

The interpretation of scale depends on the prior family. For normal and point-normal families, it is a scalar specifying the standard deviation of the normal component. For point-Laplace and point-exponential families, it is a scalar specifying the scale parameter of the Laplace or exponential component. For the horseshoe family, it corresponds to $s\tau$ in the usual parametrization of the horseshoe distribution. For the family of generalized binary priors, it specifies the ratio of the (untruncated) standard deviation of the normal component to its mode. This ratio must be fixed in advance (i.e., argument "estimate" is unavailable for generalized binary priors). For the NPMLE and deconvolveR prior family, scale is a scalar specifying the distance between successive means in the grid of point masses or normal distributions used to estimate $g$. For all other prior families, which are implemented using the function ash in package ashr, it is a vector specifying the parameter mixsd to be passed to ash or "estimate" if mixsd is to be chosen by ebnm. (Note that ebnm chooses mixsd differently from ash: see functions ebnm_scale_normalmix, ebnm_scale_unimix, and ebnm_scale_npmle for details. To use the ash grid, set scale = "estimate" and pass in gridmult as an additional parameter. See ash for defaults and details.)

g_init

The prior distribution $g$. Usually this is left unspecified (NULL) and estimated from the data. However, it can be used in conjuction with fix_g = TRUE to fix the prior (useful, for example, to do computations with the "true" $g$ in simulations). If g_init is specified but fix_g = FALSE, g_init specifies the initial value of $g$ used during optimization. For non-parametric priors, this has the side effect of fixing the mode and scale parameters. If g_init is supplied, it should be an object of class normalmix for normal, point-normal, scale mixture of normals, and deconvolveR prior families, as well as for the NPMLE; class laplacemix for point-Laplace families; class gammamix for point-exponential families; class horseshoe for horseshoe families; class unimix for unimodal_ families; or class tnormalmix for generalized binary priors. An object of class ebnm can also be supplied as argument, provided that field fitted_g contains a prior of the correct class (see Examples below).

fix_g

If TRUE, fix the prior $g$ at g_init instead of estimating it.

output

A character vector indicating which values are to be returned. Function ebnm_output_default() provides the default return values, while ebnm_output_all() lists all possible return values. See Value below.

optmethod

A string specifying which optimization function is to be used. Since all non-parametric families rely upon external packages, this parameter is only available for parametric families (point-normal, point-Laplace, point-exponential, and normal). Options include "nlm", "lbfgsb" (which calls optim with method = "L-BFGS-B"), and "trust" (which calls into package trust). Other options are "nohess_nlm", "nograd_nlm", and "nograd_lbfgsb", which use numerical approximations rather than exact expressions for the Hessian and (for the latter two) the gradient. The default option is "nohess_nlm".

control

A list of control parameters to be passed to the optimization function. optimize is used for normal and horseshoe prior families, while nlm is used for parametric families unless parameter optmethod specifies otherwise. nlm is also used for the deconvolveR prior family. For ash families (including scale mixtures of normals, the NPMLE, and all unimodal_ families), function mixsqp in package mixsqp is the default. For generalized binary priors, function optim is used with method = "L-BFGS-B".

...

Additional parameters. When a unimodal_ prior family is used, these parameters are passed to function ash in package ashr. Although it does not call into ashr, the scale mixture of normals family accepts parameter gridmult for purposes of comparison. When gridmult is set, an ashr-style grid will be used instead of the default ebnm grid. When the "deconvolver" family is used, additional parameters are passed to function deconv in package deconvolveR. Families of generalized binary priors take several additional parameters; see ebnm_generalized_binary. In all other cases, additional parameters are ignored.

Functions

ebnm_output_default(): Lists the default return values.
ebnm_output_all(): Lists all valid return values.

Details

Given vectors of data x and standard errors s, ebnm solves the "empirical Bayes normal means" (EBNM) problem for various choices of prior family. The model is $$x_j | \theta_j, s_j \sim N(\theta_j, s_j^2)$$ $$\theta_j \sim g \in G,$$ where $g$, which is referred to as the "prior distribution" for $\theta$, is to be estimated from among some specified family of prior distributions $G$. Several options for $G$ are implemented, some parametric and others non-parametric; see below for examples.

Solving the EBNM problem involves two steps. First, $g \in G$ is estimated via maximum marginal likelihood: $$\hat{g} := \arg\max_{g \in G} L(g),$$ where $$L(g) := \prod_j \int p(x_j | \theta_j, s_j) g(d\theta_j).$$ Second, posterior distributions $p(\theta_j | x_j, s_j, \hat{g})$ and/or summaries such as posterior means and posterior second moments are computed.

Implemented prior families include:

point_normal: The family of mixtures where one component is a point mass at $\mu$ and the other is a normal distribution centered at $\mu$.
point_laplace: The family of mixtures where one component is a point mass at $\mu$ and the other is a double-exponential distribution centered at $\mu$.
point_exponential: The family of mixtures where one component is a point mass at $\mu$ and the other is a (nonnegative) exponential distribution with mode $\mu$.
normal: The family of normal distributions.
horseshoe: The family of horseshoe distributions.
normal_scale_mixture: The family of scale mixtures of normals.
unimodal: The family of all unimodal distributions.
unimodal_symmetric: The family of symmetric unimodal distributions.
unimodal_nonnegative: The family of unimodal distributions with support constrained to be greater than the mode.
unimodal_nonpositive: The family of unimodal distributions with support constrained to be less than the mode.
generalized_binary: The family of mixtures where one component is a point mass at zero and the other is a truncated normal distribution with lower bound zero and nonzero mode. See Liu et al. (2023), cited in References below.
npmle: The family of all distributions.
deconvolver: A non-parametric exponential family with a natural spline basis. Like npmle, there is no unimodal assumption, but whereas npmle produces spiky estimates for $g$, deconvolver estimates are much more regular. See deconvolveR-package for details and references.
flat: The "non-informative" improper uniform prior, which yields posteriors $$\theta_j | x_j, s_j \sim N(x_j, s_j^2).$$
point_mass: The family of point masses $\delta_\mu$. Posteriors are point masses at $\mu$.
ash: Calls into function ash in package ashr. Can be used to make direct comparisons of ebnm and ashr implementations of prior families such as scale mixtures of normals and the NPMLE.

References

Jason Willwerscheid and Matthew Stephens (2021). ebnm: an R Package for solving the empirical Bayes normal means problem using a variety of prior families. arXiv:2110.00152.

Yusha Liu, Peter Carbonetto, Jason Willwerscheid, Scott A Oakes, Kay F Macleod, and Matthew Stephens (2023). Dissecting tumor transcriptional heterogeneity from single-cell RNA-seq data by generalized binary covariance decomposition. bioRxiv 2023.08.15.553436.

Examples

Run this code

theta <- c(rep(0, 100), rexp(100))
s <- 1
x <- theta + rnorm(200, 0, s)

# The following are equivalent:
pn.res <- ebnm(x, s, prior_family = "point_normal")
pn.res <- ebnm_point_normal(x, s)

# Inspect results:
logLik(pn.res)
plot(pn.res)

# Fix the scale parameter:
pl.res <- ebnm_point_laplace(x, s, scale = 1)

# Estimate the mode:
normal.res <- ebnm_normal(x, s, mode = "estimate")
plot(normal.res) # posterior means shrink to a value different from zero

# Use an initial g (this fixes mode and scale for ash priors):
normalmix.res <- ebnm_normal_scale_mixture(x, s, g_init = pn.res)

# Fix g and get different output (including a posterior sampler):
pn.res <- ebnm_point_normal(x, s, g_init = pn.res, fix_g = TRUE,
                            output = ebnm_output_all())

# Sample from the posterior:
pn.samp <- simulate(pn.res, nsim = 100)

# Quantiles and HPD confidence intervals can be obtained via sampling:
set.seed(1)
pn.quantiles <- quantile(pn.res, probs = c(0.1, 0.9))
pn.quantiles[1:5, ]
confint(pn.res, level = 0.8, parm = 1:5)

# Examples of usage of control parameter:
#  point_normal uses nlm:
pn.res <- ebnm_point_normal(x, s, control = list(print.level = 1))
#  unimodal uses mixsqp:
unimodal.res <- ebnm_unimodal(x, s, control = list(verbose = TRUE))

Run the code above in your browser using DataLab