TOC: Theoretical FDR and sensitivity as a function of cutoff level

Description

Computes and plots the operating characteristics for a two-group microarray experiment based on a theoretical model. The false discovery rate (FDR) is plotted against the cutoff level on the t-statistic. Optionally, curves for the classical significance level and sensitivity can be added. Curves for different proportions of non-differentially expressed genes can be compared in the same plot, and the sample size per group can be varied between plots.

Usage

TOC(n = 10, p0 = 0.95, sigma = 1, D, F0, F1, n1 = n, n2 = n, paired = FALSE,
    plot = TRUE, local.show = FALSE, alpha.show = TRUE, sensitivity.show = TRUE,
    nplot = 100, xlim, ylim = c(0, 1), main, legend.show = FALSE, ...)

Arguments

n, n1, n2

number of samples per group, by default equal and specified via n, but can be set to different values via n1 and n2.

p0

the proportion of genes that are not differentially expressed; may be vector valued

sigma

the standard deviation for the log expression values

D

assumed average log fold change (in units of sigma), by default 1; this is a shortcut for specifying a simple symmetrical alternative hypothesis through F1.

F0

the distribution of the log2 expression values under the null hypothesis; by default, this is normal with mean zero and standard deviation sigma, but mixtures of normals can be specified, see Details and Examples.

F1

the distribution of the log2 expression values under the alternative hypothesis; by default, this is an equal mixture of two normals with means D and -D and standard deviation sigma; mixtures of normals are again possible, see Details and Examples.

paired

logical value indicating whether two distinct groups of observations or one group of paired observations are studied.

plot

logical value indicating whether the results should be plotted.

local.show

logical value indicating whether to show local or global false discovery rate (default: global).

alpha.show

logical value indicating whether to show the classical significance level for testing one hypothesis as a function of the cutoff level.

sensitivity.show

logical value indicating whether to show the classical sensitivity for testing one hypothesis as a function of the cutoff level.

nplot

number of points that are evaluated for the curves

xlim

the usual limits on the horizontal axis

ylim

the usual limits on the vertical axis

main

the main title of the plot

legend.show

logical value indicating whether to show a legend for the different types of curves in the plot.

...

the usual graphical parameters, passed to plot

Value

This function returns invisibly a data frame with nplot rows whose columns contain the information for the individual curves. The number of columns and their names will depend on the number and value of the p0 specified, and whether alpha and sensitivity are displayed. Additionally, the returned data frame has an attribute param, which is a list with all the non-plotting arguments to the function.

Details

This function plots the FDR as a function of the cutoff level when comparing the expression of multiple genes between two groups of subjects. We study a gene selection mechanism that declares all genes to be differentially expressed whose t-statistics have an absolute value greater than a specified cutoff value. The comparison is based on a two-sample t-statistic for equal variances, for either paired or unpaired observations.

The underlying model assumes that a proportion p0 of genes are not differentially expressed between groups, and that the remaining proportion 1-p0 are. The log-transformed gene expression values are assumed to be generated by mixtures of normal distributions. Both the null and the alternative hypotheses are specified through the means of the respective mixture components; these means can be interpreted as average log2 fold changes in units of the standard deviation sigma.

Note that the model does not assume that all genes have the same standard deviation sigma, only that the mean log2 fold change for all regulated genes is proportional to their individual variability (standard deviation). sigma generally does not need to be specified explicitly and can be left at its default value of one, so that D can be interpreted directly as the log2 fold change between groups.

The default null distribution of the log2 expression values is a single normal distribution with mean zero (and standard deviation sigma); the default alternative distribution is an equal mixture of two normals with means D and -D (and again standard deviation sigma). However, general mixtures of normals can be specified for both the null and the alternative distribution through F0 and F1, respectively. Both are lists with two elements:

D is the vector of means (i.e. log2 fold changes),
p is the vector of mixing proportions for the means.

If present, p must be the same length as D; its elements do not need to be normalized, i.e. sum to one; if absent, equal mixing is assumed, see Examples. A wide (mixture) null hypothesis, or an empirical null hypothesis as outlined by Efron (2004), can be used if genes with log fold changes close to zero are thought to be of no biological interest, and are counted as effectively not regulated. Similarly, the alternative hypothesis can be any mixture of large and small effects, symmetric or non-symmetric, depending on the expected regulation patterns, see Examples.
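The handling of the mixing proportions described above can be sketched in a few lines of base R. The helper normalize_mixture is hypothetical and only illustrates the stated rules (equal mixing when p is absent, rescaling so that p sums to one); it is not the package's internal code.

```r
# Hypothetical helper, not part of the package: fill in and normalize the
# mixing proportions of a mixture specification list(D = ..., p = ...).
normalize_mixture <- function(mix) {
  # Absent p: assume equal mixing over all means in D
  if (is.null(mix$p)) mix$p <- rep(1, length(mix$D))
  # p must have one entry per mean
  stopifnot(length(mix$p) == length(mix$D))
  # Rescale so that the proportions sum to one
  mix$p <- mix$p / sum(mix$p)
  mix
}

# Wide null hypothesis with three equally weighted components
F0 <- normalize_mixture(list(D = c(-0.25, 0, 0.25)))

# Asymmetric alternative with unnormalized weights 1:2:2
F1 <- normalize_mixture(list(D = c(-2, -1, 1), p = c(1, 2, 2)))
F1$p  # 0.2 0.4 0.4
```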

As a consequence, both the null distribution of the t-statistics (for the unregulated genes) and their alternative distribution (for the regulated genes) are mixtures of (generally non-central) t-distributions, see FDR.
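As a rough illustration of this calculation, the global FDR for the default model can be approximated with the non-central t-distribution in base R. The helpers tail_prob and fdr_sketch are hypothetical names, and the noncentrality parameter D / sqrt(1/n1 + 1/n2) with n1 + n2 - 2 degrees of freedom is an assumption for the unpaired case; the values are not guaranteed to match the output of TOC.

```r
# Two-sided tail probability P(|T| > cutoff) under a mixture of non-central
# t-distributions (hypothetical helper, not from the package).
tail_prob <- function(cutoff, df, ncp, p = rep(1, length(ncp))) {
  p <- p / sum(p)  # normalize the mixing proportions
  per_component <- sapply(ncp, function(d) {
    pt(-cutoff, df, ncp = d) + pt(cutoff, df, ncp = d, lower.tail = FALSE)
  })
  # One row per cutoff, one column per mixture component
  per_component <- matrix(per_component, nrow = length(cutoff))
  as.vector(per_component %*% p)
}

# Global FDR at the given cutoffs for the default null and alternative
fdr_sketch <- function(cutoff, n1 = 10, n2 = n1, p0 = 0.95, D = 1) {
  df <- n1 + n2 - 2                    # assumed: unpaired, equal variances
  scale <- 1 / sqrt(1 / n1 + 1 / n2)   # assumed: ncp = mean * scale
  a0 <- tail_prob(cutoff, df, ncp = 0)                 # null: mean zero
  a1 <- tail_prob(cutoff, df, ncp = c(-D, D) * scale)  # equal mix of -D and D
  p0 * a0 / (p0 * a0 + (1 - p0) * a1)
}

cutoffs <- seq(0, 6, length.out = 100)
fdr <- fdr_sketch(cutoffs)
```

At a cutoff of zero every gene is selected, so the sketch returns an FDR equal to p0; the FDR then falls as the cutoff becomes more stringent.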

Sample size n and standard deviation sigma are atomic values, but multiple p0 can be specified, resulting in multiple curves. Additionally, the usual significance level and sensitivity for a classical test of a single hypothesis can be displayed.
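The classical significance level and sensitivity at a given cutoff can be sketched in the same way. The function oc_sketch below is a hypothetical illustration for the default unpaired model, again assuming a noncentrality parameter of D / sqrt(1/n1 + 1/n2); it is not the package's implementation.

```r
# Hypothetical sketch: significance level and sensitivity of a two-sided
# t-test with rejection region |T| > cutoff, default unpaired model.
oc_sketch <- function(cutoff, n1 = 10, n2 = n1, D = 1) {
  df <- n1 + n2 - 2
  ncp <- D / sqrt(1 / n1 + 1 / n2)  # assumed noncentrality for mean D
  two_sided <- function(d) {
    pt(-cutoff, df, ncp = d) + pt(cutoff, df, ncp = d, lower.tail = FALSE)
  }
  data.frame(
    cutoff = cutoff,
    # P(|T| > cutoff) for a gene that is not differentially expressed
    alpha = 2 * pt(cutoff, df, lower.tail = FALSE),
    # P(|T| > cutoff) for a regulated gene: equal mixture of -D and D
    sensitivity = 0.5 * two_sided(-ncp) + 0.5 * two_sided(ncp)
  )
}

# Cutoffs: zero, the usual two-sided 5% critical value for 18 df, and 4
oc <- oc_sketch(c(0, qt(0.975, 18), 4))
```

Neither quantity depends on p0, which is why a single alpha curve and a single sensitivity curve suffice even when several p0 values are plotted.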

References

Pawitan Y, Michiels S, Koscielny S, Gusnanto A, Ploner A. (2005) False Discovery Rate, Sensitivity and Sample Size for Microarray Studies. Bioinformatics, 21, 3017-3024.

Efron, B. (2004) Large-Scale Simultaneous Hypothesis Testing: The Choice of a Null Hypothesis. JASA, 99, 96-104.

Examples

# Default null and alternative distributions, assuming different proportions
# of regulated genes
TOC(p0=c(0.90, 0.95, 0.99), legend.show=TRUE)

# The effect of sample size and effect size
par(mfrow=c(2,2))
TOC(p0=c(0.90, 0.95, 0.99), n=5, D=1)
TOC(p0=c(0.90, 0.95, 0.99), n=30, D=1)
TOC(p0=c(0.90, 0.95, 0.99), n=5, D=2)
TOC(p0=c(0.90, 0.95, 0.99), n=30, D=2)

# A wide null distribution that allows genes of small effect to be disregarded;
# leaving p unspecified means equal mixing proportions
ret = TOC(F0=list(D=c(-0.25,0,0.25)), main="Wide F0")
attr(ret,"param")$F0 # the null hypothesis

# An extended (and asymmetric) alternative
ret = TOC(F1=list(D=c(-2,-1,1), p=c(1,2,2)), p0=0.95, main="Unsymmetric F1")
attr(ret,"param")$F1 # F1$p is normalized

# Unequal sample sizes
TOC(n1=10, n2=30)

# Curves for a paired t-test
TOC(paired=TRUE)

# The output contains all the x- and y-coordinates
ret = TOC(p0=c(0.90, 0.95, 0.99), main="Default settings")
dim(ret)
colnames(ret)
ret[1:10,]
# Additionally, the list of arguments that determine the experiment
attr(ret,"param")
