samplesize: FDR as a function of sample size

Description

This function tabulates the false discovery rate (FDR) for selecting differentially expressed genes as a function of sample size and cutoff level. Optionally, the same information is also displayed as a plot.

Usage

samplesize(n = seq(5, 50, by = 5), p0 = 0.99, sigma = 1, D, F0, F1,
           paired = FALSE, crit, crit.style = c("top percentage", "cutoff"),
           plot = TRUE, local.show = FALSE, nplot = 100, ylim = c(0, 1), main,
           legend.show = FALSE, grid.show = FALSE, ...)

Arguments

n

sample size (as subjects per group)

p0

the proportion of non-differentially expressed genes

sigma

the standard deviation for the log expression values

D

assumed average log fold change (in units of sigma), by default 1; this is a shortcut for specifying a simple symmetrical alternative hypothesis through F1.

F0

the distribution of the log2 expression values under the null hypothesis; by default, this is normal with mean zero and standard deviation sigma, but mixtures of normals can be specified, see Details and Examples.

F1

the distribution of the log2 expression values under the alternative hypothesis; by default, this is an equal mixture of two normals with means D and -D and standard deviation sigma; mixtures of normals are again possible, see Details and Examples.

paired

logical value indicating whether this is the independent sample case (default) or the paired sample case.

crit

a vector of cutoff values for selecting differentially expressed genes; the interpretation depends on crit.style.

crit.style

indicates how differentially expressed genes are selected: either by a fixed cutoff level for the absolute value of the t-statistic or as a fixed percentage of the absolute largest t-statistics.

plot

logical value indicating whether to draw the plot

local.show

logical value indicating whether to show local or global false discovery rate (default: global).

nplot

number of points that are evaluated for the curves

ylim

the usual limits on the vertical axis

main

the main title of the plot

legend.show

logical value indicating whether to show a legend for the types of gene selection in the plot

grid.show

logical value indicating whether to draw grid lines showing the sample sizes n to be tabulated in the plot

...

the usual graphical parameters, passed to plot

Value

A matrix with rows corresponding to elements of n and columns corresponding to the specified critical values is returned. The matrix has the attribute param that contains the specified arguments, see Examples.

Details

This function plots the FDR as a function of the sample size when comparing the expression of multiple genes between two groups of subjects. The underlying model assumes that a proportion p0 of genes is not differentially expressed (regulated) between groups, and that the remaining proportion 1-p0 is. The log-transformed expression values of regulated and non-regulated genes are assumed to be generated by mixtures of normal distributions; these mixtures can be specified through the parameters F0, F1 or D, and sigma; see TOC for details on the model and the specification of the mixtures. By default, the null distribution of the log expression values is a normal centered on zero, and the alternative is an equal mixture of normals centered at +D and -D.
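The default model can be illustrated with a small simulation in base R. This is only a sketch under stated assumptions, not how OCplus computes the FDR: the number of genes `m`, the seed, and all variable names are hypothetical, and an equal-variance two-sample t-statistic is assumed for the independent sample case.

```r
# Hypothetical simulation of the default model; not OCplus code.
set.seed(1)
m     <- 2000   # number of genes (assumed for illustration)
n     <- 10     # subjects per group
p0    <- 0.95   # proportion of non-regulated genes
D     <- 1      # mean log fold change in units of sigma
sigma <- 1      # standard deviation of the log expression values

# TRUE for regulated genes; the realised share varies around 1 - p0
regulated <- runif(m) > p0

# Regulated genes are shifted by +D or -D (in units of sigma) with equal probability
delta <- ifelse(regulated, sample(c(-D, D), m, replace = TRUE), 0) * sigma

# Log expression values: genes in rows, subjects in columns
group1 <- matrix(rnorm(m * n, mean = 0,     sd = sigma), nrow = m)
group2 <- matrix(rnorm(m * n, mean = delta, sd = sigma), nrow = m)

# Equal-variance two-sample t-statistic per gene
mean_diff  <- rowMeans(group2) - rowMeans(group1)
pooled_var <- (apply(group1, 1, var) + apply(group2, 1, var)) / 2
tstat      <- mean_diff / sqrt(pooled_var * 2 / n)
```

Regulated genes should show larger absolute t-statistics on average, which is what makes selection by t-statistic informative; `tapply(abs(tstat), regulated, mean)` gives a quick check.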

The list of nominally differentially expressed genes can be selected in two ways:

all genes with absolute t-statistic larger than the specified critical cutoff values (cutoff),
all genes that represent the specified critical top percentage of the absolutely largest t-statistics (top percentage).

Multiple critical values correspond to multiple curves, each labeled by the critical value, but only one value can be specified for the proportion of non-regulated genes p0 and the standard deviation sigma.
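The two selection rules can be sketched on simulated t-statistics as follows. This is an illustration under assumptions, not the package's implementation: the helper names `select_cutoff`, `select_top` and `empirical_fdr` are hypothetical, `crit` for the top percentage is taken as a fraction (0.02 for 2 percent, as in the Examples), and the t-statistics are drawn directly from central and noncentral t-distributions for the independent sample case.

```r
# Hypothetical helpers illustrating the two selection styles; not OCplus code.
set.seed(2)
m  <- 5000; n <- 10; p0 <- 0.95; D <- 1
df <- 2 * n - 2

# Which genes are truly regulated, and their noncentrality (+/- D * sqrt(n / 2))
regulated <- runif(m) > p0
ncp <- ifelse(regulated, sample(c(-1, 1), m, replace = TRUE), 0) * D * sqrt(n / 2)

# Noncentral t by definition: shifted normal over scaled chi-square
tstat <- rnorm(m, mean = ncp) / sqrt(rchisq(m, df) / df)

# "cutoff": keep genes with |t| above a fixed value
select_cutoff <- function(tstat, crit) abs(tstat) > crit

# "top percentage": keep the given fraction of genes with the largest |t|
select_top <- function(tstat, crit) {
  k <- ceiling(crit * length(tstat))
  rank(-abs(tstat), ties.method = "first") <= k
}

# Empirical FDR: share of selected genes that are truly non-regulated
empirical_fdr <- function(selected, regulated) {
  if (!any(selected)) return(NA_real_)
  mean(!regulated[selected])
}

fdr_cutoff <- empirical_fdr(select_cutoff(tstat, 3), regulated)
fdr_top    <- empirical_fdr(select_top(tstat, 0.02), regulated)
```

Repeating this over several values of `n` gives a simulated counterpart to one curve in the plot, with one curve per critical value.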

References

Pawitan Y, Michiels S, Koscielny S, Gusnanto A, Ploner A (2005) False Discovery Rate, Sensitivity and Sample Size for Microarray Studies. Bioinformatics, 21, 3017-3024.

Jung SH (2005) Sample size for FDR-control in microarray data analysis. Bioinformatics, 21, 3097-3104.

Examples


# Default assumes a proportion of 0.01 regulated genes equally split
# between two-fold up- and down-regulated
# We select the top 3, 2, and 1 percent of genes by absolute t-statistic
samplesize(crit=c(0.03,0.02, 0.01))

# Same model, but using a hard cutoff for the t-statistics
samplesize(crit=2:4, crit.style="cutoff")

# Paired test of the same size has slightly better FDR (as expected)
samplesize(paired=TRUE)

# Compare the effect of p0 and effect size
par(mfrow=c(2,2))
samplesize(crit=c(0.03,0.02, 0.01), p0=0.95, D=1)
samplesize(crit=c(0.03,0.02, 0.01), p0=0.99, D=1)
samplesize(crit=c(0.03,0.02, 0.01), p0=0.95, D=2)
samplesize(crit=c(0.03,0.02, 0.01), p0=0.99, D=2)

# An asymmetric alternative distribution: 20 percent of the regulated genes 
# are expected to be (at least) four-fold up regulated
# NB, no graphical output
ret <- samplesize(F1=list(D=c(-1,1,2), p=c(2,2,1)), p0=0.95, crit=0.05, plot=FALSE)
ret
# Look at the parameters
attr(ret, "param")
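The "20 percent" and "four-fold" figures in the comment above follow from simple arithmetic. The sketch below assumes that the entries of `p` are relative weights that are normalised to sum to one (which is what the comment implies) and that `D` is on the log2 scale with sigma = 1; the variable names are hypothetical.

```r
# Mixture components of the asymmetric alternative used above
D <- c(-1, 1, 2)   # component means on the log2 scale
p <- c(2, 2, 1)    # relative weights (assumed to be normalised)

weights     <- p / sum(p)   # mixing proportions among regulated genes
fold_change <- 2^D          # fold change corresponding to each mean

data.frame(D = D, weight = weights, fold_change = fold_change)
```

The component with D = 2 carries weight 1/5 and corresponds to a four-fold up-regulation.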

# A wide null distribution that allows genes with small effects to be disregarded
# Here: |log2 fold change| < 0.25, i.e. a fold change of less than 19 percent
samplesize(F0=list(D=c(-0.25,0,0.25)), grid=TRUE)

# This is close to Example 3 in Jung's paper (see References):
# p0=0.99 and sensitivity=0.6, so we want a rejection rate of 
# around 0.006 from the top list.
# Here we require around 40 arrays/group, compared to 
# around 37 in Jung's paper, most likely because we use 
# the t-distribution instead of normal. Jung's alternative 
# is only one-sided, so the exact correspondence is:
samplesize(p0=0.99, crit.style="top", crit=0.006, F1=list(D=1, p=1), grid=TRUE)
abline(h=0.01)

# The result is very close to that for the symmetric alternative:
samplesize(p0=0.99,crit=0.006, D=1, grid=TRUE, ylim=c(0,0.9))
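The rejection rate of around 0.006 used in the comparison with Jung's example can be reproduced by hand. In this sketch the target FDR of 0.01 is an assumption, read off the horizontal line drawn with `abline(h=0.01)`; the variable names are hypothetical.

```r
# Worked arithmetic for the top-list fraction in the Jung comparison
p0          <- 0.99   # proportion of non-regulated genes
sensitivity <- 0.6    # share of regulated genes that should be detected
target_fdr  <- 0.01   # assumed target FDR (the reference line in the plot)

# Fraction of all genes that are regulated and detected
true_positive_rate <- (1 - p0) * sensitivity

# Total fraction of genes on the top list once false positives are included
rejection_rate <- true_positive_rate / (1 - target_fdr)

round(c(true_positive_rate = true_positive_rate,
        rejection_rate     = rejection_rate), 4)
```

At a small target FDR the two quantities nearly coincide, which is why `crit=0.006` serves as the top-list fraction.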
