
TCC (version 1.12.1)

simulateReadCounts: Generate simulation data from negative binomial (NB) distribution

Description

This function generates simulation data under arbitrarily defined experimental conditions.

Usage

simulateReadCounts(Ngene = 10000, PDEG = 0.20, DEG.assign = NULL, DEG.foldchange = NULL, replicates = NULL, group = NULL, fc.matrix = NULL)

Arguments

Ngene
numeric scalar specifying the number of genes.
PDEG
numeric scalar specifying the proportion of differentially expressed genes (DEGs).
DEG.assign
numeric vector specifying the proportions of DEGs up- or down-regulated in the individual groups being compared. Specifying replicates implies a single-factor experimental design; in that case the number of elements in DEG.assign should equal the number of elements in replicates. Specifying DEG.foldchange as a data frame together with group implies a multi-factor experimental design; in that case the number of elements in DEG.assign should equal the number of columns in DEG.foldchange.
DEG.foldchange
numeric vector for a single-factor experimental design, or data frame for a multi-factor experimental design. For a single-factor design, DEG.foldchange (as a numeric vector) and replicates must be specified together. The $i$-th element of the DEG.foldchange vector indicates the degree of fold-change for Group $i$. The default is DEG.foldchange = c(4, 4), indicating that the levels of DE are four-fold in both groups. For a multi-factor design, DEG.foldchange (as a data frame) and group must be specified together. Numeric values in the DEG.foldchange object indicate the degree of fold-change for individual conditions or factors.
replicates
numeric vector indicating the numbers of (biological) replicates for individual groups compared. Ignored if group is specified.
group
data frame specifying the multi-factor experimental design.
fc.matrix
fold-change matrix generated by makeFCMatrix for simulating DEGs whose fold-changes follow non-uniform distributions.
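
The length rules stated in the argument descriptions can be checked before calling the function. The following is a minimal sketch in base R. The helper check_deg_args is hypothetical and not part of TCC; it only mirrors the rules described above. The requirement that DEG.foldchange has one row per row of group is an assumption inferred from the two-factor example on this page.

```r
# Hypothetical helper (not part of TCC): checks the length rules described
# in the argument list. Returns TRUE when the shapes are consistent.
check_deg_args <- function(DEG.assign, DEG.foldchange,
                           replicates = NULL, group = NULL) {
  if (!is.null(group)) {
    # Multi-factor design: replicates is ignored when group is given.
    stopifnot(is.data.frame(group), is.data.frame(DEG.foldchange))
    # One element of DEG.assign per column of DEG.foldchange.
    # Assumption: one row of DEG.foldchange per sample (row of group).
    ok <- length(DEG.assign) == ncol(DEG.foldchange) &&
      nrow(DEG.foldchange) == nrow(group)
  } else {
    # Single-factor design: one element per group in each vector.
    stopifnot(is.numeric(replicates), is.numeric(DEG.foldchange))
    ok <- length(DEG.assign) == length(replicates) &&
      length(DEG.foldchange) == length(replicates)
  }
  ok
}

# Three groups with 2, 4, and 3 replicates
check_deg_args(DEG.assign = c(0.7, 0.2, 0.1),
               DEG.foldchange = c(3, 10, 6),
               replicates = c(2, 4, 3))
```

A check like this fails early with a plain TRUE/FALSE instead of relying on whatever error the simulation itself might raise.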

Value

A TCC-class object containing the following fields:
count
numeric matrix of simulated count data.
group
data frame indicating which group (or condition or factor) each sample belongs to.
norm.factors
numeric vector as a placeholder for normalization factors.
stat
list for storing results after the execution of the calcNormFactors (and estimateDE) function.
estimatedDEG
numeric vector as a placeholder indicating which genes are up-regulated in a particular group compared to the others. The values in this field are populated after the estimateDE function is executed.
simulation
list containing four fields: trueDEG, DEG.foldchange, PDEG, and params. The trueDEG field (numeric vector) stores information about DEGs: 0 for non-DEG, 1 for DEG up-regulated in Group 1, 2 for DEG up-regulated in Group 2, and so on. The information for the remaining three fields is the same as those indicated in the corresponding arguments.
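
To make the trueDEG encoding concrete, the sketch below builds the label vector one would expect for the three-group example on this page, using base R only. Two points are assumptions rather than documented behavior: that fractional gene counts are rounded with round(), and that the DEGs placed first are ordered by group (G1, then G2, then G3).

```r
# Inputs taken from the three-group example on this page
Ngene      <- 10000
PDEG       <- 0.3
DEG.assign <- c(0.7, 0.2, 0.1)

# Number of DEGs per group (assumption: fractions are rounded)
n_deg <- round(Ngene * PDEG * DEG.assign)

# Expected labels: 1, 2, 3 for DEGs up-regulated in G1, G2, G3,
# followed by 0 for every non-DEG (assumed ordering)
expected_trueDEG <- c(rep(seq_along(n_deg), times = n_deg),
                      rep(0, Ngene - sum(n_deg)))

# Tabulate how many genes carry each label
table(expected_trueDEG)
```

The same table() call applied to the trueDEG field of a simulated object gives a quick check that the proportions match what was requested.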

Details

The empirical distribution of read counts used in this function is calculated from an RNA-seq dataset of Arabidopsis (three biological replicates for both the treated and non-treated samples), the arab object in the NBPSeq package (Di et al., 2011). The overall design of the simulation conditions can be viewed as a pseudo-color image with the plotFCPseudocolor function.
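
The general idea of drawing counts from an NB distribution with group-specific fold-changes can be illustrated in base R. This is a simplified sketch, not the implementation used by simulateReadCounts: the baseline means and the single dispersion value are made up here, whereas the function derives them from the empirical arab distribution. All variable names are hypothetical.

```r
set.seed(1)                      # reproducibility of this sketch only

n_gene     <- 1000               # hypothetical, smaller than the default Ngene
replicates <- c(3, 3)            # two groups, three replicates each
fold       <- c(4, 4)            # fold-change for DEGs in G1 and G2
# 20% DEGs, half up-regulated in each group; 0 marks non-DEGs
true_deg   <- c(rep(1, 100), rep(2, 100), rep(0, 800))

# Made-up baseline means and one shared dispersion (assumptions)
base_mean  <- rexp(n_gene, rate = 1 / 100)
dispersion <- 0.1

# Group index of each sample: 1 1 1 2 2 2
group_of_sample <- rep(seq_along(replicates), times = replicates)

# Mean matrix (genes x samples): multiply the baseline by the fold-change
# where the gene is up-regulated in the sample's group
mu <- sapply(group_of_sample, function(g) {
  base_mean * ifelse(true_deg == g, fold[g], 1)
})

# With size = 1 / dispersion, the variance is mu + dispersion * mu^2
counts <- matrix(rnbinom(length(mu), size = 1 / dispersion, mu = mu),
                 nrow = n_gene)
dim(counts)
```

Using one dispersion for all genes keeps the sketch short; a gene-specific dispersion vector could be recycled along the rows in the same call.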

Examples

# Generate simulation data for comparing two groups
# (G1 vs. G2) without replicates (single-factor experimental design).
# The levels of DE are 3-fold in G1 and 7-fold in G2.
tcc <- simulateReadCounts(Ngene = 10000, PDEG = 0.2, 
                         DEG.assign = c(0.9, 0.1),
                         DEG.foldchange = c(3, 7),
                         replicates = c(1, 1))
dim(tcc$count)
head(tcc$count)
str(tcc$simulation)
head(tcc$simulation$trueDEG)


# Generate simulation data for comparing three groups
# (G1 vs. G2 vs. G3) with biological replicates
# (single-factor experimental design).
# The first 3,000 genes are DEGs, of which 70%, 20%, and 10% are
# up-regulated in G1, G2, and G3, respectively. The levels of DE are
# 3-, 10-, and 6-fold in the individual groups.
tcc <- simulateReadCounts(Ngene = 10000, PDEG = 0.3, 
                         DEG.assign = c(0.7, 0.2, 0.1),
                         DEG.foldchange = c(3, 10, 6), 
                         replicates = c(2, 4, 3))
dim(tcc$count)
head(tcc$count)
str(tcc$simulation)
head(tcc$simulation$trueDEG)


# Generate simulation data consisting of 10,000 rows (i.e., Ngene = 10000)
# and 8 columns (samples) for a two-factor experimental design
# (condition and time). The first 3,000 genes are DEGs (i.e., PDEG = 0.3).
# Of the 3,000 DEGs, 40% are differentially expressed in condition (or GROUP) "A"
# compared to the other condition (i.e., condition "B"), 40% are differentially
# expressed in condition (or GROUP) "B" compared to the other condition
# (i.e., condition "A"), and the remaining 20% are differentially expressed at
# "10h" in association with the second factor: DEG.assign = c(0.4, 0.4, 0.2).
# The levels of fold-change are (i) 2-fold up-regulation in condition "A" for
# the first 40% of DEGs, (ii) 4-fold up-regulation in condition "B" for the
# second 40%, and (iii) for the remaining 20%, 0.4- and 0.6-fold changes
# (i.e., down-regulation) at "10h" in "A" and 5-fold up-regulation at "10h" in "B".

group <- data.frame(
   GROUP = c( "A",  "A",   "A",   "A",  "B",  "B",   "B",   "B"),
   TIME  = c("2h", "2h", "10h", "10h", "2h", "2h", "10h", "10h")
)
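# One column per DEG pattern (matching DEG.assign), one row per sample.
# Note: the column name FACTOR1 appears twice; by default data.frame()
# renames the second occurrence to FACTOR1.1.
DEG.foldchange <- data.frame(
   FACTOR1 = c(2, 2,   2,   2, 1, 1, 1, 1),
   FACTOR1 = c(1, 1,   1,   1, 4, 4, 4, 4),
   FACTOR2 = c(1, 1, 0.4, 0.6, 1, 1, 5, 5)
)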
tcc <- simulateReadCounts(Ngene = 10000, PDEG = 0.3,
                          DEG.assign = c(0.4, 0.4, 0.2),
                          DEG.foldchange = DEG.foldchange,
                          group = group)
tcc
