CountDataSet: Generate a simulated sequencing data set using a negative binomial model.
Description
Generate two n x p data sets, a training set and a test set, together
with outcome vectors y and yte of length n giving the class labels of
the training and test observations.
Usage
CountDataSet(n, p, K, param, sdsignal)
Arguments
n
Number of observations desired.
p
Number of features desired. Note that 30% of the features will differ
between classes, though some of those differences may be small.
K
Number of classes desired. The function requires that n be at least
4K; that is, there must be at least 4 observations per class on
average.
param
The dispersion parameter for the negative binomial distribution. The
negative binomial distribution is parameterized using "mu" and "size" as
in the R function "rnbinom", with param playing the role of "size". That
is, Y ~ NB(mu, param) means that E(Y) = mu and
Var(Y) = mu + mu^2/param.
When param is very large the distribution is essentially Poisson; when
param is small there is substantial overdispersion relative to the
Poisson distribution.
sdsignal
The extent to which the classes are different. If this equals zero then
there are no class differences and if this is large then the classes are
very different.
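The mean and variance relationship stated for param can be checked
empirically with base R. The following is a minimal sketch, not part of
CountDataSet itself: the helper nb_moments, the choice mu = 10, and the
number of draws are illustrative assumptions.

```r
# Sketch: compare empirical moments of rnbinom() draws with the
# theoretical values E(Y) = mu and Var(Y) = mu + mu^2/size.
# nb_moments is a hypothetical helper written for this illustration.
set.seed(1)
mu <- 10

nb_moments <- function(size, n_draws = 200000) {
  y <- rnbinom(n_draws, mu = mu, size = size)
  c(mean = mean(y), var = var(y), theory_var = mu + mu^2 / size)
}

small <- nb_moments(size = 2)    # small size: theoretical variance is 60
large <- nb_moments(size = 1e6)  # large size: variance close to mu, as for a Poisson

print(rbind(small, large))
```

With size = 2 the empirical variance should be several times the mean,
while with size = 1e6 the mean and variance should nearly coincide.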
Value
x
n x q data matrix of training observations. May have q < p because
features with 0 total counts are removed.
y
class labels for the n observations in x.
xte
n x q data matrix of test observations; the q features are
those with > 0 total counts in x, so q <= p.
yte
class labels for the n observations in xte.
Details
This is based in part on a function in the DESeq Bioconductor package
(Anders and Huber 2010 Genome Biology) for generating a simulated RNA
sequencing data set.
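Examples
A short usage sketch following the Usage signature above. It assumes
the function is provided by the PoiClaClu package and that the package
is installed; the argument values are arbitrary choices for
illustration, and the call is skipped if the package is unavailable.

```r
# Assumption: CountDataSet comes from the PoiClaClu package.
if (requireNamespace("PoiClaClu", quietly = TRUE)) {
  set.seed(1)
  # n = 20 observations, p = 100 features, K = 4 classes (n >= 4K holds)
  dat <- PoiClaClu::CountDataSet(n = 20, p = 100, K = 4,
                                 param = 10, sdsignal = 2)
  print(dim(dat$x))    # n rows; q <= p columns after zero-count features are dropped
  print(dim(dat$xte))  # same q features as dat$x
  print(table(dat$y))  # class labels of the training observations
}
```

Comparing dim(dat$x) with p shows how many features, if any, were
removed for having zero total counts.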