gen.data: Simulate data sets

Description

The functions gen.data and gen.data2 generate one or more two-class data matrices where the first nbiom variables are changed in the treatment class. The aim is to provide an easy means to evaluate the performance of biomarker identification methods. Function gen.data samples from a multivariate normal distribution; gen.data2 generates spiked data either by adding differences to the first columns, or by multiplying with factors given by the user. Note that whereas gen.data will provide completely new simulated data, both for the control and treatment classes, gen.data2 essentially only changes the biomarker part of the treated class.
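
The idea behind gen.data can be sketched in base R. This is an illustration only, not the package implementation: the helper name sim_two_class is hypothetical, the multivariate normal draws use a Cholesky factor of cormat, and it is an assumption that the class difference is a shift of group.diff in the means of the first nbiom variables.

```r
# Illustrative sketch only; sim_two_class is a hypothetical helper,
# not the package's gen.data implementation.
sim_two_class <- function(ncontrol, ntreated = ncontrol, nvar, nbiom = 5,
                          group.diff = 0.5, means = rep(0, nvar),
                          cormat = diag(nvar)) {
  # Draw n rows from N(mu, cormat) using the Cholesky factor of cormat
  rmvn <- function(n, mu) {
    z <- matrix(rnorm(n * nvar), nrow = n)
    sweep(z %*% chol(cormat), 2, mu, "+")
  }
  # Assumption: the treated class differs by a shift in the first nbiom means
  shift <- rep(c(group.diff, 0), c(nbiom, nvar - nbiom))
  list(X = rbind(rmvn(ncontrol, means), rmvn(ntreated, means + shift)),
       Y = factor(rep(c("control", "treated"), c(ncontrol, ntreated))),
       n.biomarkers = nbiom)
}

set.seed(1)
sim <- sim_two_class(ncontrol = 10, nvar = 20, nbiom = 3, group.diff = 2)
dim(sim$X)     # rows = ncontrol + ntreated, columns = nvar
table(sim$Y)   # class sizes
```

Unlike gen.data, this sketch returns a single matrix rather than an array of nsimul data sets.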

Usage

gen.data(ncontrol, ntreated = ncontrol, nvar, nbiom = 5, group.diff = 0.5, nsimul = 100, means = rep(0, nvar), cormat = diag(nvar))
gen.data2(X, ncontrol, nbiom, spikeI, type = c("multiplicative", "additive"), nsimul = 100, stddev = .05)

Arguments

ncontrol, ntreated

Numbers of objects in the two classes. If only ncontrol is given, the two classes are assumed to be of equal size, or, in the case of gen.data2, the remainder of the samples are taken to be the treatment samples.

nvar

Number of variables.

nbiom

Number of biomarkers, i.e. the number of variables to be changed in the treatment class compared to the control class. The variables that are changed are always the first variables in the data matrix.

group.diff

Group difference: the average difference between the values of the biomarkers in the two classes.

nsimul

Number of data sets to simulate.

means

Mean values of all variables, a vector.

cormat

Correlation matrix to be used in the simulation. Default is the identity matrix.

X

Experimental data matrix, without group differences.

spikeI

A vector of at least three different numbers, used to generate new values for the biomarker variables in the treated class.

type

Whether to use multiplication (useful when simulating cases where things like "twofold differences" are relevant), or addition (in the case of absolute differences in the treatment and control groups).

stddev

Additional noise: in every simulation, normally distributed noise with a standard deviation of stddev * mean(spikeI) will be added to spikeI before generating the actual simulated data.
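
The noise scale described for stddev can be worked through in a few lines of base R. This is a sketch of the arithmetic stated above, not the package code; the variable names noise_sd and spike_noisy are made up for illustration.

```r
set.seed(42)
spikeI <- c(1.8, 2.0, 2.2)
stddev <- 0.05

# Standard deviation of the added noise: stddev * mean(spikeI) = 0.05 * 2.0
noise_sd <- stddev * mean(spikeI)

# Perturb the spike values with normally distributed noise of that scale
spike_noisy <- spikeI + rnorm(length(spikeI), mean = 0, sd = noise_sd)
spike_noisy
```

With the default stddev the perturbed values stay close to the supplied spikeI, so the nominal fold change is largely preserved while the simulated data sets still differ from one another.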

Value

A list with the following elements:

X: An array of dimension (ncontrol + ntreated) times nvar times nsimul, i.e. one data matrix per simulation.
Y: The class vector.
n.biomarkers: The number of biomarkers.

Details

The spikeI argument in function gen.data2 provides the numbers that will be used to artificially "spike" the biomarker variables, either by multiplication (the default) or by addition. To obtain approximate two-fold differences, for example, one could use spikeI = c(1.8, 2.0, 2.2). At least three different values should be given since in most cases more than one set will be simulated and we require different values in the biomarker variables.
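
The difference between the two spiking types can be illustrated with a small base R sketch. The helper spike_treated is hypothetical and is not the code used by gen.data2; in particular, drawing one spike value per biomarker column is an assumption, and the extra noise controlled by stddev is left out.

```r
# spike_treated is a hypothetical helper, not the package's gen.data2 code.
spike_treated <- function(X, ncontrol, nbiom, spikeI,
                          type = c("multiplicative", "additive")) {
  type <- match.arg(type)
  treated <- (ncontrol + 1):nrow(X)   # rows after the controls are treated
  # Assumption: one spike value is drawn per biomarker column
  spike <- sample(spikeI, nbiom, replace = TRUE)
  block <- X[treated, seq_len(nbiom), drop = FALSE]
  X[treated, seq_len(nbiom)] <- if (type == "multiplicative") {
    sweep(block, 2, spike, "*")       # fold changes
  } else {
    sweep(block, 2, spike, "+")       # absolute differences
  }
  X
}

set.seed(7)
X0 <- matrix(rexp(20 * 50), nrow = 20)   # stand-in for experimental data
X1 <- spike_treated(X0, ncontrol = 10, nbiom = 5, spikeI = c(1.8, 2.0, 2.2))

# Ratio of spiked to original values in the treated biomarker block
range(X1[11:20, 1:5] / X0[11:20, 1:5])
```

Only the first nbiom columns of the treated rows change; the control rows and all other variables are returned as they were supplied.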

Examples

## Not run: 
# X <- gen.data(10, nvar = 200)
# names(X)
# dim(X$X)
# 
# set.seed(7)
# simdat <- gen.data(10, nvar = 1200, nbiom = 22, nsimul = 1,
#                    group.diff = 2)
# simdat.stab <- get.biom(simdat$X[,,1], simdat$Y, fmethod = "all",
#                         type = "stab", ncomp = 3, scale.p = "auto")
# ## show LASSO success
# traceplot(simdat.stab, lty = 1, col = rep(2:1, c(22, 1610)))
# 
# data(SpikePos)
# real.markers <- which(SpikePos$annotation$found.in.standards > 0)
# X.no.diff <- SpikePos$data[1:20, -real.markers]
# 
# set.seed(7)
# simdat2 <- gen.data2(X.no.diff, ncontrol = 10, nbiom = 22,
#                      spikeI = c(1.2, 1.4, 2), nsimul = 1)
# simdat2.stab <- get.biom(simdat2$X[,,1], simdat2$Y,
#                          fmethod = "all", type = "stab", ncomp = 3,
#                          scale.p = "auto")
# ## show LASSO success
# traceplot(simdat2.stab, lty = 1, col = rep(2:1, c(22, 1610)))
# ## End(Not run)