Learn R Programming

scrime (version 1.3.5)

simulateSNPglm: Simulation of SNP data

Description

Simulates SNP data. Interactions of some of the simulated SNPs are then used to specify either a binary or a quantitative response by a logistic or linear regression model, respectively.

Usage

simulateSNPglm(n.obs = 1000, n.snp = 50, list.ia = NULL, list.snp = NULL, 
   beta0 = -0.5, beta = 1.5, maf = 0.25, sample.y = TRUE, p.cutoff = 0.5, 
   err.fun = NULL, rand = NA, ...)

Arguments

n.obs

number of observations that should be generated.

n.snp

number of SNPs that should be generated.

list.ia

a list consisting of numeric vectors (or values) specifying the genotypes of the SNPs that should be explanatory for the response. Each of these vectors must be composed of some of the numbers -3, -2, -1, 1, 2, 3, where 1 denotes the homozygous reference genotype, 2 the heterozygous genotype, and 3 the homozygous variant genotype, and a minus before these numbers means that the corresponding SNP should be not of this genotype. If, e.g., one of the vectors is given by c(1, -1, -3) and the corresponding vector in list.snp is c(5, 7, 8), then the corresponding interaction used in the regression model to specify the response is

(SNP5 == 1) & (SNP7 != 1) & (SNP8 != 3).

For more details, see Details. Must be specified if list.snp is specified. If both list.ia and list.snp are NULL, then the interactions shown in the Details section are used.

list.snp

a list consisting of numeric vectors specifying the SNPs that compose the interactions used in the regression model. Each of these vectors must have the same length as the corresponding vector in list.ia, and must consist of integers between 1 and n.snp, where the integer \(i\) corresponds to the \(i\)th column of the simulated SNP matrix. If list.ia is specified but not list.snp, then the first \(n\) SNPs are used to generate the interactions, where \(n\) is the total number of values in list.ia. For the case that both list.ia and list.snp are not specified, see Details.

beta0

a numeric value specifying the intercept of the regression model.

beta

a non-negative numeric value or vector of the same length as list.ia (i.e.\ one numeric value for each interaction) specifying the parameters in the regression model.

maf

either an integer, or a vector of length 2 or n.snp specifying the minor allele frequency. If an integer, all the SNPs will have the same minor allele frequency. If a vector of length n.snp, each SNP will have the minor allele frequency specified in the corresponding entry of maf. If length 2, then maf is interpreted as the range of the minor allele frequencies, and for each SNP, a minor allele frequency will be randomly drawn from a uniform distribution with the range given by maf.

sample.y

should the values of the response in the logistic regression model be randomly drawn using the probabilities of the respective observations for being a case? If FALSE, then the response value of an observation is 1 if its probability for being a case is larger than p.cutoff, and otherwise the observation is classified as a control (i.e.\ the value of the response is 0). Ignored if err.fun is specified.

p.cutoff

a probability, i.e.\ a numeric value between 0 and 1, naming the cutoff for an observation to be called a case if sample.y = FALSE. For details, see sample.y. Ignored if sample.y = TRUE or err.fun is specified.

err.fun

a function for generating the error that is added to the linear model to determine the value of the (quantitative) response. If NULL, a logistic regression model is fitted. If specified, a linear model is fitted. Therefore, this argument is used to differ between the two types of models. The specified function must have as first argument the number of values that should be generated and as output a vector consisting of these values. Further arguments can also be specified because of in simulateSNPglm. If, e.g., err.fun = rnorm, then rnorm(n.obs) will be called to generate n.obs observations from a standard normal function.

rand

a numeric value for setting the random number generator in a reproducible state.

further arguments of the function specified by err.fun.

Value

An object of class simSNPglm consisting of

x

a matrix with n.obs rows and n.snp columns containing the simulated SNP values.

y

a vector of length n.obs composed of the values of the response.

beta0

the value of the intercept.

beta

the vector of parameters.

ia

a character vector naming the explanatory interactions.

maf

a vector of length n.snp composed of the minor allele frequencies.

prob

a vector of length n.obs consisting of the values of Prob(\(Y = 1\)) (if err.fun = NULL).

err

a vector of length n.obs composed of the values of the error (if err.fun is specified).

p.cutoff

the value of p.cutoff (if err.fun = NULL and sample.y = FALSE).

err.call

a character string naming the call of the error function (if err.fun is specified).

Details

simulateSNPglm first simulates a matrix consisting of n.obs observations and n.snp SNPs, where the minor allele frequencies of these SNPs are given by maf.

Note that all SNPs are currently simulated independently of each other such that they are unlinked.

Afterwards, the response is determined by a regression model using the specifications of list.ia, list.snp, beta0 and beta. Depending on whether err.fun is specified or not, a linear or a logistic regression model is used, respectively, i.e.\ the response \(Y\) is continuous or binary.

By default, a logistic regression model

logit(Prob(\(Y = 1\))) = beta0 + beta[1] * \(L_1\) + beta[2] * \(L_2\) + …

is fitted, since err.fun = NULL.

If both list.ia and list.snp are NULL, then interactions similar to the one considered, e.g., in Nunkesser et al.\ (2007) or Schwender et al.\ (2007) are used, i.e.\

\(L_1\) = (SNP6 != 1) & (SNP7 == 1)\ \ \ and

\(L_2\) = (SNP3 == 1) & (SNP9 == 1) & (SNP10 == 1),

by setting list.ia = list(c(-1, 1), c(1, 1, 1)) and list.snp = list(c(6, 7), c(3, 9, 10)).

Using the above model Prob(\(Y = 1\)) is computed for each observation, and its value of the response is determined either by a random draw from a Bernoulli distribution using this probability (if sample.y = TRUE), or by evaluating if Prob(\(Y = 1\)) \(>\) p.cutoff (if sample.y = FALSE).

If err.fun is specified, then the linear model

\(Y\) = beta0 + beta[1] * \(L_1\) + beta[2] * \(L_2\) + … + error

is used to determine the values of the response \(Y\), where the values for error are given by the output of a call of err.fun.

References

Nunkesser, R., Bernholt, T., Schwender, H., Ickstadt, K. and Wegener, I.\ (2007). Detecting High-Order Interactions of Single Nucleotide Polymorphisms Using Genetic Programming. Bioinformatics, 23, 3280-3288.

Schwender, H.\ (2007). Statistical Analysis of Genotype and Gene Expression Data. Dissertation, Department of Statistics, University of Dortmund.

See Also

simulateSNPs, summary.simSNPglm, simulateSNPcatResponse

Examples

Run this code
# NOT RUN {
# The simulated data set described in Details.

sim1 <- simulateSNPglm()
sim1

# A bit more information: Table of probabilities of being a case
# vs. numbers of cases and controls.

summary(sim1)

# Calling an observation a case if its probability of being
# a case is larger than 0.5 (the default for p.cutoff).

sim2 <- simulateSNPglm(sample.y = FALSE)
summary(sim2)

# If ((SNP4 != 2) & (SNP3 == 1)), (SNP5 ==3) and
# ((SNP12 !=1) & (SNP9 == 3)) should be the three interactions
# (or variables) that are explanatory for the response,
# list.ia and list.snp are specified as follows.

list.ia <- list(c(-2, 1), 3, c(-1,3))
list.snp <- list(c(4, 3), 5, c(12,9))

# The binary response and the data set consisting of 
# 600 observations and 25 SNPs, where the minor allele
# frequency of each SNP is randomly drawn from a
# uniform distribution with minimum 0.1 and maximum 0.4,
# is then generated by

sim3 <- simulateSNPglm(n.obs = 600, n.snp = 25,
  list.ia = list.ia, list.snp = list.snp, maf = c(0.1, 0.4))
sim3

summary(sim3)
  
# If the response should be quantitative, err.fun has
# to be specified. To use a normal distribution with mean 0
# (default in rnorm) and a standard deviation of 2 
# as the distribution of the error, call

simulateSNPglm(err.fun = rnorm, sd = 2)
 

# }

Run the code above in your browser using DataLab