CodomMarker: Function to fit a mixture model to a vector of signal ratios of a single bi-allelic marker

Description

This function fits a specified mixture model to a vector of signal ratios of multiple samples for a single bi-allelic marker. Returns a list with results from the fitted mixture model.

Usage

CodomMarker(y, ng = 5, mutype = 0, sdtype = "sd.const", ptype = "p.free",  clus = TRUE, mu.start = NA, sd.start=rep(0.075,ng), sd.fixed = 0.075,  p = NA, maxiter = 500, maxn.bin = 200, nbin = 200, plothist = TRUE,  nbreaks = 40, maintitle = NULL, subtitle = NULL, xlabel = NULL,  xaxis = "s")

Arguments

the vector of signal ratios (each value is from one sample, vector y contains the values for 1 marker). All values must be between 0 and 1 (inclusive), NAs are not allowed. The minimum length of y is 10*ng.

the number of possible genotypes (mixture components) to be fitted: one more than the ploidy of the samples

mutype

an integer in 0:6. Describes how to fit the means of the components of the mixture model: with mutype=0 the means are not constrained, requiring ng degrees of freedom. With mutype in 1:6 the means are constrained based on the ng possible allele ratios according to one of 6 models; see Details.

sdtype

one of "sd.const", "sd.free", "sd.fixed". Describes how to fit the standard deviations of the components of the mixture model: with "sd.const" all standard deviations (on the transformed scale) are equal (requiring 1 degree of freedom); with "sd.free" all standard deviations are fitted separately (ng d.f.); with "sd.fixed" all sd's on the transformed scale are equal to parameter sd.fixed (0 d.f.).

ptype

one of "p.free", "p.fixed" or "p.HW". Describes how to fit the mixing proportions of the components of the mixture model: with "p.free", the proportions are not constrained (and require ng-1 degrees of freedom); with "p.fixed" the proportions given in parameter p are fixed; with "p.HW" the proportions are calculated from the overall allele frequency, requiring only 1 degree of freedom.

clus

boolean. If TRUE, the initial means and standard deviations are based on a kmeans clustering into ng groups. If false, the initial means are equally spaced on the transformed scale between the values corresponding to 0.02 and 0.98 on the original scale and the initial standard deviations are 0.075 on the transformed scale.

mu.start

vector of ng values. If present, gives the start values of mu (the means of the mixture components) on the original (untransformed) scale, must be strictly ascending (mu[i]>mu[i-1]). Overrides the start values determined by clus TRUE or FALSE.

sd.start

vector of ng values. If present, gives the start values of sd (the standard deviations of the mixture components) on the transformed scale. Overrides the start values determined by clus TRUE or FALSE.

sd.fixed

vector, recycled if less than ng values: if argument sdtype is "sd.fixed", argument sd.fixed specifies the fixed standard deviations.

a vector of ng elements with the initial (or fixed, if parameter ptype is "p.fixed") mixing proportions of the mixture model components.

maxiter

integer: the maximum number of times the nls function is called in CodomMarker (0 = no limit, default=500)

maxn.bin

integer, default=200: if the length of y is larger than max.nbin the values of y (after arcsine square root transformation) are binned (i.e. the range of y (0 to pi/2) is divided into nbin bins of equal width and the number of y values in each bin is used as the weight of the midpoints of each bin). This results in significant speed improvement with large numbers of samples without noticeable effects on model fitting.

nbin

integer, default=200: the number of bins(see maxn.bin)

plothist

If TRUE a histogram of y is plotted with the fitted distributions superimposed

nbreaks

number of breaks for plotting the histogram; does not have an effect on fitting the mixture model

maintitle

string, used for plotting

subtitle

string, used for plotting

xlabel

string, used for plotting

xaxis

string, used for plotting: if "n" no x-axis is plotted

Value

message: the error message
loglik: the optimized log-likelihood
npar: the number of fitted parameters
AIC: Akaike's Information Criterion
BIC: Bayesian Information Criterion
psi: a list with components mu, sigma and p: each a vector of length ng with the means, standard deviations and mixing proportions of the components of the fitted mixture model;the means and standard deviations are on the transformed scale
post: a matrix of ng columns and length(y) rows; each row r gives the ng probabilities that the y[r] belongs to the ng components
nobs: the number of observations in y (excluding NA's and possibly removed outliers)
iter: the number of iterations
message: an error message, "" if no error
back: a list with components mu.back and sigma.back: each a vector of length ng with the means and standard deviations of the mixture model components back-transformed to the original scale.

Details

This function takes as input a vector of ratios of the signals of two alleles (a and b) at a genetic marker locus (ratios as a/(a+b)), one for each sample, and fits a mixture model with ng components (for a tetraploid species: ng=5 components representing the nulliplex, simplex, duplex, triplex and quadruplex genotypes). Ideally these signal ratios should reflect the possible allele ratios (for a tetraploid: 0, 0.25, 0.5, 0.75, 1) but in real life they show a continuous distribution with a number of more or less clearly defined peaks. The arguments specify what model to fit and with what values the iterative fitting process should start. If the argument mutype is set to a value in 1:6 the means of the mixture model components are constrained based on the possible allele ratios. This constraint takes the form of one of 6 possible models, specified by mutype, as follows: 1: a basic model assuming that both allele signals have a linear response to the allele dosage; one parameter for the ratio of the slopes of the two signal responses, and two parameters for the background levels (intercepts) of both signals (total 3 parameters). 2: as 1, but with the same background level for both signals (2 parameters) 3: as 1, with two parameters for a quadratic effect in the signal responses (5 parameters) 4: as 3, but with the same background level for both signals (4 parameters) 5: as 3, but with the same quadratic parameter for both signal responses (4 parameters) 6: as 5, but with the same background level for both signals (3 parameters)

References

Voorrips, RE, G Gort, B Vosman (2011) Genotype calling in tetraploid species from bi-allelic marker data using mixture models BMC Bioinformatics 12: 172.

Examples

Run this code

data(tetra.potato.SNP)
SNP6 <- subset(tetra.potato.SNP, MarkerName=='PotSNP006')
# Single marker, single mixture model
rawratio <- SNP6$X_Raw/(SNP6$X_Raw+SNP6$Y_Raw) 
unmix <- CodomMarker(rawratio)

Run the code above in your browser using DataLab

Description

Usage

Arguments

Value

Details

References

See Also

Examples