runEMclonalCN: Function to run the Expectation Maximization Algorithm in TitanCNA.

Description

Function to run the Expectation Maximization Algorithm for inference of model parameters: cellular prevalence, normal proportion, tumour ploidy. This is a key function in the TitanCNA package and is the most computationally intense. This function makes calls to a C subroutine that allows the algorithm to be run more efficiently.

Usage

runEMclonalCN(data, gParams, nParams, pParams, sParams, txnExpLen = 1e15, txnZstrength = 5e05, maxiter = 15, maxiterUpdate = 1500, pseudoCounts = 1e-300, normalEstimateMethod = "map", estimateS = TRUE,  estimatePloidy = TRUE, useOutlierState = FALSE, verbose = TRUE)

Arguments

data

list object that contains the components for the data to be analyzed. chr, posn, ref, and tumDepth that can be obtained using loadAlleleCounts, and logR that can be obtained using correctReadDepth and getPositionOverlap (see Example).

gParams

list object that contains the copy number and allelic ratio genotype parameters. Can be obtained from loadDefaultParameters.

nParams

list object that contains the normal contamination parameters. Can be obtained from loadDefaultParameters.

pParams

list object that contains the tumour ploidy parameters. Can be obtained from loadDefaultParameters.

sParams

list object that contains the subclonality (cellular prevalence and clonal cluster) parameters. Can be obtained from loadDefaultParameters.

txnExpLen

Influences prior probability of genotype transitions in the HMM. Smaller value have lower tendency to change state; however, too small and it produces underflow problems. 1e-9 works well for up to 3 million total positions.

txnZstrength

Influences prior probability of clonal cluster transitions in the HMM. Smaller value have lower tendency to change clonal cluster state. 1e-9 works well for up to 3 million total positions.

pseudoCounts

Small, machine precision values to add to probabilities to avoid underflow. For example, .Machine$double.eps.

maxiter

Maximum number of expectation-maximization iterations allowed. In practice, for TitanCNA, it will usually not exceed 20.

maxiterUpdate

Maximum number of coordinate descent iterations during the M-step (of EM algorithm) when parameters are estimated.

normalEstimateMethod

Specifies how to handle normal proportion estimation. Using map will use the maximum a posteriori estimation. Using fixed will not estimate the normal proportion; the normal proportion will be fixed to whatever is specified in params$normalParams$n_0. See Details.

estimateS

Logical indicating whether to account for clonality and estimate subclonal events. See Details.

estimatePloidy

Logical indicating whether to estimate and account for tumour ploidy.

useOutlierState

Logical indicating whether an additional outlier state should be used. In practice, this is usually not necessary.

verbose

Set to FALSE to suppress program messages.

Value

n: Converged estimate for normal contamination parameter. numeric array containing estimates at each EM iteration.
s: Converged estimate(s) for cellular prevalence parameter(s). This value is defined as the proportion of tumour sample that does not contain the aberrant genotype. This will contrast what is output in outputTitanResults. numeric array containing estimates at each EM iteration. If more than one cluster is specified, then s is a numeric matrix.
var: Converged estimates for variance parameter of the Gaussian mixtures used to model the log ratio data. numeric matrix containing estimates at each EM iteration.
phi: Converged estimate for tumour ploidy parameter. numeric array containing estimates at each EM iteration.
piG: Converged estimate for initial genotype state distribution. numeric matrix containing estimates at each EM iteration.
piZ: Converged estimate for initial clonal cluster state distribution. numeric matrix containing estimates at each EM iteration.
muR: Mean of binomial mixtures computed as a function of s and n. numeric matrix containing estimates at each EM iteration. See References for mathematical details.
muC: Mean of Gaussian mixtures computed as a function of s, n, and phi. numeric matrix containing estimates at each EM iteration. See References for mathematical details.
loglik: Posterior Log-likelihood that includes data likelihood and the priors. numeric array containing estimates at each EM iteration.
rhoG: Posterior marginal probabilities for the genotype states computed during the E-step. Only the final iteration is returned as a numeric matrix.
rhoZ: Posterior marginal probabilities for the clonal cluster states computed during the E-step. Only the final iteration is returned as a numeric matrix.
genotypeParams: Original genotype parameters. See loadDefaultParameters.
ploidyParams: Original tumour ploidy parameters. See loadDefaultParameters.
normalParams: Original normal contamination parameters. See loadDefaultParameters.
clonalParams: Original subclonal parameters. See loadDefaultParameters.
txnExpLen: Original genotype transition expected length. See loadDefaultParameters.
txnZstrength: Original clonal cluster transition expected length. See loadDefaultParameters.
useOutlierState: Original setting indicating usage of outlier state. See loadDefaultParameters.

Details

This function is implemented with the "foreach" package and therefore supports parallelization. See "doMC" or "doMPI" for some parallelization packages. The forwards-backwards algorithm is used for the E-step in the EM algorithm. This is done using a call to a C subroutine for each chromosome. The maximization step uses maximum a posteriori (MAP) for estimation of parameters.

If the sample has absolutely no normal contamination, then assign nParams$n_0 <- 0 and use argument normalEstimateMethod="fixed". estimateS should always be set to TRUE. If no subclonality is expected, then use loadDefaultParameters(numberClonalClusters=1). Using estimateS=FALSE and loadDefaultParameters(numberClonalClusters=0) is gives more or less the same results.

References

Ha, G., Roth, A., Khattra, J., Ho, J., Yap, D., Prentice, L. M., Melnyk, N., McPherson, A., Bashashati, A., Laks, E., Biele, J., Ding, J., Le, A., Rosner, J., Shumansky, K., Marra, M. A., Huntsman, D. G., McAlpine, J. N., Aparicio, S. A. J. R., and Shah, S. P. (2014). TITAN: Inference of copy number architectures in clonal cell populations from tumour whole genome sequence data. Genome Research, 24: 1881-1893. (PMID: 25060187)

Examples

Run this code

message('Running TITAN ...')
#### LOAD DATA ####
infile <- system.file("extdata", "test_alleleCounts_chr2.txt", 
              package = "TitanCNA")
data <- loadAlleleCounts(infile)

#### LOAD PARAMETERS ####
message('titan: Loading default parameters')
numClusters <- 2
params <- loadDefaultParameters(copyNumber = 5, 
              numberClonalClusters = numClusters, skew = 0.1)

#### READ COPY NUMBER FROM HMMCOPY FILE ####
message('titan: Correcting GC content and mappability biases...')
tumWig <- system.file("extdata", "test_tum_chr2.wig", package = "TitanCNA")
normWig <- system.file("extdata", "test_norm_chr2.wig", package = "TitanCNA")
gc <- system.file("extdata", "gc_chr2.wig", package = "TitanCNA")
map <- system.file("extdata", "map_chr2.wig", package = "TitanCNA")
cnData <- correctReadDepth(tumWig, normWig, gc, map)
logR <- getPositionOverlap(data$chr, data$posn, cnData)
data$logR <- log(2^logR) #transform to natural log

#### FILTER DATA FOR DEPTH, MAPPABILITY, NA, etc ####
data <- filterData(data, 1:24, minDepth = 10, maxDepth = 200, map = NULL)

#### EM (FWD-BACK) TO TRAIN PARAMETERS ####
#### Can use parallelization packages ####
K <- length(params$genotypeParams$alphaKHyper)
params$genotypeParams$alphaKHyper <- rep(500, K)
params$ploidyParams$phi_0 <- 1.5
convergeParams <- runEMclonalCN(data, gParams = params$genotypeParams, 
                                nParams = params$normalParams, 
                                pParams = params$ploidyParams, 
                                sParams = params$cellPrevParams, 
                                maxiter = 3, maxiterUpdate = 500, 
                                txnExpLen = 1e15, txnZstrength = 5e5, 
                                useOutlierState = FALSE, 
                                normalEstimateMethod = "map", 
                                estimateS = TRUE, estimatePloidy = TRUE)

Run the code above in your browser using DataLab