These functions implement a variety of initialization methods for the parameters of a Poisson mixture model: the Small-EM initialization strategy (emInit) described in Rau et al. (2011); a K-means initialization strategy (kmeanInit), which is itself used to initialize the Small-EM strategy; the splitting Small-EM initialization strategy (splitEMInit), based on that described in Papastamoulis et al. (2014); and a function to initialize a Small-EM strategy using the posterior probabilities (probaPostInit) obtained from a previous run with one fewer cluster, following the splitting strategy.
emInit(y, g, conds, norm, alg.type = "EM",
  init.runs, init.iter, fixed.lambda, equal.proportions, verbose)

kmeanInit(y, g, conds, norm, fixed.lambda,
  equal.proportions)

splitEMInit(y, g, conds, norm, alg.type, fixed.lambda,
  equal.proportions, prev.labels, prev.probaPost, init.runs,
  init.iter, verbose)

probaPostInit(y, g, conds, norm, alg.type = "EM",
  fixed.lambda, equal.proportions, probaPost.init, init.iter,
  verbose)
pi.init: Vector of length g containing the estimate for \(\hat{\boldsymbol{\pi}}\) corresponding to the highest log likelihood (or completed log likelihood) from the chosen initialization strategy.
lambda.init: (d x g) matrix containing the estimate of \(\hat{\boldsymbol{\lambda}}\) corresponding to the highest log likelihood (or completed log likelihood) from the chosen initialization strategy, where d is the number of conditions and g is the number of clusters.
(d x g) matrix containing the estimate of \(\hat{\boldsymbol{\lambda}}\) arising from the splitting initialization and Small-EM run for a single split, where d is the number of conditions and g is the number of clusters.
Vector of length g containing the estimate for \(\hat{\boldsymbol{\pi}}\) arising from the splitting initialization and Small-EM run for a single split, where g is the number of clusters.
Log likelihood arising from the splitting initialization and Small-EM run for a single split.
(n x q) matrix of observed counts for n observations and q variables
Number of clusters. If fixed.lambda contains a list of lambda values to be fixed, g corresponds to the number of clusters in addition to those fixed.
Vector of length q defining the condition (treatment group) for each variable (column) in y
The type of estimator to be used to normalize for differences in library size: “TC” for total count, “UQ” for upper quantile, “Med” for median, “DESeq” for the normalization method in the DESeq package (Anders and Huber, 2010), or “TMM” for the TMM normalization method (Robinson and Oshlack, 2010). Can also be a vector of length q containing pre-estimated library size estimates for each sample; a rough sketch of the simpler estimators appears after this argument list.
Algorithm to be used for parameter estimation (“EM” or “CEM” for the EM or CEM algorithm, respectively)
For the Small-EM algorithm, the number of independent runs to be performed; for the splitting Small-EM algorithm, the number of cluster splits to be performed.
The number of iterations to run within each Small-EM algorithm
If one (or more) clusters with fixed values of lambda are desired, a list containing vectors of length d (the number of conditions). Note that the chosen values of lambda must satisfy the constraint noted in the technical report.
If TRUE, the cluster proportions are set to be equal for all clusters. Default is FALSE (unequal cluster proportions).
A vector of length n of cluster labels obtained from the previous run (g-1 clusters)
An n x (g-1) matrix of the conditional probabilities of each observation belonging to each of the g-1 clusters from the previous run
An n x g matrix of the conditional probabilities of each observation belonging to each of the g clusters following the splitting strategy in the splitEMInit function
If TRUE, include verbose output
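Because norm accepts either a named estimator or a vector of pre-estimated library sizes, the sketch below illustrates how the simpler estimators could be computed. This is a minimal, hypothetical sketch (librarySizeSketch is not part of the package), assuming per-sample factors rescaled to sum to one:

librarySizeSketch <- function(y, norm = c("TC", "Med", "UQ")) {
  norm <- match.arg(norm)
  s <- switch(norm,
    TC  = colSums(y),                           ## total count per sample (column)
    Med = apply(y, 2, median),                  ## per-sample median count
    UQ  = apply(y, 2, quantile, probs = 0.75))  ## upper quantile (75th percentile assumed)
  s / sum(s)  ## rescale so the estimated library sizes sum to one
}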
Andrea Rau
In practice, the user will not directly call the initialization functions described here; they are called indirectly for a single number of clusters through the PoisMixClus function (via init.type), or for a sequence of cluster numbers through the PoisMixClusWrapper function (via gmin.init.type and split.init).
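A hedged sketch of these indirect calls is given below; the argument names init.type, gmin.init.type, and split.init are taken from this page, while the remaining arguments (and the value "small-em") are illustrative assumptions, so consult the PoisMixClus and PoisMixClusWrapper help pages for the authoritative signatures.

## run.single <- PoisMixClus(y, g = 4, conds = conds, norm = "TC",
##    init.type = "small-em")
## run.seq <- PoisMixClusWrapper(y, gmin = 1, gmax = 6, conds = conds,
##    norm = "TC", gmin.init.type = "small-em", split.init = TRUE)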
To initialize parameter values for the EM and CEM algorithms with the Small-EM strategy (Biernacki et al., 2003), we use the emInit function as follows. For a given number of independent runs (given by init.runs), the following procedure is used to obtain parameter values: first, a K-means algorithm (MacQueen, 1967) is run to partition the data into g clusters (\(\hat{\boldsymbol{z}}^{(0)}\)). Second, initial parameter values \(\boldsymbol{\pi}^{(0)}\) and \(\boldsymbol{\lambda}^{(0)}\) are calculated (see Rau et al. (2011) for details). Third, a given number of iterations of an EM algorithm are run (defined by init.iter), using \(\boldsymbol{\pi}^{(0)}\) and \(\boldsymbol{\lambda}^{(0)}\) as initial values. Finally, among the init.runs sets of parameter values, we use \(\hat{\boldsymbol{\lambda}}\) and \(\hat{\boldsymbol{\pi}}\) corresponding to the highest log likelihood or completed log likelihood to initialize the subsequent full EM or CEM algorithm, respectively.
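The four steps above may be summarized in the following sketch, where smallEMSketch, initParams, and emIterate are hypothetical stand-ins for the package internals, shown only to make the control flow concrete:

smallEMSketch <- function(y, g, init.runs = 50, init.iter = 10) {
  best <- list(log.like = -Inf)
  for (run in seq_len(init.runs)) {
    ## Step 1: K-means partition into g clusters (the exact data
    ## transformation used by kmeanInit may differ from this one)
    z0 <- kmeans(log(y + 1), centers = g)$cluster
    ## Step 2: pi^(0) and lambda^(0) from the partition
    ## (initParams is hypothetical; see Rau et al. (2011) for details)
    par0 <- initParams(y, z0)
    ## Step 3: a small, fixed number of EM iterations (emIterate is hypothetical)
    fit <- emIterate(y, par0, iter = init.iter)
    ## Step 4: retain the run with the highest (completed) log likelihood
    if (fit$log.like > best$log.like) best <- fit
  }
  best  ## the best parameter values seed the full EM or CEM algorithm
}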
For the splitting Small-EM initialization strategy, we implement an approach similar to that described in Papastamoulis et al. (2014): the cluster from the previous run (with g-1 clusters) that has the largest entropy is chosen to be split into two new clusters, followed by a Small-EM run as described above.
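One plausible reading of the entropy criterion is sketched below: for each cluster k of the previous run, compute \(-\sum_{i} t_{ik} \log(t_{ik})\) from the posterior probabilities \(t_{ik}\) (the prev.probaPost argument above) and split the cluster that maximizes it. clusterEntropy is a hypothetical helper, not the package's internal code:

clusterEntropy <- function(probaPost) {
  p <- pmax(probaPost, 1e-10)   ## guard against log(0)
  colSums(-probaPost * log(p))  ## per-cluster entropy contributions
}
## split.candidate <- which.max(clusterEntropy(prev.probaPost))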
Anders, S. and Huber, W. (2010) Differential expression analysis for sequence count data. Genome Biology, 11(R106), 1-28.
Biernacki, C., Celeux, G., Govaert, G. (2003) Choosing starting values for the EM algorithm for getting the highest likelihood in multivariate Gaussian mixture models. Computational Statistics and Data Analysis, 41(1), 561-575.
MacQueen, J. B. (1967) Some methods for classification and analysis of multivariate observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, number 1, pages 281-297. Berkeley, University of California Press.
Papastamoulis, P., Martin-Magniette, M.-L., and Maugis-Rabusseau, C. (2014). On the estimation of mixtures of Poisson regression models with large number of components. Computational Statistics and Data Analysis: 3rd special Issue on Advances in Mixture Models, DOI: 10.1016/j.csda.2014.07.005.
Rau, A., Celeux, G., Martin-Magniette, M.-L., Maugis-Rabusseau, C. (2011). Clustering high-throughput sequencing data with Poisson mixture models. Inria Research Report 7786. Available at https://inria.hal.science/inria-00638082.
Rau, A., Maugis-Rabusseau, C., Martin-Magniette, M.-L., Celeux G. (2015). Co-expression analysis of high-throughput transcriptome sequencing data with Poisson mixture models. Bioinformatics, 31(9):1420-1427.
Robinson, M. D. and Oshlack, A. (2010) A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biology, 11(R25).
PoisMixClus for Poisson mixture model estimation for a given number of clusters, and PoisMixClusWrapper for Poisson mixture model estimation and model selection for a sequence of cluster numbers.
set.seed(12345)
## Simulate data as shown in Rau et al. (2011)
## Library size setting "A", high cluster separation
## n = 500 observations
simulate <- PoisMixSim(n = 500, libsize = "A", separation = "high")
y <- simulate$y
conds <- simulate$conditions
## Calculate initial values for lambda and pi using the Small-EM
## initialization (4 classes, PMM-II model with "TC" library size)
##
## init.values <- emInit(y, g = 4, conds,
## norm = "TC", alg.type = "EM",
## init.runs = 50, init.iter = 10, fixed.lambda = NA,
## equal.proportions = FALSE, verbose = FALSE)
## pi.init <- init.values$pi.init
## lambda.init <- init.values$lambda.init
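## The splitting strategy could be initialized analogously; here "prev"
## is a hypothetical object holding the cluster labels and posterior
## probabilities from a previous run with g-1 = 3 clusters:
##
## split.values <- splitEMInit(y, g = 4, conds, norm = "TC",
##    alg.type = "EM", fixed.lambda = NA, equal.proportions = FALSE,
##    prev.labels = prev$labels, prev.probaPost = prev$probaPost,
##    init.runs = 50, init.iter = 10, verbose = FALSE)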