The general purpose of the package is to discover, or explain, group structures in multivariate data sets with unknown
class (cluster analysis or clustering) or known class (discriminant analysis or classification). It is an exploratory data
analysis tool for solving clustering and classification problems, but it can also be regarded as a semi-parametric tool
to estimate densities with Gaussian mixture distributions and multinomial distributions.
Mathematically, a mixture probability density function (pdf) \(f\) is a weighted sum of \(K\) component densities:
$$
f({\bf x}_i|\theta) = \sum_{k=1}^{K}p_kh({\bf x}_i|\lambda_k)
$$
where \(h(\cdot|\lambda_k)\) denotes a \(d\)-dimensional distribution parametrized by \(\lambda_k\).
The parameters of the mixture are the mixing proportions \(p_k\) and the component distribution parameters \(\lambda_k\).
In the Gaussian case, \(h\) is the density of a Gaussian distribution with mean \(\mu_k\) and variance
matrix \(\Sigma_k\), and thus \(\lambda_k = (\mu_k,\Sigma_k)\).
In the qualitative case, \(h\) is a multinomial distribution and \(\lambda_k=(a_k,\epsilon_k)\) is the parameter
of the distribution.
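As a concrete illustration of this definition (a standalone sketch, not part of the package), the following Python code evaluates a two-component Gaussian mixture density at a point; the proportions, means and variance matrices are arbitrary example values.

```python
# Illustrative sketch only: evaluates f(x) = sum_k p_k h(x | lambda_k)
# for an arbitrary two-component Gaussian mixture in dimension d = 2.
import numpy as np
from scipy.stats import multivariate_normal

p = np.array([0.4, 0.6])                                  # mixing proportions p_k
mu = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]         # means mu_k
Sigma = [np.eye(2), np.array([[2.0, 0.5], [0.5, 1.0]])]   # variance matrices Sigma_k

def mixture_pdf(x):
    """Weighted sum of the K component densities h(x | mu_k, Sigma_k)."""
    return sum(p_k * multivariate_normal(mean=m, cov=S).pdf(x)
               for p_k, m, S in zip(p, mu, Sigma))

print(mixture_pdf(np.array([1.0, 1.0])))
```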
Estimation of the mixture parameters is performed either through maximum likelihood via the EM
(Expectation Maximization, Dempster et al. 1977) or SEM (Stochastic EM, Celeux and Diebolt 1985) algorithms,
or through classification maximum likelihood via the CEM algorithm (Clustering EM, Celeux and Govaert 1992).
These three algorithms can be chained to obtain original fitting strategies (e.g. CEM followed by EM initialized with the CEM results)
that combine the advantages of each of them in the estimation process. As mixture problems usually have multiple relative maxima,
the program will produce different results depending on the initial estimates supplied by the user. If the user does
not supply initial estimates, several initialization procedures are proposed (random centers, for instance).
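To make the estimation step concrete, here is a minimal sketch of the plain EM algorithm for a Gaussian mixture, written in generic Python/NumPy rather than taken from the package; it assumes unconstrained variance matrices and uses random centers for initialization.

```python
# Minimal EM sketch for a Gaussian mixture (illustration only, not the
# package's implementation). Initialization uses random centers.
import numpy as np
from scipy.stats import multivariate_normal

def em_gaussian_mixture(X, K, n_iter=100, seed=None):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Initialization: random centers, common covariance, equal proportions.
    mu = X[rng.choice(n, size=K, replace=False)]
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])
    p = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E step: posterior probabilities t_ik that x_i belongs to component k.
        dens = np.column_stack([
            p[k] * multivariate_normal(mu[k], Sigma[k]).pdf(X) for k in range(K)
        ])
        t = dens / dens.sum(axis=1, keepdims=True)
        # M step: update proportions, means and variance matrices.
        nk = t.sum(axis=0)
        p = nk / n
        mu = (t.T @ X) / nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (t[:, k, None] * diff).T @ diff / nk[k] + 1e-6 * np.eye(d)
    return p, mu, Sigma, t
```

The CEM variant replaces the E step by a hard assignment of each observation to its most probable component, SEM draws the assignments at random from the posterior probabilities, and chaining algorithms amounts to using the output of one run as the initial estimates of the next.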
It is possible to constrain some input parameters; for example, the dispersions can be required to be equal across classes.
In the Gaussian case, fourteen models are implemented. They are based on the eigenvalue decomposition of the variance
matrices and are the most commonly used models. They differ in the constraints imposed on the variance matrices (for example,
the same variance matrix for all clusters, or spherical variance matrices) and are suitable for data sets in any dimension.
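For reference, the decomposition in question is the usual spectral parametrization of each variance matrix, stated here in its standard form (the section above does not restate it):
$$
\Sigma_k = \lambda_k D_k A_k D_k^{T}
$$
where \(\lambda_k\) here denotes a scalar volume parameter (not the component parameter vector used above), \(D_k\) is the orthogonal matrix of eigenvectors controlling the orientation of cluster \(k\), and \(A_k\) is a diagonal matrix with determinant 1 controlling its shape. The models are obtained by constraining the volumes, orientations and shapes to be equal or free across clusters, together with diagonal (axis-aligned) and spherical special cases.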
In the qualitative case, five multinomial models are available. They are based on a reparametrization of the multinomial
probabilities.
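As a hedged reminder of the kind of reparametrization typically used here (the exact form is not restated in this section), the probability that variable \(j\) takes level \(s\) in cluster \(k\) can be written with a modal (center) level \(a_k^{j}\) and a dispersion \(\epsilon_k^{j}\):
$$
\Pr(x^{j}=s \mid \text{cluster } k)=
\begin{cases}
1-\epsilon_k^{j} & \text{if } s=a_k^{j},\\
\epsilon_k^{j}/(m_j-1) & \text{otherwise,}
\end{cases}
$$
where \(m_j\) is the number of levels of variable \(j\). The five models then correspond to different constraints on the dispersion parameters (common or not across variables and clusters, or level-dependent).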
In both cases, the models and the number of clusters can be chosen using several criteria:
BIC (Bayesian Information Criterion), ICL (Integrated Completed Likelihood, a classification version of BIC),
NEC (Normalized Entropy Criterion), or Cross-Validation (CV).
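As an illustration of how such a criterion is used (a generic sketch under the usual convention that BIC is minimized, not the package's code), the BIC of a fitted \(K\)-component Gaussian mixture penalizes the maximized log-likelihood by the number of free parameters:

```python
# Generic BIC sketch for choosing K (illustration only). It reuses the
# em_gaussian_mixture() sketch above; an unconstrained d-dimensional Gaussian
# mixture with K components has (K - 1) + K*d + K*d*(d + 1)/2 free parameters.
import numpy as np
from scipy.stats import multivariate_normal

def bic(X, p, mu, Sigma):
    n, d = X.shape
    K = len(p)
    dens = np.column_stack([
        p[k] * multivariate_normal(mu[k], Sigma[k]).pdf(X) for k in range(K)
    ])
    loglik = np.log(dens.sum(axis=1)).sum()
    n_params = (K - 1) + K * d + K * d * (d + 1) // 2
    return -2 * loglik + n_params * np.log(n)

# Choose K minimizing BIC over a small range of candidates, e.g.:
# best_K = min(range(1, 6), key=lambda K: bic(X, *em_gaussian_mixture(X, K)[:3]))
```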