flexmixedruns fits a latent class mixture (clustering) model in which some variables are continuous and modelled within the mixture components by Gaussian distributions, and some variables are categorical and modelled within components by independent multinomial distributions. The fit is by maximum likelihood estimation computed with the EM algorithm. The number of components can be estimated by the BIC.
Note that at least one categorical variable is needed, but it is possible to use data without continuous variables.
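To make the model concrete, here is a minimal base-R sketch of the mixture density for one observation with two continuous variables and one categorical variable. All parameter values and the helper name mixed_density are invented for illustration and are not part of fpc; the sketch assumes diagonal covariance matrices.

```r
# Hypothetical parameters of a 2-component model (illustration only).
prior <- c(0.6, 0.4)                  # mixing proportions
means <- list(c(0, 0), c(3, 3))       # component means of the continuous variables
sds   <- list(c(1, 1), c(1, 2))       # component standard deviations (diagonal covariance)
probs <- list(c(0.7, 0.2, 0.1),       # category probabilities in component 1
              c(0.1, 0.3, 0.6))       # category probabilities in component 2

# xc: continuous values, xd: category index (1, 2 or 3).
mixed_density <- function(xc, xd) {
  # Within a component: independent Gaussians times the category probability.
  comp <- sapply(seq_along(prior), function(k) {
    prod(dnorm(xc, mean = means[[k]], sd = sds[[k]])) * probs[[k]][xd]
  })
  weighted <- prior * comp
  list(density = sum(weighted),               # mixture density
       posterior = weighted / sum(weighted))  # posterior component probabilities
}

res <- mixed_density(c(0, 0), 1)
```

Assigning the observation to which.max(res$posterior) corresponds to the maximum posterior rule mentioned in the description of the output.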
flexmixedruns(x,diagonal=TRUE,xvarsorted=TRUE,
continuous,discrete,ppdim=NULL,initial.cluster=NULL,
simruns=20,n.cluster=1:20,verbose=TRUE,recode=TRUE,
allout=TRUE,control=list(minprior=0.001),silent=TRUE)
x: data matrix or data frame. The data need to be organised case-wise, i.e., if there are categorical variables only, and 15 cases with values c(1,1,2) on the 3 variables, the data matrix needs 15 rows with values 1 1 2. (Categorical variables can take numbers, strings, or anything else that can be coerced to factor levels as values.)
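If the data are stored as patterns with frequencies, they have to be expanded to one row per case first. A minimal base-R sketch, assuming a hypothetical frequency column named freq:

```r
# Aggregated data: one row per distinct pattern plus a frequency column.
agg <- data.frame(v1 = c(1, 2), v2 = c(1, 1), v3 = c(2, 3), freq = c(15, 5))

# Repeat each row 'freq' times to obtain one row per case.
casewise <- agg[rep(seq_len(nrow(agg)), times = agg$freq), c("v1", "v2", "v3")]
rownames(casewise) <- NULL

nrow(casewise)  # 15 rows with values 1 1 2 and 5 rows with values 2 1 3
```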
diagonal: logical. If TRUE, Gaussian models are fitted restricted to diagonal covariance matrices. Otherwise, covariance matrices are unrestricted. TRUE is consistent with the "within class independence" assumption for the multinomial variables.
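The difference between the two covariance structures can be illustrated in base R. This sketch only contrasts an unrestricted sample covariance matrix with its diagonal restriction; it does not reproduce the estimation inside the EM algorithm.

```r
set.seed(1)
X <- cbind(rnorm(200), rnorm(200))
X[, 2] <- X[, 2] + 0.8 * X[, 1]   # induce correlation between the two variables

full_cov <- cov(X)                # unrestricted: off-diagonal entries are kept
diag_cov <- diag(diag(full_cov))  # diagonal restriction: covariances set to zero
```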
xvarsorted: logical. If TRUE, the continuous variables are assumed to be the first ones, and the categorical variables to come after them.
continuous: vector of integers giving the positions of the continuous variables. If xvarsorted=TRUE, a single integer giving the number of continuous variables.
discrete: vector of integers giving the positions of the categorical variables. If xvarsorted=TRUE, a single integer giving the number of categorical variables.
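The two ways of specifying variable positions can be sketched as follows. The data and column names are made up; the commented calls mirror the documented interface and are not run here.

```r
set.seed(2)
dat <- data.frame(d1 = sample(1:3, 10, replace = TRUE),
                  v1 = rnorm(10),
                  d2 = sample(1:2, 10, replace = TRUE),
                  v2 = rnorm(10))

cont_pos <- c(2, 4)   # positions of the continuous variables
disc_pos <- c(1, 3)   # positions of the categorical variables

# Option 1: reorder so that continuous variables come first, then pass counts.
sorted <- dat[, c(cont_pos, disc_pos)]
# flexmixedruns(sorted, continuous = 2, discrete = 2)

# Option 2 (assumed usage): keep the original order and pass positions.
# flexmixedruns(dat, xvarsorted = FALSE, continuous = cont_pos, discrete = disc_pos)
```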
ppdim: vector of integers specifying the number of categories existing in the data for each categorical variable. If recode=TRUE, this can be omitted and is computed automatically.
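When the vector has to be supplied by hand, the number of observed categories per variable can be counted in base R. A small sketch with invented data; the name ppdim_guess is hypothetical:

```r
cats <- data.frame(d1 = c("a", "b", "a", "c"),
                   d2 = c(1, 2, 2, 1))

# Number of distinct categories that actually occur in each variable.
ppdim_guess <- sapply(cats, function(v) nlevels(factor(v)))
ppdim_guess
```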
initial.cluster: this corresponds to the cluster parameter in flexmix and should only be specified if simruns=1 and n.cluster is a single number. Either a matrix with n.cluster columns of initial cluster membership probabilities for each observation, or a factor or integer vector with the initial cluster assignments of observations at the start of the EM algorithm. Default is random assignment into n.cluster clusters.
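Both accepted forms of an initial clustering can be constructed in base R. The sketch below uses kmeans on the continuous variables as one possible starting point; this choice is an assumption made for illustration, not a recommendation from the package.

```r
set.seed(1)
cont <- cbind(v1 = rnorm(60), v2 = rnorm(60))
k <- 3

# Form 1: integer vector of initial cluster assignments.
init_vec <- kmeans(cont, centers = k)$cluster

# Form 2: matrix with k columns of membership probabilities (here 0/1).
init_mat <- matrix(0, nrow = nrow(cont), ncol = k)
init_mat[cbind(seq_len(nrow(cont)), init_vec)] <- 1
```

Either object could then be passed as initial.cluster together with simruns=1 and n.cluster=k.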
simruns: integer. Number of starts of the EM algorithm with random initialisation; several starts increase the chance of finding a good, ideally global, optimum.
n.cluster: vector of integers, numbers of components (the optimum one is found by minimising the BIC).
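Selecting the number of components by minimising the BIC can be sketched with the usual definition BIC = -2 logLik + p log(n). The log-likelihoods and parameter counts below are invented numbers, not output of flexmixedruns.

```r
n      <- 100                         # number of observations
loglik <- c(-520, -480, -470, -468)   # hypothetical maximised log-likelihoods for k = 1..4
npar   <- c(11, 23, 35, 47)           # hypothetical numbers of free parameters

bic    <- -2 * loglik + npar * log(n)
best_k <- which.min(bic)              # number of components with the smallest BIC
```

In this made-up example the gain in likelihood beyond two components does not compensate for the penalty on the additional parameters.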
verbose: logical. If TRUE, some information about the different runs of the EM algorithm is printed.
recode: logical. If TRUE, the function discrete.recode is applied in order to recode categorical data so that the lcmixed method can use it. Only set this to FALSE if your data already have that format (even in that case, TRUE does no harm). If recode=FALSE, the categorical variables are assumed to be coded 1,2,3,...
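The 1,2,3,... coding expected with recode=FALSE can be illustrated in base R. This is only a sketch of the idea of integer recoding; it is not the implementation of discrete.recode.

```r
d <- c("low", "high", "mid", "low")

f     <- factor(d)       # levels are sorted: "high", "low", "mid"
codes <- as.integer(f)   # integer codes 1, 2, 3 in the order of the levels

levels(f)                # shows which original value belongs to which code
```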
allout: logical. If TRUE, the regular flexmix output is returned for every single number of clusters, which can create a huge output object.
control: list of control parameters for flexmix; for details see the help page of FLXcontrol-class.
silent: logical. This is passed on to the try function. If TRUE, error messages from failed runs of flexmix are suppressed. (The information that a flexmix error occurred is still printed if verbose=TRUE.)
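The behaviour of the silent argument of base R's try can be checked directly. The error message below is made up and only stands in for a failing flexmix run.

```r
# With silent = TRUE, try() does not print the error message,
# but the failure can still be detected from the returned object.
res <- try(stop("degenerate covariance matrix"), silent = TRUE)

failed <- inherits(res, "try-error")              # TRUE for a failed call
msg    <- conditionMessage(attr(res, "condition"))  # the suppressed message
```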
A list with components
optsummary: summary object for the flexmix object with the optimal number of components.
optimalk: optimal number of components.
errcount: vector with numbers of EM runs for each number of components that led to flexmix errors.
flexout: if allout=TRUE, list of flexmix output objects for all numbers of components; for details see the help page of flexmix-class. Slots that can be used include, for example, cluster and components. So if fo is the flexmixedruns output object, fo$flexout[[fo$optimalk]]@cluster gives a component number vector for the observations (maximum posterior rule), and fo$flexout[[fo$optimalk]]@components gives the estimated model parameters, which for lcmixed and therefore flexmixedruns are called
center: mean vector
cov: covariance matrix
pp: list of categorical variable-wise category probabilities
If allout=FALSE, only the flexmix output object for the optimal number of components is returned, i.e., the [[fo$optimalk]] indexing above can then be omitted.
bicvals: vector of values of the BIC for each number of components.
ppdim: vector of categorical variable-wise numbers of categories.
discretelevels: list of levels of the categorical variables belonging to what is treated by flexmixedruns as category 1, 2, 3 etc.
Sometimes flexmix produces errors because of degenerating covariance matrices, too small clusters etc. flexmixedruns tolerates these and treats them as non-optimal runs. (Higher simruns or different control may be required to get a valid solution.)
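The general pattern of tolerating failed runs and keeping the best successful one can be sketched in base R. The function fake_run is a hypothetical stand-in for a single EM run and is not part of fpc or flexmix; the loop is an illustration, not the package's actual code.

```r
# Hypothetical stand-in for one EM run: even-numbered runs fail.
fake_run <- function(i) {
  if (i %% 2 == 0) stop("run failed")
  list(run = i, bic = 1000 + 10 * i)
}

simruns  <- 6
best     <- NULL
errcount <- 0

for (i in seq_len(simruns)) {
  fit <- try(fake_run(i), silent = TRUE)
  if (inherits(fit, "try-error")) {
    errcount <- errcount + 1   # count the failure and move on
    next
  }
  # Keep the successful run with the smallest BIC so far.
  if (is.null(best) || fit$bic < best$bic) best <- fit
}
```

If every run fails, best stays NULL, which is the situation in which more starts or different control settings would be needed.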
General documentation on flexmix can be found in Friedrich Leisch's "FlexMix: A General Framework for Finite Mixture Models and Latent Class Regression in R", http://cran.r-project.org/web/packages/flexmix/vignettes/flexmix-intro.pdf
Hennig, C. and Liao, T. (2013) How to find an appropriate clustering for mixed-type variables with application to socio-economic stratification, Journal of the Royal Statistical Society, Series C Applied Statistics, 62, 309-369.
lcmixed, flexmix, FLXcontrol-class, flexmix-class, discrete.recode.
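library(fpc)  # provides flexmixedruns; fitting relies on the flexmix package
options(digits=3)
set.seed(776655)
# Two continuous and two categorical variables, 100 cases.
v1 <- rnorm(100)
v2 <- rnorm(100)
d1 <- sample(1:5,100,replace=TRUE)
d2 <- sample(1:4,100,replace=TRUE)
ldata <- cbind(v1,v2,d1,d2)
# Continuous variables come first, so only the numbers of variables are given.
fr <- flexmixedruns(ldata,
  continuous=2,discrete=2,simruns=2,n.cluster=2:3,allout=FALSE)
print(fr$optimalk)            # selected number of components
print(fr$optsummary)          # flexmix summary of the selected model
print(fr$flexout@cluster)     # component assignment of each observation
print(fr$flexout@components)  # estimated model parameters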