init.model.params: Initiation of model parameters

Description

Methods for initializing the model parameters for the EM algorithm. Two initialization procedures are implemented. The first is selected by setting the argument method='knowns'. It uses only the labeled observations and is therefore suitable for datasets with a high percentage of labeled cases. The second is selected by setting method='all' and does not take the labeling into account.

Usage

init.model.params(X = NULL, knowns = NULL, class = NULL, 
    k = length(unique(class)), method = "all", B = P, P = NULL)

Arguments

X

a data.frame with the unlabeled observations; rows correspond to observations and columns correspond to variables/data dimensions.

knowns

a data.frame with the labeled observations; rows correspond to observations and columns correspond to variables/data dimensions.

B

a beliefs matrix with the distribution of beliefs for the labeled observations. If not specified and the argument P is given, the beliefs matrix is set to the value of P.

P

a matrix of plausibilities, specified only for the labeled observations. The function assumes that the remaining observations are unlabeled and gives them uniformly distributed plausibilities by default. If not specified and the argument B is given, the plausibilities matrix is set to the value of B.

class

a vector of labels for the labeled observations. If not specified, it is derived from either the argument B or P using the MAP rule.

k

the desired number of model components.

method

the method for parameter initialization, one of c("knowns", "all"); see the section Details.
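
The MAP rule mentioned for the class argument can be illustrated with a small sketch. The matrix toy_B below is made-up data, and taking the row-wise maximum is an assumption about what "MAP rule" means here (the label with the highest belief wins); it is not taken from the package source.

```r
# Toy beliefs matrix (hypothetical data): one row per labeled observation,
# one column per component.
toy_B <- rbind(c(0.9, 0.1, 0.0),
               c(0.2, 0.3, 0.5),
               c(0.1, 0.8, 0.1))

# MAP rule as assumed here: pick the component with the largest belief in each row.
map_class <- apply(toy_B, 1, which.max)
print(map_class)
```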

Value

A list with the following elements:

pi

a vector of length k with the initial values for the mixing proportions.

mu

a matrix with the initial mean vectors for the k components.

cvar

a three-dimensional array with the initial covariance matrices for the k components.

Details

For method='knowns', the initialization is based only on the labeled observations, i.e. those observations which have certain or probable components assigned. The initial model parameters for each component are estimated in one step from the observations that are assigned to this component (as in fully supervised learning).
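
As a rough illustration of the one-step estimation described above, the following base R sketch computes proportions, means and covariances per label. The helper init_from_knowns is hypothetical and is not the package's implementation; the k x d x d layout of cvar is likewise an assumption. Note that a label with a single observation would yield NA covariances in this sketch.

```r
# Hypothetical helper (not part of bgmm): one-step estimates from labeled data only.
init_from_knowns <- function(knowns, class) {
  knowns <- as.matrix(knowns)
  labels <- sort(unique(class))
  k <- length(labels)
  d <- ncol(knowns)
  pi <- numeric(k)
  mu <- matrix(0, nrow = k, ncol = d)
  cvar <- array(0, dim = c(k, d, d))  # assumed layout: component x dim x dim
  for (i in seq_len(k)) {
    rows <- knowns[class == labels[i], , drop = FALSE]
    pi[i] <- nrow(rows) / nrow(knowns)  # share of labeled observations
    mu[i, ] <- colMeans(rows)           # component mean
    cvar[i, , ] <- cov(rows)            # component covariance matrix
  }
  list(pi = pi, mu = mu, cvar = cvar)
}

# Toy data: two labeled groups of 10 observations in two dimensions.
set.seed(1)
toy <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
             matrix(rnorm(20, mean = 5), ncol = 2))
toy_class <- rep(c(1, 2), each = 10)
toy_init <- init_from_knowns(toy, toy_class)
str(toy_init)
```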

If method='all' (default), the initialization is based on all observations. In this case, to obtain the initial set of model components, we start by clustering the data using the k-means algorithm (repeated 10 times to get stable results). The only exception is one-dimensional data. In that case the clusters are identified by dividing the data into k equal subsets of observations, where the subsets are separated by the empirical quantiles 1/(2k), 3/(2k), 5/(2k), ..., (2k-1)/(2k). After this initial clustering, each cluster is linked to one model component and the initial values for the model parameters are derived from the clustered observations.
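
The initial clustering step can be sketched in base R as follows. The helper initial_clusters is hypothetical. Using nstart = 10 is one reading of "repeated 10 times", and the rank-based split for one-dimensional data is a simplified stand-in for the quantile-based division, so neither should be taken as the package's exact procedure.

```r
# Hypothetical helper (not part of bgmm): initial cluster labels for all observations.
initial_clusters <- function(data, k) {
  data <- as.matrix(data)
  if (ncol(data) == 1) {
    # One-dimensional case: split the sorted values into k groups of
    # (nearly) equal size, using ranks instead of explicit quantiles.
    x <- data[, 1]
    ceiling(rank(x, ties.method = "first") * k / length(x))
  } else {
    # General case: k-means with 10 random starts, keeping the best solution.
    kmeans(data, centers = k, nstart = 10)$cluster
  }
}

# One-dimensional toy data: six values split into three groups of two.
one_d <- initial_clusters(matrix(c(5, 1, 9, 3, 7, 11), ncol = 1), 3)
print(one_d)

# Two-dimensional toy data: two well-separated groups of 10 observations.
set.seed(1)
two_d_data <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
                    matrix(rnorm(20, mean = 10), ncol = 2))
two_d <- initial_clusters(two_d_data, 2)
print(table(two_d))
```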

For the partially and semi-supervised methods, matching the labels produced by the initial clustering algorithm to the labels of the observations in the knowns dataset raises a technical problem. The cluster corresponding to component y should be as close as possible to the set of labeled observations with label y.

Note that for the unsupervised modeling this problem is irrelevant and any cluster may be used to initialize any component.

To match the cluster labels with the labels of the model components, a greedy heuristic is used. The heuristic calculates weighted distances between all possible pairs of cluster centers and sets of observations grouped by their labels. In each step, the pair with the minimal distance is chosen (a pair consists of a group of observations with a common label and a cluster; the chosen pair is the one for which the center of the group is closest to the center of the cluster). For the chosen pair, the cluster is given the same label as the group of observations. Then this pair is removed and the heuristic repeats on the reduced set of pairs.
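
The greedy matching can be sketched as below. The function match_clusters_to_labels is hypothetical, and plain Euclidean distance between centers is used in place of the weighted distance, whose weights are not specified here. Clusters that remain unmatched (when there are fewer labels than clusters) are left as NA.

```r
# Hypothetical helper (not part of bgmm): greedy matching of clusters to labels.
# cluster_centers: one row per cluster; label_centers: one row per label group.
match_clusters_to_labels <- function(cluster_centers, label_centers) {
  n_labels <- nrow(label_centers)
  n_clusters <- nrow(cluster_centers)
  # d[i, j]: Euclidean distance between label group i and cluster j
  # (an assumption; the text describes a weighted distance).
  d <- matrix(0, n_labels, n_clusters)
  for (i in seq_len(n_labels)) {
    for (j in seq_len(n_clusters)) {
      d[i, j] <- sqrt(sum((label_centers[i, ] - cluster_centers[j, ])^2))
    }
  }
  # assignment[j] holds the index of the label given to cluster j.
  assignment <- rep(NA_integer_, n_clusters)
  for (step in seq_len(min(n_labels, n_clusters))) {
    best <- which(d == min(d), arr.ind = TRUE)[1, ]
    assignment[best[["col"]]] <- best[["row"]]
    # Remove the chosen pair from further consideration.
    d[best[["row"]], ] <- Inf
    d[, best[["col"]]] <- Inf
  }
  assignment
}

# Toy example: three cluster centers, two labeled groups.
toy_clusters <- rbind(c(5, 5), c(0, 0), c(10, 0))
toy_labels <- rbind(c(0.2, 0.1), c(4.8, 5.3))
toy_match <- match_clusters_to_labels(toy_clusters, toy_labels)
print(toy_match)
```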

References

Przemyslaw Biecek, Ewa Szczurek, Martin Vingron, Jerzy Tiuryn (2012), The R Package bgmm: Mixture Modeling with Uncertain Knowledge, Journal of Statistical Software.

Examples

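# NOT RUN {
library(bgmm)
# load the example dataset shipped with the package
data(genotypes)
# initialize the parameters from all observations (method = "all" is the default)
initial.params = init.model.params(X = genotypes$X, knowns = genotypes$knowns,
                                   class = genotypes$labels)
# inspect the returned list of initial parameters
str(initial.params)
# }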