simulateData: Dataset generation

Description

The function simulateData generates an artificial dataset from a mixture of Gaussian components with a given set of parameters.

Usage

simulateData(d = 2, k = 4, n = 100, m = 10, mu = NULL, cvar = NULL, 
    s.pi = rep(1/k, k), b.min = 0.02, mean = "D", between = "D", 
    within = "D", cov = "D", n.labels = k)

Arguments

d

the dimension of the data set,

k

the number of model components,

n

the total number of observations, both labeled and unlabeled,

mu

a matrix with k rows and d columns, which defines the mean vectors of the corresponding model components. If not specified, its values are generated by default from a normal distribution N(0,49),

cvar

a three-dimensional array with dimensions (k, d, d) holding the covariance matrices of the model components. If not specified, each covariance matrix is generated in three steps: first, 2*d samples are drawn from a d-dimensional normal distribution N(0, Id). Next, the d x d sample covariance matrix of these samples is calculated. Finally, this sample covariance matrix is scaled by a factor drawn from an exponential distribution Exp(1),

s.pi

a vector of k probabilities, i.e. the mixing proportions of the model. The mixing proportions specify a multinomial distribution over the components, from which the numbers of observations in each cluster are generated. By default a uniform distribution is used.

mean, between, within, cov

constraints on the model structure. By default all are equal to "D". If other values are set, the parameters mu and cvar are adjusted to match the specified constraints,

m

the number of labeled observations, i.e. the observations for which the beliefs are to be calculated,

b.min

the belief that an observation does not belong to a component. Formally, the belief bij that observation i belongs to component j is equal to b.min if i was not generated from component j. Thus, the belief that i belongs to its true component is set to 1-b.min*(n.labels-1), and b.min is constrained so that b.min < 1/n.labels. By default b.min=0.02,

n.labels

the number of components used as labels, defining the number of columns in the resulting beliefs matrix. By default n.labels equals k, but the user can specify a smaller number. Using this argument the user can define a scenario in which the data are generated from a mixture of three components, but only two of them are used as labels in the beliefs matrix (applied in the example below).
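The defaults described above can be sketched in base R. This is only an illustration of the description, not the package's own code: the helper random_cov is hypothetical, the seed and the small sizes are arbitrary, N(0,49) is assumed to denote a variance of 49 (standard deviation 7), and the labeled observations are assumed to come only from the components used as labels.

```r
set.seed(1)  # arbitrary seed, only to make the sketch reproducible

d <- 2; k <- 3; n <- 20; m <- 6
n.labels <- 2; b.min <- 0.02
s.pi <- rep(1 / k, k)

# Hypothetical helper: one covariance matrix built in the three steps
# described for cvar.
random_cov <- function(d) {
  Z <- matrix(rnorm(2 * d * d), nrow = 2 * d, ncol = d)  # 2*d draws from N(0, Id)
  S <- cov(Z)                                            # d x d sample covariance
  S * rexp(1, rate = 1)                                  # scale by an Exp(1) draw
}

# Array of dimensions (k, d, d), one covariance matrix per component.
cvar <- array(0, dim = c(k, d, d))
for (j in seq_len(k)) cvar[j, , ] <- random_cov(d)

# Means: k rows, d columns (assumption: 49 is the variance, so sd = 7).
mu <- matrix(rnorm(k * d, mean = 0, sd = 7), nrow = k, ncol = d)

# Cluster sizes: one multinomial draw over the components with proportions s.pi.
sizes <- as.vector(rmultinom(1, size = n, prob = s.pi))

# Beliefs: b.min everywhere, 1 - b.min * (n.labels - 1) for the true component.
stopifnot(b.min < 1 / n.labels)
y <- sample(seq_len(n.labels), m, replace = TRUE)  # assumed true labels
B <- matrix(b.min, nrow = m, ncol = n.labels)
B[cbind(seq_len(m), y)] <- 1 - b.min * (n.labels - 1)
```

With these values every row of B sums to one, and the entries of sizes add up to n. How the package treats a labeled observation generated from a component that is not among the labels is not covered by this sketch.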

Value

A list with the following elements:

X

the matrix with n-m rows and d columns containing the generated values of the unlabeled observations,

knowns

the matrix with m rows and d columns containing the generated values of the labeled observations,

B

the belief matrix with m rows and k columns (n.labels columns when n.labels is smaller than k), derived for the knowns matrix,

model.params

the list of model parameters,

Ytrue

indexes of the true Gaussian components from which each observation was generated. Labels for knowns go first.

References

Przemyslaw Biecek, Ewa Szczurek, Martin Vingron, Jerzy Tiuryn (2012), The R Package bgmm: Mixture Modeling with Uncertain Knowledge, Journal of Statistical Software.

Examples

# NOT RUN {
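 library(bgmm)

 # Three components in two dimensions, but only two of them used as labels
 simulated <- simulateData(d = 2, k = 3, n = 300, m = 60, cov = "0", within = "E", n.labels = 2)
 model <- belief(X = simulated$X, knowns = simulated$knowns, B = simulated$B)
 plot(model)

 # Two components in one dimension, both used as labels
 simulated <- simulateData(d = 1, k = 2, n = 300, m = 60, n.labels = 2)
 model <- belief(X = simulated$X, knowns = simulated$knowns, B = simulated$B)
 plot(model)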
# }
