mixture_generator: Gaussian mixtures dataset generator with regression between the covariates

Description

Generates a dataset (with an additional validation sample) of covariates following Gaussian mixtures, some of which are generated by sub-regressions on others. A response variable is then added by linear regression. This function generates datasets for simulations with CorReg, or simply with Gaussian mixtures.
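The generation scheme described above can be illustrated with a minimal base-R sketch. It is not the package's implementation: the helper rmixture, the equal mixture weights and every numeric value are assumptions chosen only for illustration.

```r
set.seed(1)
n <- 200

# Hypothetical helper: draw n values from a univariate Gaussian mixture
# with equally weighted components.
rmixture <- function(n, means, sds) {
  k <- sample(seq_along(means), n, replace = TRUE)  # component of each draw
  rnorm(n, mean = means[k], sd = sds[k])
}

# Two independent covariates, each following a two-component mixture.
x1 <- rmixture(n, means = c(-3, 3), sds = c(1, 1))
x2 <- rmixture(n, means = c(0, 5), sds = c(1, 2))

# A redundant covariate generated by a sub-regression on x1.
x3 <- 1 + 2 * x1 + rnorm(n, sd = 0.5)

X <- cbind(x1, x2, x3)

# Response generated by linear regression on the covariates
# (first coefficient is the intercept; x3 has a zero coefficient).
A <- c(0.5, 1, -1, 0)
Y <- as.vector(cbind(1, X) %*% A + rnorm(n, sd = 10))
```

In this sketch x3 is almost a linear function of x1, which is the kind of correlation between covariates that the generated datasets are meant to contain.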

Usage

mixture_generator(
  n = 130,
  p = 100,
  ratio = 0.4,
  max_compl = 1,
  valid = 1000,
  positive = 0.6,
  sigma_Y = 10,
  sigma_X = NULL,
  R2 = NULL,
  R2Y = 0.4,
  meanvar = NULL,
  sigmavar = NULL,
  lambda = 3,
  Amax = NULL,
  lambdapois = 10,
  gamma = FALSE,
  gammashape = 1,
  gammascale = 0.5,
  tp1 = 1,
  tp2 = 1,
  tp3 = 1,
  nonlin = 0,
  pnonlin = 2,
  scale = TRUE,
  Z = NULL
)

Arguments

n

the number of individuals in the learning dataset

p

the number of covariates (without the response)

ratio

the ratio of covariates generated by sub-regressions on others

max_compl

the number of covariates in each sub-regression

valid

the number of individuals in the validation sample

positive

the ratio of positive coefficients in both the regression and the sub-regressions

sigma_Y

the standard deviation for the noise of the regression

sigma_X

the standard deviation for the noise of the sub-regressions (all of them). Ignored if gamma=TRUE or if R2 is not NULL

R2

the strength of the sub-regressions (coefficients will be chosen to obtain this value).

R2Y

the strength of the main regression (coefficients will be chosen to obtain this value).

meanvar

vector of means for the covariates.

sigmavar

standard deviation of the covariates.

lambda

parameter of the Poisson distribution that defines the number of components in the Gaussian mixture models

Amax

the maximum number of covariates with non-zero coefficients in the regression

lambdapois

parameter of the Poisson distribution used to generate the coefficients in the sub-regressions

gamma

(boolean) whether to draw sigma_X as a gamma-distributed vector of size p

gammashape

shape parameter of the gamma distribution (if needed)

gammascale

scale parameter of the gamma distribution (if needed)

tp1

the ratio of right-side (explanatory) covariates allowed to have a non-zero coefficient in the regression

tp2

the ratio of left-side (redundant) covariates allowed to have a non-zero coefficient in the regression

tp3

the ratio of strictly independent covariates allowed to have a non-zero coefficient in the regression

nonlin

whether to use a non-linear structure (power or log). If not zero, it is the probability of using the power pnonlin instead of log. The type is drawn for each link between covariates

pnonlin

the power used when a non-linear structure is enabled

scale

(boolean) to scale X before computing Y

Z

the binary square adjacency matrix (size p) to obtain. If NULL, it is randomly generated based on the ratio and max_compl parameters.
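As an illustration of how a structure matrix relates to ratio and max_compl, here is a small base-R sketch that builds a binary adjacency matrix by hand. It uses the convention stated on this page that Z[i,j]=1 when covariate i explains covariate j. The choice of p and of the two links is arbitrary, and whether the function accepts every such matrix is not verified here.

```r
p <- 5

# Start from an empty structure: no covariate explains another.
Z <- matrix(0, nrow = p, ncol = p)

# Hypothetical links: covariate 1 explains covariate 4,
# and covariate 2 explains covariate 5.
Z[1, 4] <- 1
Z[2, 5] <- 1

# Columns with at least one 1 are covariates generated by a sub-regression.
redundant <- which(colSums(Z) > 0)

# Share of covariates generated by sub-regressions (analogue of ratio).
ratio_implied <- length(redundant) / p

# Largest number of explanatory covariates in a sub-regression
# (analogue of max_compl).
max_per_subreg <- max(colSums(Z))
```

A matrix built this way could then presumably be supplied through the Z argument together with a matching p, instead of letting the structure be drawn at random.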

Value

a list that contains:

X_appr

matrix of the learning set. p covariates following Gaussian Mixtures with some of them generated by sub-regressions on others.

Y_appr

Response variable vector (size n) generated by linear regression on X_appr with coefficients A and residual standard deviation sigma_Y.

A

vector of the coefficients of the regression generating Y_appr

B

Matrix of the coefficients of the sub-regressions. The first row holds the intercepts; B[i+1,j] is then the coefficient associated with X_appr[,i] in the sub-regression that generates X_appr[,j]

Z

Binary square adjacency matrix of size p that describes the structure of the sub-regressions. Z[i,j]=1 if X_appr[,i] explains X_appr[,j]

X_test

validation sample generated the same way as X_appr, with valid individuals.

Y_test

Response vector associated to the validation sample

sigma_X

Vector of the standard deviations of the residuals of the sub-regressions (one value for each sub-regression)

sigma_Y

Standard deviation of the residual of the regression that generates Y_appr and Y_test.

nbcomp

vector of the number of components for covariates that are not explained by others.
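To show how the structure and coefficient matrices fit together, the following base-R sketch rebuilds one sub-regression from hand-made Z and B objects. It does not call the package. It assumes that B has p + 1 rows with the intercepts in the first row, so that the coefficient of covariate i sits in row i + 1. All values are hypothetical, and the noise term is left out to keep the check exact.

```r
p <- 3

# Hypothetical structure: covariate 3 is generated from covariate 1.
Z <- matrix(0, nrow = p, ncol = p)
Z[1, 3] <- 1

# Hypothetical coefficient matrix: row 1 holds the intercepts,
# row i + 1 holds the coefficients of covariate i.
B <- matrix(0, nrow = p + 1, ncol = p)
B[1, 3] <- 1  # intercept of the sub-regression generating covariate 3
B[2, 3] <- 2  # coefficient of covariate 1 in that sub-regression

j <- 3
explaining <- which(Z[, j] == 1)  # covariates that explain covariate j
intercept <- B[1, j]
coefs <- B[explaining + 1, j]

# Rebuild covariate j from its explanatory covariates (noise omitted).
set.seed(42)
X <- cbind(rnorm(50), rnorm(50), 0)
X[, j] <- intercept + X[, explaining, drop = FALSE] %*% coefs
```

With real output, the same indexing would be applied to the B and Z elements of the returned list, and the residual standard deviations would come from sigma_X.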

Examples

# NOT RUN {
# dataset generation
library(CorReg)
base <- mixture_generator(n = 250, p = 4, valid = 0)
X_appr <- base$X_appr # learning sample
Y_appr <- base$Y_appr # response variable
# one histogram per covariate
for (i in seq_len(ncol(X_appr))) {
  hist(X_appr[, i])
}

# }
