CorReg (version 1.0.5)

mixture_generator: Gaussian mixtures dataset generator with regression between the covariates

Description

Generates a dataset (with an additional validation sample) of Gaussian-mixture covariates, some of which are generated by sub-regressions on others. A response variable is then added by linear regression. This function is used to generate datasets for simulations with CorReg, or simply with Gaussian mixtures.
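
The generation scheme described above can be illustrated in base R. This is only a minimal sketch under assumed values, not the package's implementation: the mixture parameters, the coefficients and the variable names (x1, x2, x3, comp) are all hypothetical and chosen for illustration.

```r
set.seed(1)
n <- 200

# Independent covariate: two-component Gaussian mixture
# (a component is drawn for each individual)
comp <- sample(1:2, n, replace = TRUE, prob = c(0.5, 0.5))
x1 <- rnorm(n, mean = c(-3, 3)[comp], sd = 1)

# Second independent covariate: a single Gaussian component
x2 <- rnorm(n, mean = 0, sd = 2)

# Redundant covariate: sub-regression on x1 and x2 plus Gaussian noise
x3 <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n, sd = 0.5)

X <- cbind(x1, x2, x3)

# Response: linear regression on the covariates plus noise
# (assumed coefficients; the noise sd plays the role of sigma_Y)
A <- c(1, 0, 2)
sigma_Y <- 10
Y <- as.vector(X %*% A) + rnorm(n, sd = sigma_Y)
```

Here x3 is strongly correlated with x1 by construction, which is the kind of redundancy between covariates the generated datasets are meant to contain.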

Usage

mixture_generator(n = 130, p = 100, ratio = 0.4, max_compl = 1,
  valid = 1000, positive = 0.6, sigma_Y = 10, sigma_X = NULL,
  R2 = NULL, R2Y = 0.4, meanvar = NULL, sigmavar = NULL, lambda = 3,
  Amax = NULL, lambdapois = 10, gamma = FALSE, gammashape = 1,
  gammascale = 0.5, tp1 = 1, tp2 = 1, tp3 = 1, nonlin = 0,
  pnonlin = 2, scale = TRUE, Z = NULL)

Arguments

n
the number of individuals in the learning dataset
p
the number of covariates (without the response)
ratio
the ratio of covariates generated by sub-regressions on others
max_compl
the number of covariates in each sub-regression
valid
the number of individuals in the validation sample
positive
the ratio of positive coefficients in both the regression and the sub-regressions
sigma_Y
the standard deviation for the noise of the regression
sigma_X
the standard deviation of the noise in all the sub-regressions. Ignored if gamma=TRUE or if R2 is not NULL.
R2
the strength of the sub-regressions (coefficients will be chosen to obtain this value).
R2Y
the strength of the main regression (coefficients will be chosen to obtain this value).
meanvar
vector of means for the covariates.
sigmavar
standard deviation of the covariates.
lambda
parameter of the Poisson distribution that defines the number of components in the Gaussian mixture models
Amax
the maximum number of covariates with non-zero coefficients in the regression
lambdapois
parameter of the Poisson distribution used to generate the coefficients in the sub-regressions
gamma
(boolean) if TRUE, sigma_X is generated as a gamma-distributed vector of size p
gammashape
shape parameter of the gamma distribution (if needed)
gammascale
scale parameter of the gamma distribution (if needed)
tp1
the ratio of right-side (explanatory) covariates allowed to have a non-zero coefficient in the regression
tp2
the ratio of left-side (redundant) covariates allowed to have a non-zero coefficient in the regression
tp3
the ratio of strictly independent covariates allowed to have a non-zero coefficient in the regression
nonlin
enables a non-linear structure (power or log) in the sub-regressions. If non-zero, it is the probability of using the power pnonlin instead of the log. The type is drawn for each link between covariates.
pnonlin
the power used when the structure is non-linear
scale
(boolean) to scale X before computing Y
Z
the binary square adjacency matrix (of size p) to obtain. If NULL, it is randomly generated based on the ratio and max_compl parameters.
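
As an illustration of the Z argument, the following sketch builds a small adjacency matrix by hand. The structure chosen is hypothetical, and keeping the explained and explanatory covariates in disjoint sets is an assumption of this sketch rather than something stated on this page. The final call is left commented out because it requires the CorReg package.

```r
p <- 5

# Z[i, j] = 1 means that covariate i explains covariate j
Z <- matrix(0, nrow = p, ncol = p)
Z[1, 4] <- 1  # covariate 1 explains covariate 4
Z[2, 4] <- 1  # covariate 2 explains covariate 4
Z[3, 5] <- 1  # covariate 3 explains covariate 5

# Covariates generated by a sub-regression (non-empty columns)
explained <- which(colSums(Z) > 0)
# Covariates used as predictors in some sub-regression (non-empty rows)
explanatory <- which(rowSums(Z) > 0)

# Passing the structure to the generator (requires CorReg):
# base <- mixture_generator(n = 200, p = p, Z = Z)
```

With Z supplied, the description of the argument indicates that ratio and max_compl are only used when the structure is drawn at random.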

Value

A list that contains:

  • X_appr: matrix of the learning set. p covariates following Gaussian mixtures, with some of them generated by sub-regressions on others.
  • Y_appr: response variable vector (size n) generated by linear regression on X_appr with coefficients A and residual standard deviation sigma_Y.
  • A: vector of the coefficients of the regression generating Y_appr.
  • B: matrix of the coefficients of the sub-regressions. The first row holds the intercepts, so B[i+1,j] is the coefficient associated with X_appr[,i] in the sub-regression that generates X_appr[,j].
  • Z: binary square adjacency matrix of size p that describes the structure of the sub-regressions. Z[i,j]=1 if X_appr[,i] explains X_appr[,j].
  • X_test: validation sample generated the same way as X_appr, with valid individuals.
  • Y_test: response vector associated with the validation sample.
  • sigma_X: vector of the standard deviations of the residuals of the sub-regressions (one value for each sub-regression).
  • sigma_Y: standard deviation of the residual of the regression that generates Y_appr and Y_test.
  • nbcomp: vector of the number of components for the covariates that are not explained by others.
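
The relationship between Z and B can be sketched on toy matrices. The values below are hypothetical and are not package output; the sketch assumes that B has p + 1 rows, with the intercepts in its first row, so that the row index is shifted by one relative to the covariate index.

```r
p <- 3

# Toy structure: covariates 1 and 2 explain covariate 3
Z <- matrix(0, nrow = p, ncol = p)
Z[c(1, 2), 3] <- 1

# Toy coefficient matrix with p + 1 rows (row 1 = intercepts)
B <- matrix(0, nrow = p + 1, ncol = p)
B[1, 3] <- 1         # intercept of the sub-regression generating covariate 3
B[1 + 1, 3] <- 2     # coefficient of covariate 1
B[2 + 1, 3] <- -0.5  # coefficient of covariate 2

# Read one sub-regression back from the two matrices
j <- 3
predictors <- which(Z[, j] == 1)  # covariates that explain covariate j
coefs <- B[predictors + 1, j]     # shift by one to skip the intercept row
intercept <- B[1, j]

# The non-zero pattern of B (without its intercept row) matches Z
Z_from_B <- (B[-1, , drop = FALSE] != 0) * 1
```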

Examples

library(CorReg)
# dataset generation
base <- mixture_generator(n = 1500, p = 10, valid = 0)
X_appr <- base$X_appr  # learning sample
Y_appr <- base$Y_appr  # response variable
# one histogram per covariate to inspect the shape of each mixture
for (i in seq_len(ncol(X_appr))) {
  hist(X_appr[, i])
}