Generates a dataset (with an additional validation sample) made of Gaussian mixtures with some of them generated by sub-regressions on others. A response variable is then added by linear regression. This function is used to generate datasets for simulations using CorReg, or just with Gaussian Mitures.
mixture_generator(
n = 130,
p = 100,
ratio = 0.4,
max_compl = 1,
valid = 1000,
positive = 0.6,
sigma_Y = 10,
sigma_X = NULL,
R2 = NULL,
R2Y = 0.4,
meanvar = NULL,
sigmavar = NULL,
lambda = 3,
Amax = NULL,
lambdapois = 10,
gamma = FALSE,
gammashape = 1,
gammascale = 0.5,
tp1 = 1,
tp2 = 1,
tp3 = 1,
nonlin = 0,
pnonlin = 2,
scale = TRUE,
Z = NULL
)
the number of individuals in the learning dataset
the number of covariates (without the response)
the ratio of covariates generated by sub-regressions on others
the number of covariates in each sub-regression
the number of individuals in the validation sample
the ratio of positive coefficients in both the regression and the sub-regressions
the standard deviation for the noise of the regression
the standard deviation for the noise of the sub-regressions (all). ignored if gamma=TRUE
or if R2
is not NULL
the strength of the sub-regressions (coefficients will be chosen to obtain this value).
the strength of the main regression (coefficients will be chosen to obtain this value).
vector of means for the covariates.
standard deviation of the covariates.
parameter of the Poisson's law that defines the number of components in Gaussian Mixture models
the maximum number of covariates with non-zero coefficients in the regression
parameter used to generate the coefficient in the subregressions. Poisson's distribution.
(boolean) to generate a p-sized vector sigma_X
gamma-distributed
shape parameter of the gamma distribution (if needed)
scale parameter of the gamma distribution (if needed)
the ratio of right-side (explicative) covariates allowed to have a non-zero coefficient in the regression
the ratio of left-side (redundant) covariates allowed to have a non-zero coefficient in the regression
the ratio of strictly independent covariates allowed to have a non-zero coefficient in the regression
to use non linear structure (squared or log). If not null, it is the proba to use power pnonlin instead of log. The type is drawn for each link between covariates
the power used if non linear structure
(boolean) to scale X before computing Y
the binary squared adjacency matrix (size p) to obtain. If NULL it is randomly generated, based on ratio
and max_compl
parameters.
a list that contains:
matrix of the learning set. p
covariates following Gaussian Mixtures with some of them generated by sub-regressions on others.
Response variable vector (size n
) generated by linear regression on X_appr
with coefficients A
and residual standard deviation sigma_Y
.
vector of the of the regression generating Y_appr
Matrix of the coefficients of sub-regressions (first line: the intercepts) then B[i-1,j]
is the coefficient associated to X_appr[,i]
in the sub-regression that generates X_appr[,j]
Binary squared adjacency matrix of size p
that describes the structure of sub-regressions. Z[i,j]
=1 if X_appr[,i]
explains X_appr[,j]
validation sample generated the same way as X_appr
, with valid
individuals.
Response vector associated to the validation sample
Vector of the standard deviations of the residuals of the sub-regressions (one value for each sub-regression)
Standard deviation of the residual of the regression that generates Y_appr
and Y_test
.
vector of the number of components for covariates that are not explained by others.
# NOT RUN {
# dataset generation
base = mixture_generator(n = 250, p = 4, valid = 0)
X_appr = base$X_appr # learning sample
Y_appr = base$Y_appr # response variable
for (i in 1:ncol(X_appr)) {
hist(X_appr[, i])
}
# }
Run the code above in your browser using DataLab