twoClassSim(n = 100, intercept = -5,
linearVars = 10, noiseVars = 0, corrVars = 0,
corrType = "AR1", corrValue = 0, mislabel = 0)
The data are simulated in sets. In the first set, two predictors (denoted here as A and B) are created with a correlation of about 0.65. They change the log-odds using main effects and an interaction:
intercept - 4A + 4B + 2AB
The intercept is a parameter for the simulation and can be used to control the amount of class imbalance.
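A minimal base-R sketch of this first set may help. It assumes unit variances for A and B (the text gives only the correlation), and the helper name `first_set_logodds` is hypothetical, not part of caret:

```r
# Sketch only: unit variances are an assumption; the text states just the correlation.
set.seed(1)
n <- 1000
rho <- 0.65
sigma <- matrix(c(1, rho, rho, 1), nrow = 2)

# Draw correlated normals via the Cholesky factor of sigma
z <- matrix(rnorm(n * 2), ncol = 2)
ab <- z %*% chol(sigma)
A <- ab[, 1]
B <- ab[, 2]

# Hypothetical helper: log-odds contribution of the first set
first_set_logodds <- function(A, B, intercept = -5) {
  intercept - 4 * A + 4 * B + 2 * A * B
}

lp1 <- first_set_logodds(A, B)
cor(A, B)  # close to 0.65 for large n
```

Raising `intercept` shifts every log-odds value up by the same amount, which is how it changes the class balance.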
The second set of effects is linear, with coefficients that alternate signs and have values between 2.5 and 0.25. For example, if there were six predictors in this set, their contribution to the log-odds would be
-2.50C + 2.05D - 1.60E + 1.15F - 0.70G + 0.25H
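The six coefficients in this example can be reproduced with evenly spaced magnitudes and alternating signs. The spacing rule below is inferred from the example rather than taken from caret's source, and drawing the predictors as standard normals is an assumption:

```r
# Magnitudes evenly spaced from 2.5 down to 0.25, signs alternating (starting negative)
linear_vars <- 6
magnitudes <- seq(2.5, 0.25, length.out = linear_vars)
signs <- rep(c(-1, 1), length.out = linear_vars)
coefs <- signs * magnitudes
names(coefs) <- LETTERS[3:8]  # C through H
coefs

# Log-odds contribution, assuming standard normal predictors
set.seed(2)
X <- matrix(rnorm(100 * linear_vars), ncol = linear_vars)
lp2 <- drop(X %*% coefs)
```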
The third set is a nonlinear function of a single predictor, called J here, that ranges over [0, 1]:
(J^3) + 2exp(-6(J-0.3)^2)
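Written as an R function, the nonlinear term looks like the sketch below. The name `nonlinear_term` is hypothetical, and drawing J from a uniform distribution on [0, 1] is an assumption based on the stated range:

```r
# Hypothetical helper for the third set of effects
nonlinear_term <- function(J) {
  J^3 + 2 * exp(-6 * (J - 0.3)^2)
}

set.seed(3)
J <- runif(100)  # assumption: uniform on [0, 1]
lp3 <- nonlinear_term(J)

nonlinear_term(0.3)  # 0.3^3 + 2 * exp(0) = 2.027
```

The exponential part peaks at J = 0.3, so the contribution is largest in the lower-middle part of the range.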
The fourth set of informative predictors is copied from one of Friedman's systems and uses two more predictors (K and L):
2sin(KL)
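The same term as a short R sketch. The name `friedman_term` is hypothetical, and the uniform draws for K and L are an assumption made only for illustration:

```r
# Hypothetical helper for the fourth set of effects
friedman_term <- function(K, L) {
  2 * sin(K * L)
}

set.seed(4)
K <- runif(100)  # assumption: uniform on [0, 1]
L <- runif(100)  # assumption: uniform on [0, 1]
lp4 <- friedman_term(K, L)
range(lp4)       # bounded by -2 and 2 by construction
```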
All of these effects are added up to model the log-odds. The log-odds are used to calculate the probability of a sample being in the first class, and a random uniform number is used to assign the actual class. To mislabel the data, the probability is reversed (i.e. p = 1 - p) before the random number generation.
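The assignment step can be sketched in base R as follows. The helper `assign_class`, the class labels, and the way the mislabeled subset is chosen (a random sample of roughly `mislabel` times n rows) are assumptions for illustration, not a description of caret's internals:

```r
# Hypothetical helper: turn log-odds into class labels
assign_class <- function(lp, mislabel = 0) {
  p <- plogis(lp)  # inverse logit: probability of the first class

  # Assumption: `mislabel` is a proportion; reverse p for a random subset of rows
  n_flip <- floor(length(p) * mislabel)
  flip <- sample.int(length(p), n_flip)
  p[flip] <- 1 - p[flip]

  # A uniform draw at or below p assigns the first class
  labels <- ifelse(runif(length(p)) <= p, "Class1", "Class2")
  factor(labels, levels = c("Class1", "Class2"))
}

set.seed(5)
lp <- rnorm(200)
cls <- assign_class(lp, mislabel = 0.1)
table(cls)
```

Because the reversal happens before the uniform draw, a mislabeled sample is not guaranteed to change class; it is only more likely to land in the class its predictors argue against.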
The user can also add non-informative predictors to the data. These are random standard normal predictors and can optionally be added in two ways: a specified number of independent predictors, or a set number of predictors that follow a particular correlation structure. The only two correlation structures that have been implemented are compound symmetry (exchangeable) and first-order autoregressive, AR(1). For example, with five predictors under the AR(1) structure, if r was the correlation parameter, the between-predictor correlation matrix would be
| 1                     sym |
| r    1                    |
| r^2  r    1               |
| r^3  r^2  r    1          |
| r^4  r^3  r^2  r    1     |
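The matrix above has entry r^|i - j| in row i and column j. A small base-R sketch builds it and draws correlated noise predictors from it; the names `ar1_corr` and `noise` are hypothetical:

```r
# Hypothetical helper: AR(1) correlation matrix with entries r^|i - j|
ar1_corr <- function(p, r) {
  r^abs(outer(seq_len(p), seq_len(p), "-"))
}

R <- ar1_corr(5, 0.5)
R

# Draw correlated standard normal noise predictors via the Cholesky factor
set.seed(6)
z <- matrix(rnorm(500 * 5), ncol = 5)
noise <- z %*% chol(R)
round(cor(noise), 2)  # roughly reproduces R
```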
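library(caret)  # also attaches lattice, which provides splom()
example <- twoClassSim(100, linearVars = 1)
# Scatterplot matrix of the first six columns, grouped by class
splom(~example[, 1:6], groups = example$Class)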