lm.fsreg: Variable selection in linear regression models with forward selection

Description

Variable selection in linear regression models with forward selection

Usage

lm.fsreg(target, dataset, threshold = 0.05, stopping = "BIC", tol = 2, 
robust = FALSE, ncores = 1)

Arguments

target

The class variable. Provide either a string, an integer, a numeric value, a vector, a factor, an ordered factor or a Surv object. See also Details.

dataset

The dataset; provide either a data frame or a matrix (columns = variables, rows = samples). In either case, only two cases are available: either all data are continuous, or all data are categorical.

threshold

Threshold (suitable values in [0,1]) for assessing the significance of p-values. Default value is 0.05.

stopping

The stopping rule. The BIC ("BIC") or the adjusted $R^2$ ("adjrsq") can be used.

tol

The difference between two successive values of the stopping rule. By default this is set to 2. If, for example, the BIC difference between two successive models is less than 2, the process stops and the last variable, even though significant, does not e

robust

A boolean variable which indicates whether (TRUE) or not (FALSE) to use a robust version of the statistical test if it is available. It takes more time than a non robust version but it is suggested in case of outliers. Default value is FALSE.

ncores

How many cores to use. This plays an important role if you have tens of thousands of variables, or really large sample sizes together with tens of thousands of variables and a regression-based test which requires numerical optimisation. In other cases it will not m
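
As an illustration of how threshold, stopping = "BIC" and tol interact, the following is a minimal base R sketch. It is not the source of lm.fsreg: the function name forward_bic_sketch is hypothetical, the partial F test is an assumed choice of significance test, and only the BIC stopping rule is shown. A candidate enters only if its p-value is below threshold and the BIC improves by at least tol.

```r
# Hypothetical helper (not part of the package): forward selection for lm()
# with a p-value threshold and a BIC-improvement tolerance.
forward_bic_sketch <- function(target, dataset, threshold = 0.05, tol = 2) {
  dataset <- as.data.frame(dataset)
  selected <- integer(0)
  current <- lm(target ~ 1)                 # start from the intercept-only model
  current_bic <- BIC(current)
  repeat {
    candidates <- setdiff(seq_len(ncol(dataset)), selected)
    if (length(candidates) == 0) break
    # p-value of the partial F test for adding each remaining variable
    pvals <- vapply(candidates, function(j) {
      fit <- lm(target ~ ., data = dataset[, c(selected, j), drop = FALSE])
      anova(current, fit)[2, "Pr(>F)"]
    }, numeric(1))
    if (min(pvals) >= threshold) break      # no significant candidate left
    best <- candidates[which.min(pvals)]
    fit <- lm(target ~ ., data = dataset[, c(selected, best), drop = FALSE])
    new_bic <- BIC(fit)
    if (current_bic - new_bic < tol) break  # BIC gain below tol: reject and stop
    selected <- c(selected, best)
    current <- fit
    current_bic <- new_bic
  }
  selected                                  # column indices in order of entry
}

# Small simulated check: columns 2 and 5 carry the signal
set.seed(1)
x <- matrix(rnorm(200 * 8), ncol = 8)
y <- 3 * x[, 2] + 2 * x[, 5] + rnorm(200)
sel <- forward_bic_sketch(y, x)
```

The returned indices are in order of entry, so sel can be compared with the variables that were used to simulate y.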

Value

The output of the algorithm is an S3 object including:

mat

A matrix with the variables and their latest test statistics and p-values.

info

A matrix with the selected variables, their p-values and test statistics. Each row corresponds to a model which contains the variables up to that line. The BIC in the last column is the BIC of that model.

models

The regression models, one at each step.

final

The final regression model.

runtime

The run time of the algorithm. A numeric vector. The first element is the user time, the second element is the system time and the third element is the elapsed time.
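
The user, system and elapsed layout of the runtime element matches what base R's proc.time() reports. The sketch below assumes, without confirmation from the source, that the timing is obtained as a difference of two proc.time() calls; the lm() fit on the built-in cars data is only a stand-in for the real computation.

```r
# Assumption: runtime is measured as a difference of proc.time() readings.
start <- proc.time()
fit <- lm(dist ~ speed, data = cars)   # stand-in for the actual selection run
runtime <- (proc.time() - start)[1:3]  # keep user, system and elapsed time
names(runtime)
```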

Details

If the current 'test' argument is defined as NULL or "auto" and the user_test argument is NULL then the algorithm automatically selects the best test based on the type of the data. Particularly:

If target is a factor, the multinomial or the binary logistic regression is used. If the target has only two values, binary logistic regression is used.
If target is an ordered factor, ordinal logistic regression is used.
If target is a numerical vector and the dataset is a matrix or a data.frame with continuous variables, the Fisher conditional independence test is used. If the dataset is a data.frame and there are categorical variables, linear regression is used.
If target is discrete numerical (counts), the Poisson regression conditional independence test is used. If there are only two values, binary logistic regression is used.
If target is a Surv object, the survival conditional independence test is used.
If target is a matrix with at least 2 columns, multivariate linear regression is used.
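
The rules above can be read as a dispatch on the type of target. The sketch below is only an illustration in base R: choose_test_sketch is a hypothetical name, it returns a descriptive label rather than running any test, and the way counts are detected (non-negative whole numbers) is an assumption, not the package's rule.

```r
# Hypothetical helper (not part of the package): maps the type of target to
# a label for the test described in the rules above.
choose_test_sketch <- function(target, dataset = NULL) {
  # Surv objects are matrices too, so they are checked first
  if (inherits(target, "Surv")) return("survival regression")
  if (is.matrix(target) && ncol(target) >= 2) return("multivariate linear regression")
  # ordered factors are also factors, so they are checked before plain factors
  if (is.ordered(target)) return("ordinal logistic regression")
  if (is.factor(target)) {
    if (nlevels(target) == 2) return("binary logistic regression")
    return("multinomial logistic regression")
  }
  if (is.numeric(target)) {
    # Assumed heuristic: counts are non-negative whole numbers
    is_count <- all(target >= 0 & target == round(target))
    if (is_count) {
      if (length(unique(target)) == 2) return("binary logistic regression")
      return("Poisson regression")
    }
    has_categorical <- is.data.frame(dataset) &&
      any(vapply(dataset, is.factor, logical(1)))
    if (has_categorical) return("linear regression")
    return("Fisher test")
  }
  stop("unsupported target type")
}

choose_test_sketch(factor(c("a", "b", "a")))
choose_test_sketch(c(0.5, 1.2, -0.3), matrix(rnorm(6), ncol = 2))
```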

References

Tsamardinos I., Aliferis C. F. and Statnikov, A. (2003). Time and sample efficient discovery of Markov blankets and direct causal relations. In Proceedings of the 9th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 673-678).

Examples

set.seed(123)
#require(gRbase) #for faster computations in the internal functions
library(MXM)  # package that provides lm.fsreg and fs.reg
require(hash)

#simulate a dataset with continuous data
dataset <- matrix( runif(1000 * 50, 1, 100), ncol = 50 )

#define a simulated target that depends on columns 10, 20 and 30
target <- 3 * dataset[, 10] + 2 * dataset[, 20] + 3 * dataset[, 30] + rnorm(1000, 0, 5)

#forward selection with linear regression
a <- lm.fsreg(target, dataset, threshold = 0.05, stopping = "BIC", tol = 2,
              robust = FALSE, ncores = 1)

#generic forward selection, here with the robust version of the test
b <- fs.reg(target, dataset, threshold = 0.05, test = NULL, stopping = "BIC", tol = 2,
            robust = TRUE, ncores = 1)
