This function wraps MADlib's elastic net regularization for generalized linear models. Currently, linear and logistic regressions are supported.
madlib.elnet(formula, data, family = c("gaussian", "linear", "binomial",
"logistic"), na.action = NULL, na.as.level = FALSE, alpha = 1, lambda = 0.1,
standardize = TRUE, method = c("fista", "igd", "sgd", "cd"),
control = list(), glmnet = FALSE, ...)
A formula object (or one that can be coerced to that class) which specifies the dependent and independent variables.
A string which indicates which form of regression to apply. The default value is "gaussian". The accepted values are: "gaussian" or "linear" for linear regression; "binomial" or "logistic" for logistic regression. Support for other families will be added in the future.
A string which indicates what should happen when the data contain NA values. Possible values include "na.omit", "na.exclude", "na.fail" and NULL. Right now, only "na.omit" has been implemented. When the value is NULL, nothing is done on the R side and NA values are filtered on the MADlib side. A user-defined na.action function is also allowed.
A logical value, default is FALSE. Whether to treat an NA value as a level in a categorical variable or to simply ignore it.
A numeric value in [0, 1], the elastic net mixing parameter. The penalty is defined as

(1-alpha)/2 * ||beta||_2^2 + alpha * ||beta||_1.

alpha = 1 is the lasso penalty, and alpha = 0 the ridge penalty.
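The penalty and its two limiting cases can be sketched in a few lines of plain Python (an illustration of the formula only, independent of the R/MADlib implementation):

```python
# Elastic net penalty: (1 - alpha)/2 * ||beta||_2^2 + alpha * ||beta||_1
def elnet_penalty(beta, alpha):
    l1 = sum(abs(b) for b in beta)          # lasso part
    l2sq = sum(b * b for b in beta)         # ridge part (squared L2 norm)
    return (1 - alpha) / 2 * l2sq + alpha * l1

beta = [0.5, -1.0, 0.0, 2.0]
print(elnet_penalty(beta, 1.0))  # alpha = 1: pure lasso penalty, 3.5
print(elnet_penalty(beta, 0.0))  # alpha = 0: pure ridge penalty, 2.625
```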
A positive numeric value, the regularization parameter.
A logical, default: TRUE. Whether to normalize the data. Setting this to TRUE usually yields better results and faster convergence.
A string, default: "fista". Name of the optimizer: "fista", "igd"/"sgd" or "cd". "fista" is the fast iterative shrinkage-thresholding algorithm [1], "igd"/"sgd" implements the stochastic gradient descent algorithm [2], and "cd" implements the coordinate descent algorithm [5].
A list, which contains the control parameters for the optimizers.
(1) If method is "fista", the allowed control parameters are:
- max.stepsize
A numeric value, default is 4.0. Initial backtracking step size. At each iteration, the algorithm first tries stepsize = max.stepsize, and if it does not work, it then tries a smaller step size, stepsize = stepsize/eta, where eta must be larger than 1. At first glance this seems to perform repeated iterations within a single step, but using a larger step size actually greatly increases the computation speed and reduces the total number of iterations. A careful choice of max.stepsize can decrease the computation time by more than a factor of 10.
- eta
A numeric value, default is 2. If the current stepsize does not work, stepsize/eta is tried. Must be greater than 1.
- use.active.set
A logical value, default is FALSE. If use.active.set is TRUE, an active-set method is used to speed up the computation. Considerable speedup is obtained by organizing the iterations around the active set of features -- those with nonzero coefficients. After a complete cycle through all the variables, we iterate only on the active set until convergence. If another complete cycle does not change the active set, we are done; otherwise the process is repeated.
- activeset.tolerance
A numeric value, default is the value of the tolerance argument (see below). The value of tolerance used during active set calculation.
- random.stepsize
A logical value, default is FALSE. Whether to add some randomness to the step size. Sometimes this can speed up the calculation.
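The interplay between max.stepsize and eta described above amounts to a simple backtracking loop. Here is a minimal Python sketch of that loop (illustrative only; `step_is_acceptable` is a hypothetical stand-in for FISTA's sufficient-decrease test, which is not shown here):

```python
def backtrack_stepsize(max_stepsize, eta, step_is_acceptable):
    """Start from max_stepsize; shrink by a factor of eta (> 1)
    until the candidate step passes the acceptance test."""
    stepsize = max_stepsize
    while not step_is_acceptable(stepsize):
        stepsize /= eta
    return stepsize

# Example: pretend any step <= 1.5 is acceptable; with the defaults
# max.stepsize = 4 and eta = 2, the loop tries 4 -> 2 -> 1.
print(backtrack_stepsize(4.0, 2.0, lambda s: s <= 1.5))  # prints 1.0
```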
(2) If method is "sgd", the allowed control parameters are:
- stepsize
A numeric value, default is 0.01. The step size used by the SGD algorithm.
- step.decay
A numeric value. The actual stepsize used for the current step is (previous stepsize) / exp(step.decay). The default value is 0, which means that a constant stepsize is used in SGD.
- threshold
A numeric value, default is 1e-10. When a coefficient is very small, it is set to 0.
Due to the stochastic nature of SGD, the fitted coefficients are almost never exactly zero, only very small. Therefore, threshold is needed at the end of the computation to screen out tiny values and hard-set them to zero. This is accomplished as follows: (1) multiply each coefficient by the standard deviation of the corresponding feature; (2) compute the average of the absolute values of the rescaled coefficients; (3) divide each rescaled coefficient by the average, and if the resulting absolute value is smaller than threshold, set the original coefficient to zero.
- parallel
A logical value, the default is TRUE. Whether to run the computation on multiple segments.
SGD is a sequential algorithm by nature. When running in a distributed manner, each segment of the data runs its own SGD model, and the models are then averaged to produce the model for each iteration. This averaging might slow down convergence, although it provides the ability to process large datasets on multiple machines. The parallel option therefore lets you choose whether to do the computation in parallel.
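The three-step screening procedure for threshold can be written out directly. The following Python sketch is only an illustration of those steps, not the MADlib code:

```python
import statistics

def screen_coefficients(coef, feature_sd, threshold=1e-10):
    """Hard-set tiny SGD coefficients to zero:
    (1) rescale each coefficient by its feature's standard deviation,
    (2) average the absolute rescaled values,
    (3) zero out coefficients whose rescaled magnitude / average < threshold."""
    rescaled = [c * s for c, s in zip(coef, feature_sd)]
    avg = statistics.mean(abs(r) for r in rescaled)
    if avg == 0:
        return list(coef)
    return [0.0 if abs(r) / avg < threshold else c
            for c, r in zip(coef, rescaled)]

coef = [0.8, 1e-14, -0.3]
sds = [1.0, 1.0, 1.0]
print(screen_coefficients(coef, sds))  # the tiny middle coefficient is zeroed
```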
(3) The common control parameters for both "fista" and "sgd" optimizers:
- max.iter
An integer, default is 100. The maximum number of iterations that are allowed.
- tolerance
A numeric value, default is 1e-4. The criterion for ending the iterations; roughly, 1e-4 will produce results with 4 significant digits. Both the "fista" and "sgd" optimizers compute the average difference between the coefficients of two consecutive iterations, and the computation stops when the difference is smaller than tolerance or the iteration number exceeds max.iter.
- warmup
A logical value, default is FALSE. If warmup is TRUE, a series of lambda values, which is strictly descending and ends at the lambda value that the user wants to calculate, is used. A larger lambda gives a very sparse solution, and that sparse solution is in turn used as the initial guess for the next lambda's solution, which speeds up the computation for the next lambda. For larger data sets, this can sometimes accelerate the whole computation and may be faster than computing with only one lambda value.
- warmup.lambdas
A vector of numeric values, default is NULL. The lambda value series to use when warmup is TRUE. The default NULL means that the lambda values will be automatically generated, starting from a large value and ending at the lambda argument value.
- warmup.lambda.no
An integer, default is 15. How many lambdas are used in the warm-up. If warmup.lambdas is not NULL, this value is overridden by the number of provided lambda values.
- warmup.tolerance
A numeric value, the tolerance used during the warmup. The default is the same as the tolerance value.
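One plausible way to build such a strictly descending warmup series is log-spacing between a large starting value and the target lambda. This Python sketch is purely illustrative; the actual MADlib generation scheme and the assumed `start_factor` multiplier may differ:

```python
import math

def warmup_lambdas(target_lambda, n=15, start_factor=1000.0):
    """Generate a strictly descending, log-spaced series of n lambda values
    starting at start_factor * target_lambda and ending at target_lambda."""
    log_hi = math.log(start_factor * target_lambda)
    log_lo = math.log(target_lambda)
    return [math.exp(log_hi + (log_lo - log_hi) * i / (n - 1))
            for i in range(n)]

seq = warmup_lambdas(0.1, n=5)
print([round(s, 4) for s in seq])  # descending series ending at the target 0.1
```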
(4) The control parameters for the "cd" optimizer include max.iter, tolerance, use.active.set and verbose for family = "gaussian", and max.iter, tolerance, use.active.set, verbose, warmup and warmup.lambda.no for family = "binomial". All of these parameters have been explained above, except verbose:
- verbose
A logical value, whether to output the warning message for the "cd" optimizer. See the note section for details.
A logical value, default is FALSE. The R package glmnet states that "Note also that for gaussian, glmnet standardizes y to have unit variance before computing its lambda sequence (and then unstandardizes the resulting coefficients); if you wish to reproduce/compare results with other software, best to supply a standardized y." So a user who wants to compare the result of this function with that of glmnet can set this value to TRUE, which tells this function to do the same data transformation as glmnet in the "gaussian" case so that the results are easy to compare. When family = "binomial", this parameter is ignored.
More arguments, currently not implemented.
An object of class elnet.madlib, which is actually a list that contains the following items:
A vector, the fitting coefficients.
A numeric value, the intercept.
A numeric value, which is used to scale the dependent values. In the "gaussian" case, it is 1 if glmnet is FALSE, and it is the standard deviation of the dependent variable if glmnet is TRUE.
A numeric value, the log-likelihood of the fitting result.
The standardize value in the arguments.
An integer, the iteration number used.
A string. The independent variables in an array format string.
A terms object, describing the terms in the model formula.
A db.data.frame object, which wraps the result table of this function. When method = "cd", there is no result table, because all the results are kept on the R side.
A language object. The function call that generates this result.
The alpha value in the arguments.
The lambda value in the arguments.
The method string in the arguments.
The family string in the arguments.
An array of strings, the same length as the number of independent variables. The strings are used to print a clean result, especially when we are dealing with factor variables, where the dummy variable names can be very long due to the insertion of a random string to avoid naming conflicts; see as.factor,db.obj-method for details. The list also contains dummy and dummy.expr, which are also used for processing the categorical variables, but do not contain any important information.
The max.iter and tolerance values in the control list.
The objective function for "gaussian" is

1/2 * RSS/nobs + lambda * penalty,

and for the other models it is

-loglik/nobs + lambda * penalty.
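As an illustration of the "gaussian" objective (a self-contained Python sketch of the formula above, not the package's own code; the variable names are chosen for illustration):

```python
def gaussian_objective(y, y_hat, beta, lam, alpha):
    """1/2 * RSS/nobs + lambda * penalty, with the elastic net penalty
    (1 - alpha)/2 * ||beta||_2^2 + alpha * ||beta||_1."""
    nobs = len(y)
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    penalty = ((1 - alpha) / 2 * sum(b * b for b in beta)
               + alpha * sum(abs(b) for b in beta))
    return 0.5 * rss / nobs + lam * penalty

y = [1.0, 2.0, 3.0]
y_hat = [1.1, 1.9, 3.2]      # fitted values
beta = [0.5, -0.25]          # fitted coefficients (intercept excluded)
print(gaussian_objective(y, y_hat, beta, lam=0.05, alpha=0.2))
```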
[1] Beck, A. and M. Teboulle (2009), A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. on Imaging Sciences 2(1), 183-202.
[2] Shai Shalev-Shwartz and Ambuj Tewari, Stochastic Methods for l1 Regularized Loss Minimization. Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada, 2009.
[3] Elastic net regularization. http://en.wikipedia.org/wiki/Elastic_net_regularization
[4] Kevin P. Murphy, Machine Learning: A Probabilistic Perspective, The MIT Press, Chap 13.4, 2012.
[5] Jerome Friedman, Trevor Hastie and Rob Tibshirani, Regularization Paths for Generalized Linear Models via Coordinate Descent, Journal of Statistical Software, Vol. 33(1), 2010.
generic.cv does k-fold cross-validation. See the examples there for how to use elastic net together with cross-validation.
# NOT RUN {
## set up the database connection
## Assume that .port is port number and .dbname is the database name
cid <- db.connect(port = .port, dbname = .dbname, verbose = FALSE)
x <- matrix(rnorm(100*20),100,20)
y <- rnorm(100, 0.1, 2)
dat <- data.frame(x, y)
delete("eldata")
z <- as.db.data.frame(dat, "eldata", conn.id = cid, verbose = FALSE)
fit <- madlib.elnet(y ~ ., data = z, alpha = 0.2, lambda = 0.05, control
= list(random.stepsize=TRUE))
fit
lk(mean((z$y - predict(fit, z))^2)) # mean square error
fit <- madlib.elnet(y ~ ., data = z, alpha = 0.2, lambda = 0.05, method = "cd")
fit
db.disconnect(cid, verbose = FALSE)
# }