quanteda.textmodels (version 0.9.9)

textmodel_lr: Logistic regression classifier for texts

Description

Fits a fast penalized maximum likelihood estimator to predict discrete categories from sparse dfm objects. Using the glmnet package, the function computes the regularization path for the lasso or elastic-net penalty over a grid of values of the regularization parameter lambda, and selects the value automatically by cross-validation on several folds of the data at estimation time.
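
A minimal sketch of the estimation step, assuming dfmat is a dfm of training documents and y is a factor of labels aligned to its rows (both placeholders here); no lambda grid needs to be supplied, since it is cross-validated internally:

library("quanteda.textmodels")
## lambda is chosen by cv.glmnet() at estimation time
tmod <- textmodel_lr(dfmat, y)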

Usage

textmodel_lr(x, y, ...)

Value

an object of class textmodel_lr, a list containing:

  • x, y the input model matrix and input training class labels

  • algorithm character; the type and family of logistic regression model used in calling cv.glmnet()

  • type the model type associated with algorithm

  • classnames the levels of training classes in y

  • lrfitted the fitted model object from cv.glmnet()

  • call the model call
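
A sketch of inspecting these components, assuming tmod is a fitted textmodel_lr object (e.g. tmod <- textmodel_lr(dfmat, y)); lambda.min is a field of the cv.glmnet object stored in lrfitted:

tmod$classnames           ## levels of the training classes
tmod$algorithm            ## glmnet family/penalty used in cv.glmnet()
tmod$lrfitted$lambda.min  ## lambda selected by cross-validation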

Arguments

x

the dfm on which the model will be fit. Does not need to contain only the training documents.

y

vector of training labels associated with each training document in x. (These will be converted to factors if not already factors.)

...

additional arguments passed to cv.glmnet()
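
A sketch of passing cv.glmnet() options through ...; alpha and nfolds are cv.glmnet() arguments (alpha = 1 gives the lasso penalty, alpha = 0 gives ridge), and dfmat and y are placeholder training data:

tmod_ridge <- textmodel_lr(dfmat, y, alpha = 0, nfolds = 5)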

References

Friedman, J., Hastie, T., & Tibshirani, R. (2010). Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software, 33(1), 1-22. doi:10.18637/jss.v033.i01

See Also

cv.glmnet(), predict.textmodel_lr(), coef.textmodel_lr()

Examples

## Example from 13.1 of _An Introduction to Information Retrieval_
library("quanteda")
corp <- corpus(c(d1 = "Chinese Beijing Chinese",
                 d2 = "Chinese Chinese Shanghai",
                 d3 = "Chinese Macao",
                 d4 = "Tokyo Japan Chinese",
                 d5 = "London England Chinese",
                 d6 = "Chinese Chinese Chinese Tokyo Japan"),
               docvars = data.frame(train = factor(c("Y", "Y", "Y", "N", "N", NA))))
dfmat <- dfm(tokens(corp), tolower = FALSE)

## simulate bigger sample as classification on small samples is problematic
set.seed(1)
dfmat <- dfm_sample(dfmat, 50, replace = TRUE)

## train model
(tmod1 <- textmodel_lr(dfmat, docvars(dfmat, "train")))
summary(tmod1)
coef(tmod1)

## predict probability and classes
predict(tmod1, type = "prob")
predict(tmod1)
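
## a follow-up sketch: predicting classes for unseen documents, assuming that
## predict.textmodel_lr() accepts a `newdata` dfm as other quanteda.textmodels
## predict methods do; dfm_match() aligns the new features with the training dfm
dfmat_new <- dfm(tokens(c(d7 = "Chinese Chinese Tokyo")), tolower = FALSE)
dfmat_new <- dfm_match(dfmat_new, features = featnames(dfmat))
predict(tmod1, newdata = dfmat_new)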
