quanteda (version 0.99.12)

textmodel: fit a text model

Description

Fit a text model to a dfm. Creates an object of virtual class textmodel_fitted-class, whose exact properties (slots and methods) will depend on which model was called (see model types below).

Usage

textmodel(x, y = NULL, data = NULL, model = c("wordscores", "NB",
  "wordfish", "ca"), ...)

# S4 method for dfm,ANY,missing,character
textmodel(x, y = NULL, data = NULL, model = c("wordscores", "NB",
  "wordfish", "ca"), ...)

# S4 method for formula,missing,dfm,character
textmodel(x, y = NULL, data = NULL, model = c("wordscores", "NB",
  "wordfish", "ca"), ...)

Arguments

x

a quanteda dfm object containing feature counts by document

y

for supervised models, a vector of class labels or values for training the model, with NA for documents to be excluded from the training set; for unsupervised models, this will be left NULL.

data

dfm or data.frame in which the variables named in the formula are found

model

the model type to be fit. The model types are listed below; lda and kNN are placeholders that are not yet implemented:

wordscores

Fits the "wordscores" model of Laver, Benoit, and Garry (2003). Options include the original linear scale of LBG or the logit scale proposed by Beauchamp (2012). See textmodel_wordscores.

NB

Fits a Naive Bayes model to the dfm, with options for smoothing, setting class priors, and a choice of multinomial or binomial probabilities. See textmodel_NB.

wordfish

Fits the "wordfish" model of Slapin and Proksch (2008). See textmodel_wordfish.

ca

Correspondence analysis scaling of the dfm.

lda

Fits a topic model based on latent Dirichlet allocation. Not yet implemented: use convert to convert a dfm into the applicable input format, then use a topic modelling package directly.

kNN

k-nearest neighbour classification, coming soon.
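
To make the wordscores entry and the NA convention for y more concrete, here is a minimal base R sketch of linear (LBG-style) word scoring on a toy count matrix. It is an illustration under stated assumptions, not quanteda's implementation: the matrix, document names, and variable names are all made up, and no smoothing is applied (so a feature absent from every reference text would produce NaN).

```r
# Toy document-feature matrix: 4 documents x 3 features (made-up counts).
counts <- matrix(
  c(4, 1, 0,   # ref_left
    0, 1, 4,   # ref_right
    2, 2, 1,   # virgin_a
    1, 2, 2),  # virgin_b
  nrow = 4, byrow = TRUE,
  dimnames = list(c("ref_left", "ref_right", "virgin_a", "virgin_b"),
                  c("tax", "budget", "welfare"))
)

# Reference scores: NA marks documents excluded from the training set.
y <- c(-1, 1, NA, NA)
is_ref <- !is.na(y)
ref <- counts[is_ref, , drop = FALSE]

# Relative frequency of each word within each reference text.
f_wr <- ref / rowSums(ref)
# Probability of each reference text given the word (columns sum to 1).
p_wr <- sweep(f_wr, 2, colSums(f_wr), "/")
# Word score: reference scores averaged with those probabilities as weights.
word_scores <- colSums(p_wr * y[is_ref])

# Score each unlabelled text as the frequency-weighted mean of its word scores.
virgin <- counts[!is_ref, , drop = FALSE]
text_scores <- drop((virgin / rowSums(virgin)) %*% word_scores)

print(word_scores)
print(text_scores)
```

In this toy case "tax" occurs only in the left reference text and "welfare" only in the right one, so they take the two reference scores, while "budget" is equally frequent in both and scores 0.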

...

additional arguments to be passed to specific model types

formula

An object of class formula of the form y ~ x1 + x2 + ... (interactions are not currently allowed for any of the implemented models). The x variables typically come from a dfm, and y is a vector of class labels or training values associated with each document.
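
Because the formula follows ordinary R formula conventions, an intercept is implied unless it is removed explicitly. The sketch below shows that convention with stats::model.matrix on a hypothetical data frame. It uses a plain data.frame rather than a dfm and does not show how textmodel processes the formula internally.

```r
# Hypothetical toy data: a response and two feature-count columns.
toy <- data.frame(y = c(-1, 1, 0), tax = c(4, 0, 2), welfare = c(0, 4, 1))

# With the implicit intercept, the design matrix gains a constant column.
with_intercept <- model.matrix(y ~ ., data = toy)
print(colnames(with_intercept))

# Adding "- 1" drops the intercept, leaving only the feature columns.
no_intercept <- model.matrix(y ~ . - 1, data = toy)
print(colnames(no_intercept))
```

For a feature matrix the constant column is not a feature, which is why a formula such as y ~ . - 1 is the natural way to say "use every column and nothing else".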

Value

a textmodel class list, containing the fitted model and additional information specific to the model class. See the methods for specific models, e.g. textmodel_wordscores, etc.

Class hierarchy

A description of the class hierarchy that governs dispatch for the predict, print, and summary methods will go here, since this is not terribly obvious. (Blame it on the S3 system.)

See Also

textmodel, textmodel_wordscores

Examples

# NOT RUN {
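# build a dfm from the Irish budget 2010 corpus
ieDfm <- dfm(data_corpus_irishbudget2010, verbose = FALSE)
# documents 5 and 6 are the reference texts; NA excludes the others from training
refscores <- c(rep(NA, 4), -1, 1, rep(NA, 8))
ws <- textmodel(ieDfm, refscores, model = "wordscores", smooth = 1)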

# alternative formula notation - but slower
# need the - 1 to remove the intercept, as this is literal formula notation
wsform <- textmodel(refscores ~ . - 1, data=ieDfm, model="wordscores", smooth=1)
identical(ws@Sw, wsform@Sw)  # compare wordscores from the two models


# compare the logit and linear wordscores
bs <- textmodel(ieDfm[5:6,], refscores[5:6], model="wordscores", scale="logit", smooth=1)
plot(ws@Sw, bs@Sw, xlim=c(-1, 1), xlab="Linear word score", ylab="Logit word score")

# }
# NOT RUN {
wf <- textmodel(ieDfm, model="wordfish", dir = c(6,5))
wf
# }
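
The examples above cover wordscores and wordfish only. As a rough illustration of what the NB model type's smoothing and class priors refer to, here is a base R sketch of a multinomial Naive Bayes classifier on made-up counts. It does not call textmodel or textmodel_NB. The data, the variable names, and the choice of uniform priors are assumptions made for the example.

```r
# Toy training counts: 4 labelled documents x 3 features (made-up data).
train <- matrix(
  c(3, 0, 1,
    2, 1, 0,
    0, 3, 1,
    0, 2, 2),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("d", 1:4), c("tax", "welfare", "budget"))
)
labels <- factor(c("right", "right", "left", "left"))
smooth <- 1  # add-one (Laplace) smoothing

# Per-class feature totals, smoothed, then normalised to P(word | class).
class_counts <- rowsum(train, group = labels) + smooth
p_word_class <- class_counts / rowSums(class_counts)

# Uniform class priors (an assumption for this sketch).
prior <- rep(1 / nlevels(labels), nlevels(labels))
names(prior) <- levels(labels)

# Classify a new document by its log posterior, up to an additive constant.
new_doc <- c(tax = 2, welfare = 0, budget = 1)
log_post <- log(prior) + drop(log(p_word_class) %*% new_doc)
predicted <- names(which.max(log_post))
print(predicted)
```

Smoothing matters here because "tax" never occurs in the "left" training documents: without the added count its estimated probability would be zero and the log posterior for "left" would be -Inf.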