quanteda (version 0.9.9-50)

textmodel_NB: Naive Bayes classifier for texts

Description

Fit a multinomial or Bernoulli Naive Bayes model to a document-feature matrix (dfm), given a vector of training labels.

Usage

textmodel_NB(x, y, smooth = 1, prior = c("uniform", "docfreq", "termfreq"),
  distribution = c("multinomial", "Bernoulli"), ...)

Arguments

x
the dfm on which the model will be fit. It does not need to contain only the training documents: documents whose label in y is NA, such as document 5 in the example below, are left unlabelled.
y
vector of training labels associated with each document in x. (These will be converted to factors if not already factors.)
smooth
smoothing parameter for feature counts by class
prior
prior distribution on texts; one of "uniform", "docfreq", or "termfreq"
distribution
count model for text features, can be multinomial or Bernoulli
...
more arguments passed through
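
The three prior options can be illustrated in base R. This is a minimal sketch that assumes "uniform" gives every class equal weight, "docfreq" uses the share of training documents per class, and "termfreq" uses the share of total feature counts per class. That reading is inferred from the option names and is not taken from the package source; the variable names are hypothetical.

```r
# Training counts from the IIR example: 4 labelled documents, 6 features.
counts <- matrix(c(1, 2, 0, 0, 0, 0,
                   0, 2, 0, 0, 1, 0,
                   0, 1, 0, 1, 0, 0,
                   0, 1, 1, 0, 0, 1),
                 nrow = 4, byrow = TRUE)
labels <- factor(c("Y", "Y", "Y", "N"))

# "uniform" (assumed meaning): every class gets the same prior weight.
prior_uniform <- rep(1 / nlevels(labels), nlevels(labels))
names(prior_uniform) <- levels(labels)

# "docfreq" (assumed meaning): share of training documents in each class.
prior_docfreq <- prop.table(table(labels))

# "termfreq" (assumed meaning): share of all feature counts in each class.
class_totals <- tapply(rowSums(counts), labels, sum)
prior_termfreq <- class_totals / sum(class_totals)

print(prior_uniform)
print(prior_docfreq)
print(prior_termfreq)
```

Under these assumptions the class "Y" gets weight 1/2, 3/4, and 8/11 respectively, so the choice of prior can change which class a short document is assigned to.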

Value

A list of return values, consisting of:

call
original function call

PwGc
probability of the word given the class (empirical likelihood)

Pc
class prior probability

PcGw
posterior class probability given the word

Pw
baseline probability of the word

data
list consisting of x, the dfm supplied for fitting, and y, the class labels

distribution
the distribution argument

prior
argument passed as a prior

smooth
smoothing parameter
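
The probability components listed above are linked by Bayes' theorem. The base R sketch below shows one way PcGw and Pw could be derived from PwGc and Pc. It assumes classes are in rows and features in columns, and it hard-codes smoothed likelihoods worked out by hand for the IIR example; the layout of the fitted object itself is not confirmed by this page.

```r
# Smoothed likelihoods P(w | c) for the IIR example, worked out by hand
# (add-one smoothing over 6 features); rows are classes, columns are features.
features <- c("Beijing", "Chinese", "Japan", "Macao", "Shanghai", "Tokyo")
PwGc <- rbind(N = c(1, 2, 2, 1, 1, 2) / 9,
              Y = c(2, 6, 1, 2, 2, 1) / 14)
colnames(PwGc) <- features

# Class priors P(c); document-frequency priors are assumed here.
Pc <- c(N = 0.25, Y = 0.75)

# Joint probability P(w, c) = P(w | c) * P(c): row i is scaled by Pc[i].
joint <- PwGc * Pc

# Baseline word probability P(w) = sum over classes of P(w, c).
Pw <- colSums(joint)

# Posterior P(c | w) = P(w, c) / P(w): divide each column by its total.
PcGw <- sweep(joint, 2, Pw, "/")

print(round(PcGw, 3))
```

Each column of PcGw sums to 1, and Pw sums to 1 across features, which is a quick sanity check on any fitted model that follows this layout.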

Predict Methods

A predict method is also available for a fitted Naive Bayes object, see predict.textmodel_NB_fitted.

Details

This naive Bayes model works on word counts, with smoothing.
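
To make the role of smoothing concrete, the following base R sketch works through the multinomial case by hand for the IIR example used below. It assumes the smoothing constant is added to every class-by-feature count before each class row is normalised, and it uses document-frequency priors as IIR does. It does not call quanteda, and its variable names only mirror the return values for readability.

```r
# Labelled training documents from the IIR example (d1 to d4).
counts <- matrix(c(1, 2, 0, 0, 0, 0,
                   0, 2, 0, 0, 1, 0,
                   0, 1, 0, 1, 0, 0,
                   0, 1, 1, 0, 0, 1),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("d", 1:4),
                                 c("Beijing", "Chinese", "Japan", "Macao",
                                   "Shanghai", "Tokyo")))
labels <- c("Y", "Y", "Y", "N")
smooth <- 1

# Sum feature counts within each class: one row per class ("N", "Y").
class_counts <- rowsum(counts, group = labels)

# Add the smoothing constant, then normalise each class row to sum to 1.
PwGc <- (class_counts + smooth) / rowSums(class_counts + smooth)

# Document-frequency priors, as in IIR: 1 of 4 documents is "N", 3 are "Y".
Pc <- c(N = 0.25, Y = 0.75)

# Test document d5: Chinese x 3, Japan x 1, Tokyo x 1.
d5 <- c(Beijing = 0, Chinese = 3, Japan = 1, Macao = 0, Shanghai = 0, Tokyo = 1)

# Work in logs to avoid underflow: log P(c) + sum_w n_w * log P(w | c).
log_joint <- log(Pc[rownames(PwGc)]) + as.vector(log(PwGc) %*% d5)
joint <- exp(log_joint)
posterior <- joint / sum(joint)

print(PwGc)
print(joint)
print(names(which.max(posterior)))
```

Before normalisation the scores are roughly 0.0003 for "Y" and 0.0001 for "N", the values reported in IIR, so d5 is assigned to "Y".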

Examples

## Example from 13.1 of _An Introduction to Information Retrieval_
trainingset <- as.dfm(matrix(c(1, 2, 0, 0, 0, 0,
                        0, 2, 0, 0, 1, 0,
                        0, 1, 0, 1, 0, 0,
                        0, 1, 1, 0, 0, 1,
                        0, 3, 1, 0, 0, 1), 
                      ncol=6, nrow=5, byrow=TRUE,
                      dimnames = list(docs = paste("d", 1:5, sep = ""),
                                      features = c("Beijing", "Chinese",  "Japan", "Macao", 
                                                   "Shanghai", "Tokyo"))))
trainingclass <- factor(c("Y", "Y", "Y", "N", NA), ordered = TRUE)
## replicate IIR p261 prediction for test set (document 5);
## IIR estimates the class priors from document frequencies, so use prior = "docfreq"
(nb.p261 <- textmodel_NB(trainingset, trainingclass, prior = "docfreq"))
predict(nb.p261, newdata = trainingset[5, ])

# contrast with other priors
predict(textmodel_NB(trainingset, trainingclass, prior = "uniform"))
predict(textmodel_NB(trainingset, trainingclass, prior = "termfreq"))