tm (version 0.7-12)

weightTfIdf: Weight by Term Frequency - Inverse Document Frequency

Description

Weight a term-document matrix by term frequency - inverse document frequency.

Usage

weightTfIdf(m, normalize = TRUE)

Value

The weighted matrix.

Arguments

m

A TermDocumentMatrix in term frequency format.

normalize

A logical value indicating whether the term frequencies should be normalized by the total number of term occurrences in each document.
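A minimal sketch of how the two arguments might be used, assuming the tm package is installed. The three documents are made up purely for illustration; the matrix is built with the default term frequency weighting and then re-weighted.

```r
library(tm)

# Three tiny, made-up documents (illustrative only).
docs <- c("apple banana apple",
          "banana cherry",
          "apple cherry cherry durian")

corpus <- VCorpus(VectorSource(docs))

# Term-document matrix of raw counts (default term frequency weighting).
tdm <- TermDocumentMatrix(corpus)

# tf-idf with and without normalization of the term frequencies.
tdm_norm <- weightTfIdf(tdm, normalize = TRUE)
tdm_raw  <- weightTfIdf(tdm, normalize = FALSE)

inspect(tdm_norm)
```

With `normalize = FALSE` the raw counts are multiplied by the inverse document frequency directly, so longer documents tend to receive larger weights.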

Details

Formally this function is of class WeightingFunction with the additional attributes name and acronym.
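The attributes mentioned above can be inspected directly, and because the object is an ordinary function it can also be supplied as the `weighting` entry of the control list when the matrix is built. The following is a small sketch under the assumption that tm is installed; the two-document corpus is hypothetical.

```r
library(tm)

# Descriptive metadata stored on the weighting function itself.
attr(weightTfIdf, "name")
attr(weightTfIdf, "acronym")

# Apply the weighting while constructing the matrix instead of afterwards.
corpus <- VCorpus(VectorSource(c("apple banana apple", "banana cherry")))
tdm <- TermDocumentMatrix(corpus, control = list(weighting = weightTfIdf))
```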

Term frequency \(\mathit{tf}_{i,j}\) counts the number of occurrences \(n_{i,j}\) of a term \(t_i\) in a document \(d_j\). If normalize is TRUE, the term frequency \(\mathit{tf}_{i,j}\) is divided by \(\sum_k n_{k,j}\), the total number of term occurrences in document \(d_j\).

Inverse document frequency for a term \(t_i\) is defined as $$\mathit{idf}_i = \log_2 \frac{|D|}{|\{d \mid t_i \in d\}|}$$ where \(|D|\) is the total number of documents and \(|\{d \mid t_i \in d\}|\) is the number of documents in which the term \(t_i\) appears.
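As a quick worked illustration with made-up numbers: in a collection of \(|D| = 8\) documents, a term that appears in 2 of them has $$\mathit{idf}_i = \log_2 \frac{8}{2} = 2,$$ while a term that appears in all 8 documents has \(\mathit{idf}_i = \log_2 \frac{8}{8} = 0\) and therefore receives zero weight in every document.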

Term frequency - inverse document frequency is then defined as the product \(\mathit{tf}_{i,j} \cdot \mathit{idf}_i\).
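The definitions above can be traced by hand in base R. This is only an illustrative sketch of the formulas on a made-up dense count matrix, not the package's internal implementation; the names `n`, `tf`, `idf` and `tfidf` are chosen here for the example.

```r
# Toy count matrix n[i, j]: rows are terms t_i, columns are documents d_j.
n <- matrix(c(2, 0, 1,
              1, 1, 0,
              0, 1, 2,
              0, 0, 1),
            nrow = 4, byrow = TRUE,
            dimnames = list(c("apple", "banana", "cherry", "durian"),
                            c("d1", "d2", "d3")))

# Normalized term frequency: divide each column by its total count.
tf <- sweep(n, 2, colSums(n), "/")

# Inverse document frequency: log2(|D| / number of documents containing t_i).
idf <- log2(ncol(n) / rowSums(n > 0))

# tf-idf: each row (term) is scaled by that term's idf.
tfidf <- tf * idf

round(tfidf, 3)
```

Here "durian" occurs in only one of three documents, so it has the largest idf, \(\log_2 3\), while a term absent from a document keeps a weight of zero there.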

References

Gerard Salton and Christopher Buckley (1988). Term-weighting approaches in automatic text retrieval. Information Processing and Management, 24(5), 513–523.