Formally this function is of class WeightingFunction with the
additional attributes name and acronym.
The first letter of spec specifies a weighting schema for term
frequencies of m:
- "n" (natural) \(\mathit{tf}_{i,j}\) counts the number of occurrences
\(n_{i,j}\) of a term \(t_i\) in a document \(d_j\). The input
term-document matrix m is assumed to be in this standard term
frequency format already.
- "l" (logarithm) is defined as \(1 + \log_2(\mathit{tf}_{i,j})\).
- "a" (augmented) is defined as
\(0.5 + \frac{0.5 * \mathit{tf}_{i,j}}{\max_i(\mathit{tf}_{i,j})}\).
- "b" (boolean) is defined as 1 if \(\mathit{tf}_{i,j} > 0\) and 0 otherwise.
- "L" (log average) is defined as
\(\frac{1 + \log_2(\mathit{tf}_{i,j})}{1+\log_2(\mathrm{ave}_{i\in j}(\mathit{tf}_{i,j}))}\).
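The five term frequency schemas can be sketched for a single document as follows. This is an illustrative Python sketch, not the package's own code: the function name tf_component and the dictionary representation of a document column are hypothetical, every listed count is assumed to be positive (the logarithm is undefined at zero), and for "L" the average is assumed to run over the terms that actually occur in the document.

```python
import math


def tf_component(counts, scheme):
    """Weight the raw term counts of one document (one column of m).

    counts: dict mapping each term that occurs in the document to its
            raw count tf (hypothetical representation; counts are > 0).
    scheme: first letter of spec, one of "n", "l", "a", "b", "L".
    """
    if scheme == "n":  # natural: keep the raw counts
        return {t: float(c) for t, c in counts.items()}
    if scheme == "l":  # logarithm: 1 + log2(tf)
        return {t: 1 + math.log2(c) for t, c in counts.items()}
    if scheme == "a":  # augmented: scale by the largest count in the document
        peak = max(counts.values())
        return {t: 0.5 + 0.5 * c / peak for t, c in counts.items()}
    if scheme == "b":  # boolean: 1 if the term occurs, 0 otherwise
        return {t: 1.0 if c > 0 else 0.0 for t, c in counts.items()}
    if scheme == "L":  # log average
        # Assumption: the average is taken over the terms present in the document.
        mean = sum(counts.values()) / len(counts)
        denom = 1 + math.log2(mean)
        return {t: (1 + math.log2(c)) / denom for t, c in counts.items()}
    raise ValueError(f"unknown term frequency scheme: {scheme!r}")


doc = {"apple": 4, "pear": 1, "plum": 1}
for letter in "nlabL":
    print(letter, tf_component(doc, letter))
```

With these counts the average is 2, so under "L" the term counted four times gets \((1 + 2) / (1 + 1) = 1.5\), while under "a" a term counted once gets \(0.5 + 0.5 \cdot 1/4 = 0.625\).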
The second letter of spec specifies a weighting schema of document
frequencies for m:
- "n" (no) is defined as 1.
- "t" (idf) is defined as \(\log_2 \frac{N}{\mathit{df}_t}\) where
\(N\) denotes the number of documents and \(\mathit{df}_t\) denotes the
number of documents in which term \(t\) occurs.
- "p" (prob idf) is defined as
\(\max(0, \log_2(\frac{N - \mathit{df}_t}{\mathit{df}_t}))\).
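The document frequency schemas reduce to one number per term. The following Python sketch is illustrative only: the function name df_component is hypothetical, \(N\) is taken to be the number of documents and \(\mathit{df}_t\) the number of documents containing term \(t\), and a term occurring in every document is given a "p" weight of 0, which is a choice made here to avoid taking the logarithm of zero.

```python
import math


def df_component(doc_freq, n_docs, scheme):
    """Document frequency weight of one term.

    doc_freq: number of documents containing the term (assumed >= 1).
    n_docs:   total number of documents N.
    scheme:   second letter of spec, one of "n", "t", "p".
    """
    if scheme == "n":  # no document frequency weighting
        return 1.0
    if scheme == "t":  # idf: log2(N / df)
        return math.log2(n_docs / doc_freq)
    if scheme == "p":  # prob idf: max(0, log2((N - df) / df))
        if doc_freq >= n_docs:
            # Choice made in this sketch: the logarithm tends to -inf, so clamp to 0.
            return 0.0
        return max(0.0, math.log2((n_docs - doc_freq) / doc_freq))
    raise ValueError(f"unknown document frequency scheme: {scheme!r}")


# A term found in 2 of 8 documents, and one found in 6 of 8.
print(df_component(2, 8, "t"), df_component(2, 8, "p"))
print(df_component(6, 8, "t"), df_component(6, 8, "p"))
```

The "p" variant clamps to zero as soon as a term occurs in more than half of the documents, whereas "t" only reaches zero when the term occurs in all of them.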
The third letter of spec specifies a schema for normalization of m:
- "n" (none) is defined as 1.
- "c" (cosine) is defined as \(\sqrt{\mathrm{col\_sums}(m ^ 2)}\).
- "u" (pivoted unique) is defined as
\(\mathit{slope} * \sqrt{\mathrm{col\_sums}(m ^ 2)} + (1 - \mathit{slope}) * \mathit{pivot}\)
where both slope and pivot must be set via named tags in the control
list.
- "b" (byte size) is defined as \(\frac{1}{\mathit{CharLength}^\alpha}\).
The parameter \(\alpha\) must be set via the named tag alpha in the
control list.
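The normalization quantities listed above can be evaluated per document as in the Python sketch below. It only computes each quantity exactly as the formulas state it and makes no claim about how the package applies the result to m. The function name normalization_component, the plain dict standing in for the control list, and the char_length argument (standing in for \(\mathit{CharLength}\), whose source is not specified here) are all hypothetical.

```python
import math


def normalization_component(column, scheme, control=None, char_length=None):
    """Normalization quantity for one document (one column of m).

    column:      weights of the document's terms.
    scheme:      third letter of spec, one of "n", "c", "u", "b".
    control:     dict standing in for the control list; must hold "slope"
                 and "pivot" for "u", and "alpha" for "b".
    char_length: hypothetical stand-in for CharLength, used by "b" only.
    """
    control = control or {}
    if scheme == "n":  # none
        return 1.0
    # Euclidean length of the column, i.e. sqrt(col_sums(m ^ 2)) for one document.
    length = math.sqrt(sum(w * w for w in column))
    if scheme == "c":  # cosine
        return length
    if scheme == "u":  # pivoted unique; a missing tag raises KeyError
        slope, pivot = control["slope"], control["pivot"]
        return slope * length + (1 - slope) * pivot
    if scheme == "b":  # byte size; a missing alpha raises KeyError
        return 1.0 / char_length ** control["alpha"]
    raise ValueError(f"unknown normalization scheme: {scheme!r}")


column = [3.0, 4.0]
print(normalization_component(column, "c"))
print(normalization_component(column, "u", {"slope": 0.2, "pivot": 10.0}))
print(normalization_component(column, "b", {"alpha": 0.5}, char_length=100))
```

Looking up slope, pivot and alpha without defaults mirrors the requirement that these tags must be set: leaving one out fails immediately instead of silently falling back to a guessed value.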
The final result is defined by multiplication of the chosen term
frequency component with the chosen document frequency component with
the chosen normalization component.