This is the constructor function for dsm objects representing distributional semantic models, i.e. a co-occurrence matrix together with additional information on target terms (rows) and features (columns).
A new DSM can be initialised with a dense or sparse co-occurrence matrix, or with a triplet representation of a sparse matrix.
dsm(M = NULL, target = NULL, feature = NULL, score = NULL,
rowinfo = NULL, colinfo = NULL, N = NA,
globals = list(), raw.freq = FALSE, sort = FALSE, verbose = FALSE)
M
a dense or sparse co-occurrence matrix. A sparse matrix must be a subclass of sparseMatrix from the Matrix package. See "Details" below.
target
a character vector of target terms (see "Details" below)
feature
a character vector of feature terms (see "Details" below)
score
a numeric vector of co-occurrence frequencies or weighted/transformed scores (see "Details" below)
rowinfo
a data frame containing information about the rows of the co-occurrence matrix, corresponding to target terms. The data frame must include a column term with the target term labels. If unspecified, a minimal rowinfo table is compiled automatically (see "Details" below).
colinfo
a data frame containing information about the columns of the co-occurrence matrix, corresponding to feature terms. The data frame must include a column term with the feature term labels. If unspecified, a minimal colinfo table is compiled automatically (see "Details" below).
N
a single numeric value specifying the effective sample size of the co-occurrence matrix. This value may be determined automatically if raw.freq=TRUE.
globals
a list of global variables, which are included in the globals field of the DSM object. May contain an entry for the sample size \(N\), which can be overridden by an explicitly specified value in the argument N.
raw.freq
if TRUE, entries of the co-occurrence matrix are interpreted as raw frequency counts. By default, it is assumed that some weighting/transformation has already been applied.
sort
if TRUE, sort rows and columns of a co-occurrence matrix specified in triplet form alphabetically. If the matrix is given directly (in argument M), rows and columns are never reordered.
verbose
if TRUE, a few progress and information messages are shown
An object of class dsm, a list with the following components:
M
A co-occurrence matrix of raw frequency counts in canonical format (see dsm.canonical.matrix).
S
A weighted and transformed co-occurrence matrix ("score" matrix) in canonical format (see dsm.canonical.matrix).
Either M or S or both may be present. The object returned by dsm() will include M if raw.freq=TRUE and S otherwise.
rows
A data frame with information about the target terms, corresponding to the rows of the co-occurrence matrix. The data frame usually has at least three columns:
rows$term
the target term = row label
rows$f
marginal frequency of the target term; must be present if the DSM object contains a raw co-occurrence matrix M
rows$nnzero
number of nonzero entries in the corresponding row of the co-occurrence matrix
cols
A data frame with information about the feature terms, corresponding to the columns of the co-occurrence matrix, in the same format as rows.
globals
A list of global variables. The following variables have a special meaning:
globals$N
effective sample size of the underlying corpus; may be NA if raw co-occurrence counts are not available
globals$locked
if TRUE, the marginal frequencies are no longer valid due to a merge, rbind or cbind operation; in this case, association scores cannot be computed from the co-occurrence frequencies M
The co-occurrence matrix forming the core of the distributional semantic model (DSM) can be specified in two different ways:
As a dense or sparse matrix in argument M. A sparse matrix must be a subclass of dMatrix (from the Matrix package) and is automatically converted to the canonical storage mode used by the wordspace package. Row and column labels may be specified with arguments target and feature, which must be character vectors of suitable length; otherwise dimnames(M) are used.
As a triplet representation in arguments target (row label), feature (column label) and score (co-occurrence frequency or pre-computed score). The three arguments must be vectors of the same length; each set of corresponding elements specifies a non-zero cell of the co-occurrence matrix. If multiple entries for the same cell are given, their frequency or score values are added up.
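The two input styles described above can be compared side by side. The following is a minimal sketch with made-up toy data, assuming the wordspace package is installed; the variable names from_matrix and from_triplets are purely illustrative.

```r
library(wordspace)

# Way 1: a dense matrix whose dimnames supply the row and column labels.
# With the default raw.freq = FALSE, the values are treated as scores.
M <- matrix(c(1, 0, 2,
              0, 3, 0), nrow = 2, byrow = TRUE,
            dimnames = list(c("cat", "dog"), c("feed", "buy", "walk")))
from_matrix <- dsm(M)

# Way 2: a triplet representation. The two ("cat", "feed") entries
# refer to the same cell, so their scores should be added up.
from_triplets <- dsm(
  target  = c("cat", "cat", "dog"),
  feature = c("feed", "feed", "buy"),
  score   = c(1, 2, 5)
)

print(dim(from_matrix))             # dimensions of the matrix-based DSM
print(as.matrix(from_triplets$S))   # score matrix built from the triplets
```

Because raw.freq is left at its default, both objects should carry the matrix in the S component rather than in M.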
The optional arguments rowinfo and colinfo are data frames with additional information about target and feature terms. If they are specified, they must contain a column $term matching the row or column labels of the co-occurrence matrix. Marginal frequencies and nonzero or document counts can be given in columns $f and $nnzero; any further columns are interpreted as meta-information on the target or feature terms. The rows of each data frame are automatically reordered to match the rows or columns of the co-occurrence matrix. Target or feature terms that do not appear in the co-occurrence matrix are silently discarded.
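As an illustration of the reordering and discarding behaviour, here is a small sketch assuming the wordspace package is installed. The data frame info and its habitat column are hypothetical metadata invented for this example.

```r
library(wordspace)

# A small score matrix with two target terms
M <- matrix(c(0.5, 0,
              0,   1.2), nrow = 2, byrow = TRUE,
            dimnames = list(c("cat", "dog"), c("feed", "walk")))

# Hypothetical row metadata: listed in a different order than the rows
# of M, and including a term ("fish") that does not occur in the matrix
info <- data.frame(
  term    = c("fish", "dog", "cat"),
  habitat = c("water", "kennel", "house"),
  stringsAsFactors = FALSE
)

model <- dsm(M, rowinfo = info)

# Row information after construction: it should follow the row order of M,
# keep the extra habitat column, and no longer contain "fish"
print(model$rows)
```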
Counts of nonzero cells for each row and column are computed automatically, unless they are already present in the rowinfo and colinfo data frames. If the co-occurrence matrix contains raw frequency values, marginal frequencies for the target and feature terms are also computed automatically unless given in rowinfo and colinfo; the same holds for the effective sample size N.
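The automatically compiled information can be inspected directly on the returned object. The sketch below uses invented toy counts and assumes the wordspace package is installed; it also assumes that the automatic marginal frequencies are plain row sums of the raw counts, which it checks by comparison rather than taking for granted.

```r
library(wordspace)

# Raw co-occurrence counts in triplet form (toy data)
counts <- dsm(
  target   = c("tea", "tea", "coffee"),
  feature  = c("drink", "brew", "drink"),
  score    = c(4, 1, 2),
  raw.freq = TRUE
)

# Marginal frequencies and nonzero counts compiled for the target terms
print(counts$rows[, c("term", "f", "nnzero")])

# Compare with row sums and nonzero counts computed by hand
raw <- as.matrix(counts$M)
print(rowSums(raw))
print(rowSums(raw != 0))

# Effective sample size stored in the globals field
print(counts$globals$N)
```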
If raw.freq=TRUE, all matrix entries must be non-negative; fractional frequency counts are allowed, however.
See dsm.canonical.matrix for a description of the canonical matrix formats. DSM objects are usually loaded directly from a disk file in UCS (read.dsm.ucs) or triplet (read.dsm.triplet) format.
library(wordspace)

# build a small DSM from raw co-occurrence counts in triplet form
MyDSM <- dsm(
  target = c("boat", "boat", "cat", "dog", "dog"),
  feature = c("buy", "use", "feed", "buy", "feed"),
  score = c(1, 3, 2, 1, 1),
  raw.freq = TRUE
)
print(MyDSM) # 3 x 3 matrix with 5 out of 9 nonzero cells
print(MyDSM$M) # the actual co-occurrence matrix
print(MyDSM$rows) # row information
print(MyDSM$cols) # column information