dsm: Create DSM Object Representing a Distributional Semantic Model (wordspace)

Description

This is the constructor function for dsm objects representing distributional semantic models, i.e. a co-occurrence matrix together with additional information on target terms (rows) and features (columns). A new DSM can be initialised with a dense or sparse co-occurrence matrix, or with a triplet representation of a sparse matrix.

Usage

dsm(M = NULL, target = NULL, feature = NULL, score = NULL,
    rowinfo = NULL, colinfo = NULL, N = NA,
    globals = list(), raw.freq = FALSE, sort = FALSE, verbose = FALSE)

Arguments

M

a dense or sparse co-occurrence matrix. A sparse matrix must be a subclass of sparseMatrix from the Matrix package. See "Details" below.

target

a character vector of target terms (see "Details" below)

feature

a character vector of feature terms (see "Details" below)

score

a numeric vector of co-occurrence frequencies or weighted/transformed scores (see "Details" below)

rowinfo

a data frame containing information about the rows of the co-occurrence matrix, corresponding to target terms. The data frame must include a column term with the target term labels. If unspecified, a minimal rowinfo table is compiled automatically (see "Details" below).

colinfo

a data frame containing information about the columns of the co-occurrence matrix, corresponding to feature terms. The data frame must include a column term with the feature term labels. If unspecified, a minimal colinfo table is compiled automatically (see "Details" below).

N

a single numeric value specifying the effective sample size of the co-occurrence matrix. This value may be determined automatically if raw.freq=TRUE.

globals

a list of global variables, which are included in the globals field of the DSM object. May contain an entry for the sample size N, which can be overridden by an explicitly specified value in the argument N.

raw.freq

if TRUE, entries of the co-occurrence matrix are interpreted as raw frequency counts. By default, it is assumed that some weighting/transformation has already been applied.

sort

if TRUE, sort rows and columns of a co-occurrence matrix specified in triplet form alphabetically. If the matrix is given directly (in argument M), rows and columns are never reordered.

verbose

if TRUE, a few progress and information messages are shown

Value

An object of class dsm, a list with the following components:

M

A co-occurrence matrix of raw frequency counts in canonical format (see dsm.canonical.matrix).

S

A weighted and transformed co-occurrence matrix ("score" matrix) in canonical format (see dsm.canonical.matrix). Either M or S or both may be present. The object returned by dsm() will include M if raw.freq=TRUE and S otherwise.

rows

A data frame with information about the target terms, corresponding to the rows of the co-occurrence matrix. The data frame usually has at least three columns:

rows$term: the target term = row label
rows$f: marginal frequency of the target term; must be present if the DSM object contains a raw co-occurrence matrix M
rows$nnzero: number of nonzero entries in the corresponding row of the co-occurrence matrix

Further columns may provide additional information.

cols

A data frame with information about the feature terms, corresponding to the columns of the co-occurrence matrix, in the same format as rows.

globals

A list of global variables. The following variables have a special meaning:

globals$N: effective sample size of the underlying corpus; may be NA if raw co-occurrence counts are not available
globals$locked: if TRUE, the marginal frequencies are no longer valid due to a merge, rbind or cbind operation; in this case, association scores cannot be computed from the co-occurrence frequencies M

Details

The co-occurrence matrix forming the core of the distributional semantic model (DSM) can be specified in two different ways:

As a dense or sparse matrix in argument M. A sparse matrix must be a subclass of dMatrix (from the Matrix package) and is automatically converted to the canonical storage mode used by the wordspace package. Row and column labels may be specified with arguments target and feature, which must be character vectors of suitable length; otherwise dimnames(M) are used.
As a triplet representation in arguments target (row label), feature (column label) and score (co-occurrence frequency or pre-computed score). The three arguments must be vectors of the same length; each set of corresponding elements specifies a non-zero cell of the co-occurrence matrix. If multiple entries for the same cell are given, their frequency or score values are added up.
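
The two input forms described above can be sketched as follows. This is a minimal illustration, assuming the wordspace package is installed; the object names M.dense, dsm.matrix and dsm.triplet are made up for this example.

```r
library(wordspace)

# (1) Matrix form: dimnames(M) supply the target and feature labels
M.dense <- matrix(
  c(1, 3, 0,
    0, 0, 2,
    1, 0, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("boat", "cat", "dog"), c("buy", "use", "feed"))
)
dsm.matrix <- dsm(M.dense, raw.freq = TRUE)

# (2) Triplet form: the cell (boat, buy) is listed twice, so according to
#     the description above its two values should be added up
dsm.triplet <- dsm(
  target  = c("boat", "boat", "boat", "dog"),
  feature = c("buy",  "buy",  "use",  "feed"),
  score   = c(1,      2,      3,      1),
  raw.freq = TRUE
)

print(dim(dsm.matrix$M))  # dimensions of the stored co-occurrence matrix
print(dsm.triplet$M)      # inspect the merged cell for (boat, buy)
```

Because raw.freq=TRUE is set in both calls, the matrix is stored in component M rather than S.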

The optional arguments rowinfo and colinfo are data frames with additional information about target and feature terms. If they are specified, they must contain a column $term matching the row or column labels of the co-occurrence matrix. Marginal frequencies and nonzero or document counts can be given in columns $f and $nnzero; any further columns are interpreted as meta-information on the target or feature terms. The rows of each data frame are automatically reordered to match the rows or columns of the co-occurrence matrix. Target or feature terms that do not appear in the co-occurrence matrix are silently discarded.
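
As a hedged sketch of how a rowinfo table might be supplied: the data frame row.meta and its column animate are hypothetical and only serve to illustrate meta-information; the entry "zebra" is included to show a term that does not occur in the matrix.

```r
library(wordspace)

# Hypothetical metadata table: the required 'term' column plus one extra
# column ('animate'); "zebra" does not occur in the co-occurrence matrix
row.meta <- data.frame(
  term    = c("zebra", "dog", "cat", "boat"),
  animate = c(TRUE, TRUE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

MyDSM <- dsm(
  target  = c("boat", "boat", "cat",  "dog", "dog"),
  feature = c("buy",  "use",  "feed", "buy", "feed"),
  score   = c(1,      3,      2,      1,     1),
  rowinfo = row.meta,
  raw.freq = TRUE
)

# Rows of the table are matched to the matrix rows; per the description
# above, "zebra" is expected to be dropped without a warning
print(MyDSM$rows)
```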

Counts of nonzero cells for each row and column are computed automatically, unless they are already present in the rowinfo and colinfo data frames. If the co-occurrence matrix contains raw frequency values, marginal frequencies for the target and feature terms are also computed automatically unless given in rowinfo and colinfo; the same holds for the effective sample size N.
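
The automatically compiled information can be inspected as in the following sketch, which assumes the wordspace package is installed. The second call passes an arbitrary illustrative value of 1000 for N to show an explicitly specified sample size; the object names AutoDSM and FixedDSM are made up for this example.

```r
library(wordspace)

target  <- c("boat", "boat", "cat",  "dog", "dog")
feature <- c("buy",  "use",  "feed", "buy", "feed")
score   <- c(1,      3,      2,      1,     1)

# Raw frequencies: nonzero counts, marginal frequencies and N are filled in
AutoDSM <- dsm(target = target, feature = feature, score = score,
               raw.freq = TRUE, sort = TRUE)
print(AutoDSM$rows[, c("term", "nnzero", "f")])
print(AutoDSM$cols[, c("term", "nnzero", "f")])
print(AutoDSM$globals$N)

# Same data, but with an explicitly specified sample size (arbitrary value)
FixedDSM <- dsm(target = target, feature = feature, score = score,
                raw.freq = TRUE, N = 1000)
print(FixedDSM$globals$N)
```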

If raw.freq=TRUE, all matrix entries must be non-negative; fractional frequency counts are allowed, however.
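
A short sketch of this constraint, assuming the wordspace package is installed (FracDSM and res are illustrative names): fractional counts are passed as raw frequencies, and a negative value is wrapped in tryCatch() so that the script keeps running whatever dsm() does with it.

```r
library(wordspace)

# Fractional raw frequencies are allowed
FracDSM <- dsm(
  target  = c("boat", "boat", "cat"),
  feature = c("buy",  "use",  "feed"),
  score   = c(0.5,    2.25,   1),
  raw.freq = TRUE
)
print(FracDSM$M)

# A negative value violates the stated requirement for raw.freq = TRUE;
# tryCatch() captures the error message if dsm() rejects the input
res <- tryCatch(
  dsm(target = c("boat", "cat"), feature = c("buy", "feed"),
      score = c(-1, 2), raw.freq = TRUE),
  error = function(e) conditionMessage(e)
)
print(res)
```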

Examples


# NOT RUN {
MyDSM <- dsm(
  target =  c("boat", "boat", "cat",  "dog", "dog"),
  feature = c("buy",  "use",  "feed", "buy", "feed"),
  score =   c(1,      3,      2,      1,     1),
  raw.freq = TRUE
)

print(MyDSM)   # 3 x 3 matrix with 5 out of 9 nonzero cells
print(MyDSM$M) # the actual co-occurrence matrix

print(MyDSM$rows) # row information
print(MyDSM$cols) # column information

# }
