subset.dsm: Subsetting Distributional Semantic Models (wordspace)

Description

Filter the rows and/or columns of a DSM object according to user-specified conditions.

Usage

# S3 method for dsm
subset(x, subset, select, recursive = FALSE, drop.zeroes = FALSE,
       matrix.only = FALSE, envir = parent.frame(), run.gc = FALSE, ...)

Value

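An object of class dsm containing the specified subset of the model x. If matrix.only=TRUE, only the selected subset of the score matrix \(S\) (if available) or frequency matrix \(M\) is returned instead.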

If necessary, counts of nonzero elements for each row and/or column are updated automatically.

Arguments

x: an object of class dsm
subset: Boolean expression or index vector selecting a subset of the rows; the expression can use variables term and f to access target terms and their marginal frequencies, nnzero for the number of nonzero elements in each row, further optional variables from the row information table, as well as global variables such as the sample size N
select: Boolean expression or index vector selecting a subset of the columns; the expression can use variables term and f to access feature terms and their marginal frequencies, nnzero for the number of nonzero elements in each column, further optional variables from the column information table, as well as global variables such as the sample size N
recursive: if TRUE and both subset and select conditions are specified, the subset is applied repeatedly until the DSM no longer changes. This is typically needed if conditions on nonzero counts or row/column norms are specified, which may be affected by the subsetting procedure.
drop.zeroes: if TRUE, all rows and columns without any nonzero entries after subsetting are removed from the model (nonzero counts are based on the score matrix \(S\) if available, raw cooccurrence frequencies \(M\) otherwise)
matrix.only: if TRUE, return only the selected subset of the score matrix \(S\) (if available) or frequency matrix \(M\), not a full DSM object. This may conserve a substantial amount of memory when processing very large DSMs.
envir: environment in which the subset and select conditions are evaluated. Defaults to the context of the function call, so all variables visible there can be used in the expressions.
run.gc: whether to run the garbage collector after each iteration of a recursive subset (recursive=TRUE) in order to keep memory overhead as low as possible. This option should only be specified if memory is very tight, since garbage collector runs can be expensive (e.g. when there are many distinct strings in the workspace).
...: any further arguments are silently ignored
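
The effect of recursive=TRUE can be sketched in base R on a plain matrix. This is an illustration of the fixed-point idea only, not the package's implementation; the matrix M_toy and the helpers row_ok, col_ok, subset_once and recursive_subset are hypothetical names introduced for this example.

```r
# Toy co-occurrence matrix (hypothetical data, not taken from wordspace)
M_toy <- matrix(
  c(1, 1, 0, 0,
    1, 1, 1, 0,
    0, 0, 1, 1,
    0, 0, 0, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("a", "b", "c", "d"), c("f1", "f2", "f3", "f4"))
)

# Conditions on nonzero counts, analogous to `nnzero >= 2` for rows and columns
row_ok <- function(M) rowSums(M != 0) >= 2
col_ok <- function(M) colSums(M != 0) >= 2

# One pass: both conditions are evaluated on the current matrix, then applied
subset_once <- function(M) M[row_ok(M), col_ok(M), drop = FALSE]

# Repeat the pass until the dimensions stop changing (assumed stopping rule)
recursive_subset <- function(M) {
  repeat {
    M_new <- subset_once(M)
    if (identical(dim(M_new), dim(M))) break
    M <- M_new
  }
  M
}

single <- subset_once(M_toy)      # a column may now violate its condition
result <- recursive_subset(M_toy) # every remaining row and column satisfies it

print(colSums(single != 0))
print(result)
```

After a single pass, removing a row can push a column's nonzero count below the threshold, which is why the conditions have to be re-evaluated until nothing changes.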

Author

Stephanie Evert (https://purl.org/stephanie.evert)

Examples

library(wordspace)

print(DSM_TermContext$M)
model <- DSM_TermContext

subset(model, nchar(term) <= 4)$M     # short target terms
subset(model, select=(nnzero <= 3))$M # columns with <= 3 nonzero cells

subset(model, nchar(term) <= 4, nnzero <= 3)$M # combine both conditions

subset(model, nchar(term) <= 4, nnzero >= 2)$M # still three columns with nnzero < 2
subset(model, nchar(term) <= 4, nnzero >= 2, recursive=TRUE)$M
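
The remaining arguments can be tried on the same example model. The following is a minimal sketch rather than part of the original examples; it assumes the wordspace package is installed, the variable max_len is a hypothetical name, and the condition on f is guarded because it relies on the row information table containing marginal frequencies.

```r
library(wordspace)
model <- DSM_TermContext

# matrix.only=TRUE returns just the matrix instead of a full dsm object
M_short <- subset(model, nchar(term) <= 4, matrix.only = TRUE)
print(dim(M_short))

# drop.zeroes=TRUE also removes rows and columns left without nonzero entries
print(subset(model, nchar(term) <= 4, drop.zeroes = TRUE)$M)

# Variables from the calling context are visible in the conditions (see envir)
max_len <- 4  # hypothetical helper variable
print(subset(model, nchar(term) <= max_len)$M)

# Conditions on marginal frequencies, only if the row table provides f
if ("f" %in% colnames(model$rows)) {
  print(subset(model, f >= median(f))$M)
}
```

Using matrix.only=TRUE is mainly of interest for very large models, where skipping the construction of a full dsm object can save memory.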