Learn R Programming

wordspace (version 0.2-0)

read.dsm.triplet: Load DSM Data from Triplet Representation (wordspace)

Description

This functions loads a sparse distributional semantic model in triplet representation -- (target label, feature label, score) -- from a disk file or a pipe. Such a triplet file usually represents a pre-scored DSM, but it can also be used to read raw co-occurrence frequencies. In this case, marginals and sample size can either be derived from the co-occurrence matrix (for syntactic and term-context models) or provided in separate TAB-delimited tables (for surface and textual co-occurrence, or if frequency thresholds have been applied).

Usage

read.dsm.triplet(filename, freq = FALSE, value.first = FALSE, tokens = FALSE,
                 rowinfo = NULL, rowinfo.header = NULL,
                 colinfo = NULL, colinfo.header = NULL,
                 N = NA, span.size = 1,
                 sep = "\t", quote = "", nmax = -1, sort = FALSE, 
                 encoding = getOption("encoding"), verbose = FALSE)

Arguments

filename

the name of a file containing the triplet data (see ‘File Format’ below for details), which may be compressed (.gz, .bz2, .xz). If filename ends in |, it is opened as a Unix pipe for reading.

freq

whether values are raw co-occurrence frequencies (TRUE) or pre-computed scores (FALSE)

value.first

if TRUE, triplets are given as (score, row label, column label) instead of the default (row label, column label, score)

tokens

if TRUE, the input file contains pair tokens, i.e. row and column labels without score/frequency values. Co-occurrence frequencies will automatically be calculated, but this input format should only be used for small samples up to a few millon tokens.

rowinfo

the name of an optional TAB-delimited table file with additional information about the target terms (see ‘File Format’ below for details), which may be compressed (.gz, .bz2, .xz).

rowinfo.header

if the rowinfo file does not start with a header row, specify its column names as a character vector here

colinfo

the name of an optional TAB-delimited table file with additional information about the feature terms or contexts (see ‘File Format’ below for details), which may be compressed (.gz, .bz2, .xz).

colinfo.header

if the colinfo file does not start with a header row, specify its column names as a character vector here

N

sample size to assume for the distributional model (see ‘Details’ below)

span.size

if marginal frequencies are provided externally for surface co-occurrence, they need to be adjusted for span size. If this hasn't been taken into account in data extraction, it can be approximated by specifying the total number of tokens in a span here (see ‘Details’ below).

sep, quote

specify field separator and the types of quotes used by the disk file (see the scan documentation for details). By default, a TAB-delimited file without quotes is assumed.

nmax

if the number of entries (= text lines) in the triplet file is known, it can be specified here in order to make loading faster and more memory-efficient. Caution: If nmax is smaller than the number of lines in the disk file, the extra lines will silently be discarded.

sort

if TRUE, the rows and columns of the co-occurrence matrix will be sorted alphabetically according to their labels (i.e. the target and feature terms); otherwise they are listed as encountered in the triplet representation

encoding

character encoding of the input files, which will automatically be converted to R's internal representation if possible. See ‘Encoding’ in file for details.

verbose

if TRUE, a few progress and information messages are shown

Value

An object of class dsm containing a sparse DSM.

For a model of type 1 (pre-scored) it will include the score matrix $S but no co-occurrence frequency data. Such a DSM object cannot be passed to dsm.score, except with score="reweight". For models of type 2 and 3 it will include the matrix of raw co-occurrence frequencies $M, but no score matrix.

File Format

Triplet files

The triplet file must be a plain-text table with two or three TAB-delimited columns and no header. It may be compressed in .gz, .bz2 or .xz format.

For tokens=TRUE, each line represents a single pair token with columns

  1. target term

  2. feature term / context

For tokens=FALSE, each line represents a pair type (i.e. a unique cell of the co-occurrence matrix) with columns:

  1. target term

  2. feature term / context

  3. score (freq=FALSE) or co-occurrence frequency (freq=TRUE)

If value.first=TRUE, the score entry is expected in the first column:
  1. score or co-occurrence frequency

  2. target term

  3. feature term / context

Note that the triplet file may contain multiple entries for the same cell, whose values will automatically be added up. This might not be very sensible for pre-computed scores.

Row and column information

Additional information about target terms (matrix rows) and feature terms / contexts (matrix columns) can be provided in additional TAB-delimited text tables, optionally compressed in .gz, .bz2 or .xz format.

Such tables can have an arbitrary number of columns whose data types are inferred from the first few rows of the table. Tables should start with a header row specifying the column labels; otherwise they must be passed in the rowinfo.header and colinfo.header arguments.

Every table must contain a column term listing the target terms or feature terms / contexts. Their ordering need not be the same as in the main co-occurrence matrix, and redundant entries will silently be dropped.

If freq=TRUE or tokens=TRUE, the tables must also contain marginal frequencies in a column f. Nonzero counts for rows and columns of the matrix are automatically added unless a column nnzero is already present.

Details

The function read.dsm.triplet can be used to read triplet representations of three different types of DSM.

1. A pre-scored DSM matrix

If freq=FALSE and tokens=FALSE, the triplet file is assumed to contain pre-scored entries of the DSM matrix. Marginal frequencies are not required for such a model, but additional information about targets and features can be provided in separate rowinfo= and colinfo= files.

2. Raw co-occurrence frequencies (syntactic or term-context)

If the triplet file contains syntactic co-occurrence frequencies or term-document frequency counts, specify freq=TRUE. For small data sets, frequencies can also be aggregated directly in R from co-occurrence tokens; specify tokens=TRUE.

Unless high frequency thresholds or other selective filters have been applied to the input data, the marginal frequencies of targets and features as well as the sample size can automatically be derived from the co-occurrence matrix. Do not specify rowinfo= or colinfo= in this case!

Evert (2008) explains the differences between syntactic, textual and surface co-occurrence.

3. Raw co-occurrence frequencies with explicit marginals

For surface and textual co-occurrence data, the correct marginal frequencies cannot be derived automatically and have to be provided in auxiliary table files specified with rowinfo= and colinfo. These files must contain a column f with the marginal frequency data. In addition, the total sample size (which cannot be derived from the marginals) has to be passed in the argument N=. Of course, it is still necessary to specify freq=TRUE (or token=TRUE) in order to indicate that the input data aren't pre-computed scores.

The computation of consistent marginal frequencies is particulary tricky for surface co-occurrence (Evert 2008, p. 1233f) and specialized software should be used for this purpose. As an approximation, simple corpus frequencies of target and feature terms can be corrected by a factor corresponding to the total size of the collocational span (e.g. span.size=8 for a symmetric L4/R4 span, cf. Evert 2008, p. 1225). The read.dsm.triplet function applies this correction to the row marginals.

Explicit marginals should also be provided if syntactic co-occurrence data or text-context frequencies have been filtered, either individually with a frequency threshold or by selecting a subset of the targets and features. See the examples below for an illustration.

References

Evert, Stefan (2008). Corpora and collocations. In A. L<U+00FC>deling and M. Kyt<U+00F6> (eds.), Corpus Linguistics. An International Handbook, chapter 58, pages 1212--1248. Mouton de Gruyter, Berlin, New York.

See Also

dsm, read.dsm.ucs

Examples

Run this code
# NOT RUN {
## this helper function displays the cooccurrence matrix together with marginals
with.marginals <- function (x) {
  y <- x$M
  rownames(y) <- with(x$rows, sprintf("%-8s | %6d", term, f))
  colnames(y) <- with(x$cols, sprintf("  %s | %d", term, f))
  y
}

## we will read this term-context DSM from a triplet file included in the package
with.marginals(DSM_TermContext)

## the triplet file with term-document frequencies
triplet.file <- system.file("extdata", "term_context_triplets.gz", package="wordspace")
cat(readLines(triplet.file), sep="\n") # file format

## marginals incorrect because matrix covers only subset of targets & features
TC1 <- read.dsm.triplet(triplet.file, freq=TRUE)
with.marginals(TC1) # marginal frequencies far too small

## TAB-delimited file with marginal frequencies and other information
marg.file <- system.file("extdata", "term_context_marginals.txt.gz", package="wordspace")
cat(readLines(marg.file), sep="\n") # notice the header row with "term" and "f"

## single table with marginals for rows and columns, but has to be specified twice
TC2 <- read.dsm.triplet(triplet.file, freq=TRUE, 
                        rowinfo=marg.file, colinfo=marg.file, N=108771103)
with.marginals(TC2) # correct marginal frequencies

## marginals table without header: specify column lables separately
no.hdr <- system.file("extdata", "term_context_marginals_noheader.txt", 
                      package="wordspace")
hdr.names <- c("term", "f", "df", "type")
TC3 <- read.dsm.triplet(triplet.file, freq=TRUE, 
                        rowinfo=no.hdr, rowinfo.header=hdr.names,
                        colinfo=no.hdr, colinfo.header=hdr.names, N=108771103)
all.equal(TC2, TC3, check.attributes=FALSE) # same result
# }

Run the code above in your browser using DataLab