read.dsm.ucs: Load Raw DSM Data from Disk Files in UCS Export Format (wordspace)

Description

This function loads raw DSM data -- a cooccurrence frequency matrix and tables of marginal frequencies -- in UCS export format. The data are read from a directory containing several text files with predefined names, which can optionally be compressed (see ‘File Format’ below for details).

Usage

read.dsm.ucs(filename, encoding = getOption("encoding"), verbose = FALSE)

Value

An object of class dsm containing a dense or sparse DSM.

Note that the information tables for target terms (field rows) and feature terms (field cols) include the correct marginal frequencies from the UCS export files. Nonzero counts for rows and columns are added automatically unless they are already present in the disk files. Additional fields from the information tables as well as all global variables are preserved with their original names.
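To make the term "nonzero count" concrete, the sketch below computes such counts for a toy matrix in base R. The matrix, its labels and the variable names are made up for illustration; this shows what the counts mean, not how the package computes them internally.

```r
# Toy cooccurrence matrix: 3 target terms x 4 feature terms (made-up counts)
M <- matrix(c(2, 0, 0, 5,
              0, 0, 1, 0,
              3, 4, 0, 7),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("cat", "dog", "fish"),
                            c("eat", "run", "swim", "pet")))

row_nnz <- rowSums(M != 0)  # number of nonzero cells per target term (row)
col_nnz <- colSums(M != 0)  # number of nonzero cells per feature term (column)
```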

Arguments

filename: the name of a directory containing files with the raw DSM data.
encoding: character encoding of the input files, which will automatically be converted to R's internal representation if possible. See the ‘Encoding’ section of the help page for file for details.
verbose: if TRUE, a few progress and information messages are shown.
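A minimal call might look like the following sketch. The directory name my_dsm_export is hypothetical, as is the assumption that the input files are UTF-8 encoded; the guard simply skips the call when the wordspace package or the directory is not available.

```r
# Hypothetical directory holding a UCS export
# (M or M.mtx, rows.tbl, cols.tbl, globals.tbl)
ucs_dir <- "my_dsm_export"

model <- NULL
if (requireNamespace("wordspace", quietly = TRUE) && dir.exists(ucs_dir)) {
  # Assumption: the input files are UTF-8 encoded; adjust to match your data
  model <- wordspace::read.dsm.ucs(ucs_dir, encoding = "UTF-8", verbose = TRUE)
}
```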

File Format

The UCS export format is a directory containing the following files with the specified names:

M or M.mtx: cooccurrence matrix (dense, plain text) or sparse matrix (MatrixMarket format)
rows.tbl: row information (labels term, marginal frequencies f)
cols.tbl: column information (labels term, marginal frequencies f)
globals.tbl: table with single row containing global variables; must include variable N specifying sample size

Each individual file may be compressed with an additional filename extension .gz, .bz2 or .xz; read.dsm.ucs automatically decompresses such files when loading them.
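Before loading, it can be useful to check which of the expected files a directory actually contains. The helper below is a hypothetical base R sketch (find_ucs_files is not part of wordspace); it only looks at file names, including the optional compression extensions listed above, and does not validate file contents.

```r
# Hypothetical helper (not part of wordspace): report which UCS export files
# are present in a directory, allowing for optional compression extensions.
find_ucs_files <- function(dir) {
  base_names <- c("M", "M.mtx", "rows.tbl", "cols.tbl", "globals.tbl")
  extensions <- c("", ".gz", ".bz2", ".xz")
  vapply(base_names, function(name) {
    candidates <- file.path(dir, paste0(name, extensions))
    hit <- candidates[file.exists(candidates)]
    # Return the first matching file name, or NA if none exists
    if (length(hit) > 0) basename(hit[1]) else NA_character_
  }, character(1))
}

# Demonstration on a temporary directory with empty placeholder files
demo_dir <- file.path(tempdir(), "ucs_demo")
dir.create(demo_dir, showWarnings = FALSE)
invisible(file.create(file.path(
  demo_dir, c("M.mtx.gz", "rows.tbl", "cols.tbl.gz", "globals.tbl")
)))
found <- find_ucs_files(demo_dir)
```

In this demonstration, the entry for M is NA because only the sparse variant M.mtx.gz was created; the remaining entries hold the file names that were found.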

Author

Stephanie Evert (https://purl.org/stephanie.evert)

References

The UCS toolkit is a software package for collecting and manipulating co-occurrence data, available from http://www.collocations.de/software.html.

UCS relies on compressed text files as its main storage format. They can be exported as a DSM with ucs-tool export-dsm-matrix.