This function loads raw DSM data -- a cooccurrence frequency matrix and tables of marginal frequencies -- in UCS export format. The data are read from a directory containing several text files with predefined names, which can optionally be compressed (see ‘File Format’ below for details).
read.dsm.ucs(filename, encoding = getOption("encoding"), verbose = FALSE)
filename: the name of a directory containing files with the raw DSM data.
encoding: character encoding of the input files, which will automatically be converted to R's internal representation if possible. See ‘Encoding’ in the documentation for file for details.
verbose: if TRUE, a few progress and information messages are shown
An object of class dsm containing a dense or sparse DSM.
Note that the information tables for target terms (field rows) and feature terms (field cols) include the correct marginal frequencies from the UCS export files. Nonzero counts for rows and columns are added automatically unless they are already present in the disk files. Additional fields from the information tables as well as all global variables are preserved with their original names.
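A minimal usage sketch follows. It assumes that the wordspace package (which provides read.dsm.ucs) is installed; the wrapper name load_ucs_export and the directory path in the usage line are hypothetical and only serve as illustration.

```r
# Hypothetical convenience wrapper around read.dsm.ucs().
# Assumes the 'wordspace' package is installed.
load_ucs_export <- function(ucs_dir, encoding = getOption("encoding")) {
  # The function expects a directory, not a single file
  if (!dir.exists(ucs_dir)) {
    stop("not a directory: ", ucs_dir)
  }
  if (!requireNamespace("wordspace", quietly = TRUE)) {
    stop("package 'wordspace' is required")
  }
  model <- wordspace::read.dsm.ucs(ucs_dir, encoding = encoding, verbose = TRUE)
  # Inspect the information tables for target terms (rows)
  # and feature terms (cols) described above
  str(model$rows)
  str(model$cols)
  model
}
```

It could be called as load_ucs_export("path/to/export_dir"), where the path is a placeholder for a real UCS export directory.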
The UCS export format is a directory containing the following files with the specified names:
M or M.mtx: cooccurrence matrix (dense, plain text) or sparse matrix (MatrixMarket format)
rows.tbl: row information (labels term, marginal frequencies f)
cols.tbl: column information (labels term, marginal frequencies f)
globals.tbl: table with a single row containing global variables; must include variable N specifying sample size
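Before loading, it can be useful to verify that a directory follows this layout. The sketch below uses only base R; the helper names find_ucs_file and check_ucs_export are hypothetical and not part of any package. Since the files can optionally be compressed, each name is also tried with the extensions .gz, .bz2 and .xz.

```r
# Extensions to try for every file name ("" = uncompressed)
ucs_suffixes <- c("", ".gz", ".bz2", ".xz")

# Hypothetical helper: return the first existing variant of `base`
# in `dir`, or NA if none exists.
find_ucs_file <- function(dir, base) {
  candidates <- file.path(dir, paste0(base, ucs_suffixes))
  hit <- candidates[file.exists(candidates)]
  if (length(hit) > 0) hit[1] else NA_character_
}

# Hypothetical helper: check that all four components are present.
check_ucs_export <- function(dir) {
  # The matrix is stored either dense (M) or sparse (M.mtx)
  matrix_file <- find_ucs_file(dir, "M")
  if (is.na(matrix_file)) matrix_file <- find_ucs_file(dir, "M.mtx")
  found <- c(
    matrix  = matrix_file,
    rows    = find_ucs_file(dir, "rows.tbl"),
    cols    = find_ucs_file(dir, "cols.tbl"),
    globals = find_ucs_file(dir, "globals.tbl")
  )
  missing <- names(found)[is.na(found)]
  if (length(missing) > 0) {
    stop("missing UCS export file(s): ", paste(missing, collapse = ", "))
  }
  found
}
```

The function returns a named character vector of the paths it found, so the result also shows which variant (plain or compressed, dense or sparse) is present on disk.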
Each individual file may be compressed, with an additional filename extension .gz, .bz2 or .xz; read.dsm.ucs automatically decompresses such files when loading them.
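As an illustration of how such transparent decompression can work in base R (not necessarily how read.dsm.ucs implements it): a gzfile() connection opened for reading also accepts uncompressed files and files compressed with bzip2 or xz. The sketch writes the same placeholder lines in all four variants and reads each back through gzfile(). The file contents are arbitrary and do not represent the real table format.

```r
# Placeholder content; NOT the actual rows.tbl format
lines <- c("first line", "second line")

tmp <- tempfile("ucs_demo_")
dir.create(tmp)

# One writer per storage variant, and the matching file name
writers <- list(plain = file, gz = gzfile, bz2 = bzfile, xz = xzfile)
paths <- c(plain = "rows.tbl", gz = "rows.tbl.gz",
           bz2 = "rows.tbl.bz2", xz = "rows.tbl.xz")

read_back <- list()
for (kind in names(writers)) {
  path <- file.path(tmp, paths[[kind]])
  # Write with the format-specific connection
  con <- writers[[kind]](path, open = "wt")
  writeLines(lines, con)
  close(con)
  # Read with gzfile(), which detects the compression type itself
  con <- gzfile(path, open = "rt")
  read_back[[kind]] <- readLines(con)
  close(con)
}

unlink(tmp, recursive = TRUE)
```

After the loop, every element of read_back should hold the same two lines, regardless of how the file was stored.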
The UCS toolkit is a software package for collecting and manipulating co-occurrence data available from http://www.collocations.de/software.html.
UCS relies on compressed text files as its main storage format. They can be exported as a DSM with ucs-tool export-dsm-matrix.