splitStrings: Construct unigram and bigram matrices from a vector of strings

Description

A (possibly large) vector of strings is separated into sparse pattern matrices, which allows for efficient computation on the strings.

Usage

splitStrings(strings, sep = "", bigrams = TRUE, boundary = TRUE,
	bigram.binder = "", gap.symbol = "\u2043", left.boundary = "#",
	right.boundary = "#", simplify = FALSE)

Value

By default, the output is a list of six elements:

segments: A vector with all split parts (i.e. all tokens) in order of occurrence, with gap symbols separating the parts of one original string from those of the next.
unigrams: A vector with all unique parts occurring in the segments.
bigrams: Only present when bigrams = T. A vector with all unique bigrams.
SW: A sparse pattern matrix of class ngCMatrix specifying the distribution of segments (S) over the original strings (W, think 'words'). This matrix is only interesting in combination with the following matrices.
US: A sparse pattern matrix of class ngCMatrix specifying the distribution of the unique unigrams (U) over the tokenized segments (S).
BS: Only present when bigrams = T. A sparse pattern matrix of class ngCMatrix specifying the distribution of the unique bigrams (B) over the tokenized segments (S).

When simplify = T the output is a single sparse matrix of class dgCMatrix. This is basically BS %*% SW (when bigrams = T) or US %*% SW (when bigrams = F) with row and column names added to the matrix.
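
To make the product US %*% SW concrete, the following is a minimal sketch that builds toy versions of the two pattern matrices by hand with the Matrix package. It is not the internal implementation of splitStrings: gap symbols and boundaries are left out, and the variable names (words, word_of_segment, counts) are chosen only for this illustration.

```r
library(Matrix)

# Two toy strings, split into single characters (hypothetical data)
words    <- c("abba", "ab")
segments <- unlist(strsplit(words, ""))    # "a" "b" "b" "a" "a" "b"
unigrams <- sort(unique(segments))         # "a" "b"

# SW: segments (rows) by original strings (columns)
word_of_segment <- rep(seq_along(words), nchar(words))
SW <- sparseMatrix(i = seq_along(segments), j = word_of_segment,
                   dims = c(length(segments), length(words)))

# US: unique unigrams (rows) by segments (columns)
US <- sparseMatrix(i = match(segments, unigrams), j = seq_along(segments),
                   dims = c(length(unigrams), length(segments)))

# The product counts how often each unigram occurs in each string
counts <- US %*% SW
dimnames(counts) <- list(unigrams, words)
counts
```

Each column of the result describes one original string and each row one unigram, which is the shape of the matrix described above for simplify = T with bigrams = F.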

Arguments

strings: Vector of strings to be separated into sparse matrices
sep: Separator used to split the strings into parts. This will be passed to strsplit internally, so there is no fine-grained control possible over the splitting. If it is important to get the splitting exactly right, consider pre-processing the splitting by inserting a special symbol on the split-positions, and then choosing to split by this specific symbol.
bigrams: By default, both unigrams and bigrams are computed. If bigrams are not needed, setting bigrams = F will save on resources.
boundary: Should a start symbol and a stop symbol be added to each string? This will only be used for the determination of bigrams, and will be ignored if bigrams = F.
bigram.binder: Only when bigrams = T. What symbol(s) should occur between the two parts of the bigram?
gap.symbol: Only when bigrams = T. What symbol should be included to separate the strings? It defaults to U+2043 HYPHEN BULLET on the assumption that this character will not often be included in data. See pwMatrix for some more explanation about the necessity of this gap symbol.
left.boundary, right.boundary: Symbols to be used as boundaries, only used when boundary = T.
simplify: By default, various vectors and matrices are returned. However, when simplify = T, only a single sparse matrix is returned. See Value.
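
The pre-processing suggested for sep can be sketched in base R as follows. This is only an illustration under assumptions: the rule that "th" counts as one unit, the choice of a space as the marker symbol, and the names raw, marked and parts are all hypothetical.

```r
# Hypothetical input in which "th" should be treated as a single unit
raw <- c("this", "that")

# Insert a marker (here a space) after every unit: "th" if present, else one character
marked <- gsub("(th|.)", "\\1 ", raw, perl = TRUE)
marked <- sub(" $", "", marked)            # drop the trailing marker

# Splitting on the marker, as strsplit would do with sep = " "
parts <- strsplit(marked, " ")
parts
```

The marked strings could then be passed on with the marker as separator, e.g. splitStrings(marked, sep = " "). The marker should be a symbol that does not otherwise occur in the data.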

Author

Michael Cysouw

Examples

# a simple example to see the function at work
example <- c("this","is","an","example")
splitStrings(example)
splitStrings(example, simplify = TRUE)

# \donttest{
# a bit larger, but still quick and efficient
# taking 15526 wordforms from the English Darby Bible and splitting them into bigrams
data(bibles)
words <- splitText(bibles$eng)$wordforms
system.time( S <- splitStrings(words, simplify = TRUE) )

# and then taking the cosine similarity between the bigram-vectors for all word pairs
system.time( sim <- cosSparse(S) )

# most similar words to "father"
sort(sim["father",], decreasing = TRUE)[1:20]
# }