splitWordlist: Construct sparse matrices from comparative wordlists (aka `Swadesh list')

Description

A comparative wordlist (aka `Swadesh list') is a collection of wordforms from different languages, which are translations of a selected set of meanings. This function dismantles this data structure into a set of sparse matrices.

Usage

splitWordlist(data,
	doculects = "DOCULECT", concepts = "CONCEPT", counterparts = "COUNTERPART",
	splitstrings = TRUE, sep =  "", bigram.binder = "", grapheme.binder = "_", 
	simplify = FALSE)

Value

There are four possible outputs, depending on the options chosen.

By default, when splitstrings = T, simplify = F, a list of the following 15 objects is returned. It starts with 8 character vectors, which serve as the row/column names of the 7 sparse pattern matrices that follow. The object names are chosen to be easy to remember.

doculects: Character vector with names of doculects in the data
concepts: Character vector with names of concepts in the data
words: Character vector with all words, i.e. unique counterparts per language. The same string in the same language is only included once, but an identical string occurring in different doculects is included separately for each doculect.
segments: Character vector with all unigram-tokens in order of appearance, including boundary symbols and gap symbols (see splitStrings for more information about the gap symbols)
unigrams: Character vector with all unique unigrams in the data
bigrams: Character vector with all unique bigrams in the data
graphemes: Character vector with all unique graphemes (i.e. combinations of unigrams+doculects) occurring in the data
digraphs: Character vector with all unique digraphs (i.e. combinations of bigrams+doculects) occurring in the data
DW: Sparse pattern matrix of class ngCMatrix linking doculects (D) to words (W)
CW: Sparse pattern matrix of class ngCMatrix linking concepts (C) to words (W)
SW: Sparse pattern matrix of class ngCMatrix linking all token-segments (S) to words (W)
US: Sparse pattern matrix of class ngCMatrix linking unigrams (U) to segments (S)
BS: Sparse pattern matrix of class ngCMatrix linking bigrams (B) to segments (S)
GS: Sparse pattern matrix of class ngCMatrix linking language-specific graphemes (G) to segments (S)
TS: Sparse pattern matrix of class ngCMatrix linking digraphs (T, as no other letter was available) to segments (S)
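To make the relation between the character vectors and the pattern matrices more concrete, here is a minimal sketch that builds DW and CW by hand for a toy wordlist, using only the Matrix package. The toy data, the helper link() and the tab-joined word key are assumptions made for illustration; this is not the actual implementation of splitWordlist.

```r
library(Matrix)

# Toy wordlist (hypothetical data, not taken from the huber dataset)
toy <- data.frame(
  DOCULECT    = c("A", "A", "B", "B"),
  CONCEPT     = c("hand", "arm", "hand", "arm"),
  COUNTERPART = c("mano", "mano", "ku", "te"),
  stringsAsFactors = FALSE
)

doculects <- unique(toy$DOCULECT)
concepts  <- unique(toy$CONCEPT)

# A word is a unique counterpart within one doculect, so the key
# combines doculect and string (the tab separator is an assumption)
word_key <- paste(toy$DOCULECT, toy$COUNTERPART, sep = "\t")
words    <- unique(word_key)
w        <- match(word_key, words)

# Hypothetical helper: pattern matrix with TRUE at each (row, col) pair
link <- function(rows, cols, nr, nc) {
  ij <- unique(cbind(rows, cols))
  sparseMatrix(i = ij[, 1], j = ij[, 2], dims = c(nr, nc))
}

# DW links doculects to words, CW links concepts to words
DW <- link(match(toy$DOCULECT, doculects), w, length(doculects), length(words))
CW <- link(match(toy$CONCEPT, concepts), w, length(concepts), length(words))

# Number of words shared by each pair of concepts (colexification)
sim <- as.matrix(tcrossprod(CW * 1))
dimnames(sim) <- list(concepts, concepts)
sim
```

In this toy example the string "mano" in doculect A is counted as one word that expresses both "hand" and "arm", so the two concepts share one column in CW.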

When splitstrings = F, simplify = F, only the following objects from the above list are returned:

doculects: Character vector with names of doculects in the data
concepts: Character vector with names of concepts in the data
words: Character vector with all words, i.e. unique counterparts per language. The same string in the same language is only included once, but an identical string occurring in different doculects is included separately for each doculect.
DW: Sparse pattern matrix of class ngCMatrix linking doculects (D) to words (W)
CW: Sparse pattern matrix of class ngCMatrix linking concepts (C) to words (W)

When splitstrings = T, simplify = T, only the bigram-separation is returned, and all row and column names are included in the matrices. However, for reasons of space, the words vector is only included once:

DW: Sparse pattern matrix of class ngCMatrix linking doculects (D) to words (W). Doculects are in the rownames, colnames are left empty.
CW: Sparse pattern matrix of class ngCMatrix linking concepts (C) to words (W). Concepts are in the rownames, colnames are left empty.
BW: Sparse pattern matrix of class ngCMatrix linking bigrams (B) to words (W). Bigrams (note: not digraphs!) are in the rownames. This matrix includes all words as colnames.

Finally, when splitstrings = F, simplify = T, only the following subset of the above is returned.

DW: Sparse pattern matrix of class ngCMatrix linking doculects (D) to words (W). Doculects are in the rownames, colnames are left empty.
CW: Sparse pattern matrix of class ngCMatrix linking concepts (C) to words (W). Concepts are in the rownames, colnames are left empty.

Arguments

data: A dataframe or matrix with each row describing a combination of language (DOCULECT), meaning (CONCEPT) and translation (COUNTERPART).
doculects, concepts, counterparts: The name (or number) of the column of data in which the respective information is to be found. The defaults are set to coincide with the naming of the example dataset included in this package: huber.
splitstrings: Should the counterparts be separated into unigrams and bigrams (using splitStrings)?
sep: Separator to be passed to splitStrings to specify where to split the strings. Only used when splitstrings = T, ignored otherwise.
bigram.binder: Separator to be passed to splitStrings to be inserted between the parts of the bigrams
grapheme.binder: Separator to be used to separate a grapheme from the language name. Graphemes are language-specific symbols (i.e. the 'a' in one language is not assumed to be the same as the 'a' from another language).
simplify: Should the output be reduced to the most important matrices only, with the row and column names included in the matrices? Defaults to simplify = F, separating everything into different objects. See the Value section for details on the format of the results.
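As a rough illustration of the expected input format and of what language-specific graphemes are, the following base-R sketch builds a toy data frame and pairs each character with its doculect. The data, the variable names and the order in which symbol and doculect are joined are assumptions for illustration; the labels actually produced by splitWordlist may look different.

```r
# Toy input in the expected long format: one row per
# doculect/concept/counterpart combination (hypothetical data)
toy <- data.frame(
  DOCULECT    = c("A", "B"),
  CONCEPT     = c("hand", "hand"),
  COUNTERPART = c("ma", "am"),
  stringsAsFactors = FALSE
)

# Split each counterpart into single characters, which is roughly
# what an empty separator (sep = "") asks for
chars <- strsplit(toy$COUNTERPART, split = "")

# Pair every character with its doculect, joined by "_"
# (compare the default of grapheme.binder)
graphemes <- unlist(Map(function(ch, d) paste(ch, d, sep = "_"),
                        chars, toy$DOCULECT))
graphemes
```

Both toy words use the same two characters, yet the sketch yields four distinct graphemes, because the 'a' of doculect A and the 'a' of doculect B are kept apart.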

Author

Michael Cysouw

Details

The meanings that are selected for a wordlist are called CONCEPTS here, and the translations into the various languages COUNTERPARTS (following Poornima & Good 2010). The languages are called DOCULECTS (`documented lects') to generalize over their status as dialects, languages, or even small families (following Cysouw & Good 2013).

References

Cysouw, Michael & Jeff Good. 2013. Languoid, Doculect, Glossonym: Formalizing the notion “language”. Language Documentation and Conservation 7. 331-359.

Poornima, Shakthi & Jeff Good. 2010. Modeling and Encoding Traditional Wordlists for Machine Applications. Proceedings of the 2010 Workshop on NLP and Linguistics: Finding the Common Ground.

Examples

# ----- load data -----

# an example wordlist, see the help(huber) for details
data(huber)

# ----- show output -----

# a selection, to see the result of splitWordlist
# only show the simplified output here, 
# the full output is rather long even for just these six words
sel <- c(1:3, 1255:1258)
splitWordlist(huber[sel,], simplify = TRUE)

# ----- split complete data -----

# splitting the complete wordlist is a lot of work !
# it won't get much quicker than this
# most time goes into the string-splitting of the almost 26,000 words
# Default version, including splitStrings:
system.time( H <- splitWordlist(huber) )

# Simplified version without splitStrings is much quicker:
system.time( H <- splitWordlist(huber, splitstrings = FALSE, simplify = TRUE) )

# ----- investigate colexification -----

# The simple version can be used to check how often two concepts 
# are expressed identically across all languages ('colexification')
H <- splitWordlist(huber, splitstrings = FALSE, simplify = TRUE)
sim <- tcrossprod(H$CW*1)

# select only the frequent colexifications for a quick visualisation
diag(sim) <- 0
sim <- drop0(sim, tol = 5)
sim <- sim[rowSums(sim) > 0, colSums(sim) > 0]

if (FALSE) {
# this might lead to errors on some platforms because of non-ASCII symbols
plot( hclust(as.dist(-sim), method = "average"), cex = .5)
}

# ----- investigate regular sound correspondences -----

# One central problem with data from many languages is the variation of orthography
# It is preferred to solve that problem separately
# e.g. check the column "TOKENS" in the huber data
# This is a grapheme-separated version of the data.
# can be used to investigate co-occurrence of graphemes (approx. phonemes)
H <- splitWordlist(huber, counterparts = "TOKENS", sep = " ")

# co-occurrence of all pairs of the 2150 different graphemes through all languages
system.time( G <- assocSparse( (H$CW*1) %*% t(H$SW*1) %*% t(H$GS*1), method = poi))
rownames(G) <- colnames(G) <- H$graphemes
G <- drop0(G, tol = 1)

# select only one language pair for a quick visualisation
# check the nice sound changes between bora and muinane!
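# use the boolean product %&% so that GD is a logical grapheme-by-doculect matrix
# (recent versions of Matrix return numeric counts from %*% on pattern matrices)
GD <- H$GS %&% H$SW %&% t(H$DW)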
colnames(GD) <- H$doculects
correspondences <- G[GD[,"bora"],GD[,"muinane"]]

if (FALSE) {
# this might lead to errors on some platforms because of non-ASCII symbols
heatmap(as.matrix(correspondences))
}
