sim.words: Similarity measures for words between two languages, based on co-occurrences in parallel text

Description

Based on co-occurrences in a parallel text, this convenience function (a wrapper around various other functions from this package) efficiently computes word-to-word similarities that approximate translational equivalence.

Usage

sim.words(text1, text2 = NULL, method = res, weight = NULL, 
	lowercase = TRUE, best = FALSE, tol = 0)

Value

When best = FALSE, a single sparse matrix of type dgCMatrix is returned, containing the values of the chosen statistic. All unique wordforms of text1 are included as row names, and those of text2 as column names.

When best = TRUE, a list of two sparse matrices is returned:

sim: the same matrix as above
best: a sparse pattern matrix of type ngCMatrix with the same dimensions as the previous matrix. Only the 'best' translations between the two languages are marked.

Arguments

text1, text2: Vectors of strings representing sentences. The names of the vectors should contain IDs that identify the parallelism between the two texts. If there are no specific names, the function assumes that the two vectors are perfectly parallel. Within the strings, wordforms are simply separated based on spaces (i.e. everything between two spaces is a wordform). For more details about the format assumptions, see splitText, which is used internally here.
method: Method to be used as a co-occurrence statistic. See assocSparse for a detailed presentation of the available methods. It is possible to define your own statistic, provided it can be formulated as a function of observed and expected frequencies.
weight: When weight is specified, the function cosSparse is used for the co-occurrence statistics (with a Euclidean normalization, i.e. norm2), using the specified weight function. Currently idf, sqrt, and none are available. For more details, and for instructions on how to formulate your own weight function, see the discussion at cosSparse. When weight is specified, any specification of method is ignored.
lowercase: Should all words be turned into lowercase? See splitText for a discussion of how this is implemented.
best: When best = TRUE, an additional sparse matrix is returned with a (simplistic) attempt to find the best translational equivalents between the texts.
tol: Tolerance: remove all values between -tol and +tol in the result. Low values can mostly be ignored for co-occurrence statistics without any loss of information. However, what is considered 'low' depends on the method used to calculate the statistics. See the discussion under Details.

Author

Michael Cysouw

Details

Care is taken in this function to match multiple verses that are translated into one verse. See bibles for a survey of the encoding assumptions made here.

The parameter method can take anything that is also available for assocSparse. Similarities are computed using that function.

When weight is specified, the similarities are computed using cosSparse with the default setting norm = norm2. All weights available for cosSparse can also be used here.

The option best = TRUE uses rowMax and colMax. This approach to finding the 'best' translation is really crude, but it works reasonably well in one-to-one and many-to-one situations. This option takes considerably more time to finish, as computing row-wise maxima of matrices is not trivial to optimize. Consider raising tol, as this removes low values that will not matter for the maxima anyway. See the examples below.

Guidelines for the value of tol are difficult to give, as a suitable value depends on the method used, but also on the distribution of the data (i.e. the number of sentences, and the frequency distribution of the words in the text). Some suggestions:

When weight is specified, results range between -1 and +1. In that case tol = 0.1 should never lead to problems, and often even tol = 0.3 or higher will lead to identical results.
When weight is not specified (i.e. assocSparse is used), results range between -Inf and +Inf, so choosing a tolerance is more problematic. In general, tol = 2 seems to be unproblematic. A higher tolerance, e.g. tol = 10, can be used to find the 'obvious' translations, but you will lose some of the more incidental co-occurrences.

References

Mayer, Thomas and Michael Cysouw. 2012. Language comparison through sparse multilingual word alignment. Proceedings of the EACL 2012 Joint Workshop of LINGVIS & UNCLH, 54–62. Avignon: Association for Computational Linguistics.

Examples


data(bibles)

# ----- small example of co-occurrences -----

# as an example, just take partially overlapping parts of two bibles
# sim.words uses the names to get the parallelism right, so this works
eng <- bibles$eng[1:5000]
deu <- bibles$deu[2000:7000]
sim <- sim.words(eng, deu, method = res)

# but the statistics are not perfect (because there is too little data)
# sorted co-occurrences for the English word "your" in German:
sort(sim["your",], decreasing = TRUE)[1:10]
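
# ----- effect of tol on sparsity (illustrative sketch) -----

# tol removes all values between -tol and +tol, so the result should
# contain fewer stored non-zero entries. With weight specified, tol = 0.1
# is the value suggested as safe under Details.
# Assumptions: nnzero() is taken from the Matrix package; the object
# names sim.all and sim.tol are only used for this illustration.
sim.all <- sim.words(eng, deu, weight = idf)
sim.tol <- sim.words(eng, deu, weight = idf, tol = 0.1)

# compare the number of non-zero entries before and after thresholding
Matrix::nnzero(sim.all)
Matrix::nnzero(sim.tol)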

# \donttest{
# ----- complete example of co-occurrences -----

# running the complete bibles takes a bit more time (but still manageable)
system.time(sim <- sim.words(bibles$eng, bibles$deu, method = res))

# results are much better
# sorted co-occurrences for the English word "your" in German:
sort(sim["your",], decreasing = TRUE)[1:10]


# ----- look for 'best' translations -----

# note that selecting the 'best' takes even more time
system.time(sim2 <- sim.words(bibles$eng, bibles$deu, method = res, best = TRUE))

# best co-occurrences for the English word "your"
which(sim2$best["your",])

# but this can be made faster by removing low values
# (though a suitable boundary, here tol = 5, depends on the method used)
system.time(sim3 <- sim.words(bibles$eng, bibles$deu, best = TRUE, method = res, tol = 5))

# note that the decision on the 'best' remains the same here
all.equal(sim2$best, sim3$best)
# }

# ----- computations also work with other languages -----

# Everything works in a completely language-independent way
# translations for 'we' in Tagalog:
sim <- sim.words(bibles$eng, bibles$tgl, best = TRUE, weight = idf, tol = 0.1)
which(sim$best["we",])
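
# ----- user-defined co-occurrence statistic (illustrative sketch) -----

# method can be any function of observed (o) and expected (e) frequencies.
# Assumptions: 'signed.chisq' is a hypothetical name, not part of the
# package, and the function is assumed to receive o and e as numeric
# vectors, like the built-in statistics (see assocSparse).
# It squares the deviation from expectation while keeping its sign, so
# strong associations stand out more than with the default method = res.
signed.chisq <- function(o, e) { sign(o - e) * (o - e)^2 / e }

# a small subset keeps the running time short
sim4 <- sim.words(bibles$eng[1:5000], bibles$deu[1:5000], method = signed.chisq)
sort(sim4["your",], decreasing = TRUE)[1:10]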
