textrank_candidates_lsh: Use locality-sensitive hashing to get combinations of sentences which contain words in the same minhash bucket

Description

This functionality is useful if there are a lot of sentences and most of the sentences have no words in common. Instead of computing the Jaccard distance among all possible combinations of sentences, as is done by textrank_candidates_all, we can reduce the number of combinations by using the Minhash algorithm. This function sets up the combinations of sentences which are in the same Minhash bucket.
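To make the cost of the exhaustive approach concrete, here is a minimal base R sketch of an all-pairs comparison. It is only an illustration: the helper jaccard_sketch and the three toy sentences are hypothetical and are not part of the package, and it computes a Jaccard similarity (shared terms divided by all distinct terms) rather than reproducing the package internals.

```r
# Hypothetical helper: Jaccard similarity between two sets of terms
jaccard_sketch <- function(a, b) {
  a <- unique(a)
  b <- unique(b)
  length(intersect(a, b)) / length(union(a, b))
}

# Toy sentences, each represented by its terms (made-up data)
toy_sentences <- list(
  s1 = c("data", "scientist", "team"),
  s2 = c("data", "engineer", "team"),
  s3 = c("coffee", "machine")
)

# Every combination of 2 sentences is compared, even pairs without shared terms
all_pairs <- t(combn(names(toy_sentences), 2))
sims <- apply(all_pairs, 1, function(p) {
  jaccard_sketch(toy_sentences[[p[1]]], toy_sentences[[p[2]]])
})
data.frame(textrank_id_1 = all_pairs[, 1],
           textrank_id_2 = all_pairs[, 2],
           jaccard = sims)

# The number of pairs grows quadratically with the number of sentences
n_pairs_1000 <- choose(1000, 2)
```

With 1000 sentences the exhaustive approach already has to look at choose(1000, 2) pairs, most of which share nothing, which is the work that bucketing by minhash tries to avoid.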

Usage

textrank_candidates_lsh(x, sentence_id, minhashFUN, bands)

Arguments

x

a character vector of words or terms

sentence_id

a character vector of identifiers of the sentences in which the words/terms provided in x occur. The length of sentence_id should be the same as the length of x.

minhashFUN

a function which returns a minhash of a character vector. See the examples or look at minhash_generator

bands

integer indicating the number of bands in which to break down the minhashes. Note that the number of minhash signatures should always be a multiple of the number of locality-sensitive hashing bands. See the example.
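The trade-off behind the choice of bands can be sketched with the standard banding formula for locality-sensitive hashing: with h minhashes split into b bands of h / b rows each, two sets with Jaccard similarity s share at least one band with probability 1 - (1 - s^(h/b))^b. The function lsh_prob_sketch below is a hypothetical base R helper written for illustration, on the assumption that this standard formula applies; it also shows why h has to be a multiple of b.

```r
# Hypothetical helper: probability that two sets with Jaccard similarity s
# end up in the same bucket in at least one band
lsh_prob_sketch <- function(h, b, s) {
  # The minhashes must split evenly over the bands
  stopifnot(h %% b == 0)
  rows_per_band <- h / b
  1 - (1 - s^rows_per_band)^b
}

# 1000 minhashes in 500 bands (2 rows per band), 10 percent overlap
p_many_bands <- lsh_prob_sketch(h = 1000, b = 500, s = 0.1)

# Fewer bands mean more rows per band, so weak overlaps are found less often
p_few_bands <- lsh_prob_sketch(h = 1000, b = 100, s = 0.1)
```

More bands make the candidate selection more permissive, fewer bands make it stricter. A call such as lsh_prob_sketch(h = 1000, b = 300, s = 0.1) stops with an error because 1000 is not a multiple of 300.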

Value

a data.frame with 2 columns, textrank_id_1 and textrank_id_2, containing the identifiers (sentence_id) of sentences which contained terms in the same minhash bucket. This data.frame can be used as input in the textrank_sentences algorithm.
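As a rough illustration of how sentences in the same bucket become rows of such a data.frame, the base R sketch below bands a small matrix of made-up minhash signatures and pairs up the sentences that agree on a whole band. The signature values, the variable names and the pairing logic are assumptions made for this example; this is not the implementation used by the package.

```r
# Made-up minhash signatures: one row per sentence, 4 minhashes each
signatures <- rbind(
  s1 = c(11L, 42L, 7L, 93L),
  s2 = c(11L, 42L, 8L, 55L),
  s3 = c(60L, 13L, 7L, 93L),
  s4 = c(5L, 6L, 1L, 2L)
)
bands <- 2
stopifnot(ncol(signatures) %% bands == 0)
rows_per_band <- ncol(signatures) / bands
band_of_column <- rep(seq_len(bands), each = rows_per_band)

# One bucket key per sentence and band: the band number plus the hashes in it
buckets <- do.call(rbind, lapply(seq_len(bands), function(b) {
  cols <- which(band_of_column == b)
  key <- apply(signatures[, cols, drop = FALSE], 1, paste, collapse = "-")
  data.frame(sentence_id = rownames(signatures),
             bucket = paste(b, key, sep = ":"),
             stringsAsFactors = FALSE)
}))

# Sentences sharing a bucket are candidate pairs; keep each pair once
pairs <- merge(buckets, buckets, by = "bucket")
pairs <- pairs[pairs$sentence_id.x < pairs$sentence_id.y, ]
toy_candidates <- unique(data.frame(textrank_id_1 = pairs$sentence_id.x,
                                    textrank_id_2 = pairs$sentence_id.y,
                                    stringsAsFactors = FALSE))
toy_candidates
```

In this toy input s1 and s2 agree on the first band and s1 and s3 on the second, so only those two pairs are kept, while s4 shares no bucket and never has to be compared with anything.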

Examples

# NOT RUN {
library(textrank)
library(textreuse)
library(udpipe)
# Detection probability for 1000 minhashes split into 500 bands:
# a 10 percent Jaccard overlap will be detected well
lsh_probability(h = 1000, b = 500, s = 0.1)

# Minhash function with 1000 hashes, a multiple of the 500 bands used below
minhash <- minhash_generator(n = 1000, seed = 123456789)

# Build one identifier per sentence and keep nouns and adjectives as terminology
data(joboffer)
joboffer$textrank_id <- unique_identifier(joboffer, c("doc_id", "paragraph_id", "sentence_id"))
sentences <- unique(joboffer[, c("textrank_id", "sentence")])
terminology <- subset(joboffer, upos %in% c("NOUN", "ADJ"), select = c("textrank_id", "lemma"))

# Candidate sentence pairs which share a minhash bucket
candidates <- textrank_candidates_lsh(x = terminology$lemma, sentence_id = terminology$textrank_id,
                                      minhashFUN = minhash, bands = 500)
head(candidates)

# Rank sentences using only the candidate pairs
tr <- textrank_sentences(data = sentences, terminology = terminology,
                         textrank_candidates = candidates)
summary(tr, n = 2)
# }
