Extracts n-grams and phrases from a collection of documents that has been preprocessed by the corenlp() function.
ngrams(tokenized_documents = NULL, tokenized_documents_directory = NULL,
output_directory = NULL, file_list = NULL, ngram_lengths = c(1, 2, 3),
remove_punctuation = TRUE, remove_numeric = TRUE, JK_filtering = FALSE,
verb_filtering = FALSE, phrase_extraction = FALSE,
return_tag_patterns = FALSE, lemmatize = FALSE, lowercase = FALSE,
parallel = FALSE, cores = 2)
tokenized_documents: An optional list object output by the corenlp() or corenlp_blocked() functions, containing tokenized document data frames (one per document).
tokenized_documents_directory: An optional path to a directory containing CoreNLP_Output_x.Rdata files output by the corenlp() or corenlp_blocked() functions. Cannot be supplied in addition to the 'tokenized_documents' argument. See the sketch following the argument descriptions.
output_directory: If a tokenized_documents_directory is provided, an alternate output directory may be supplied, in which case n-gram extractions for each block are saved in that directory. If NULL, then output_directory will be set to tokenized_documents_directory.
file_list: An optional list of CoreNLP_Output_x.Rdata files to be used if the tokenized_documents_directory option is specified. Can be useful if the user only wants to process a subset of documents in the directory, such as when the corpus is extremely large.
ngram_lengths: A vector of n-gram lengths (in tokens) to be returned by the function. Defaults to c(1,2,3), which returns all unigrams, bigrams, and trigrams.
remove_punctuation: Removes any n-grams with at least one token containing one or more punctuation characters: [!"#$ ^_`|~]
remove_numeric: Removes any n-grams with at least one token containing one or more numerals (0-9).
JK_filtering: Defaults to FALSE. If TRUE, then bigrams and trigrams will be extracted and filtered according to the tag patterns described in Justeson, John S., and Slava M. Katz. "Technical terminology: some linguistic properties and an algorithm for identification in text." Natural Language Engineering 1.01 (1995): 9-27. Available: https://brenocon.com/JustesonKatz1995.pdf. The POS tag patterns used are: AN, NN, AAN, ANN, NAN, NNN, NPN. See the additional example at the end of this page.
verb_filtering: Defaults to FALSE. If TRUE, then short verb phrases will be extracted in a manner similar to that described in JK_filtering above. The POS tag patterns used are: VN, VAN, VNN, VPN, ANV, VDN.
phrase_extraction: Defaults to FALSE. If TRUE, then full phrases of arbitrary length will be extracted following the procedure described in Denny, O'Connor, and Wallach (2016). This method will produce the most phrases, of the highest quality, but will take significantly longer than other methods. Not currently implemented.
return_tag_patterns: Defaults to FALSE. If TRUE and either JK_filtering = TRUE, verb_filtering = TRUE, or phrase_extraction = TRUE, then the tag pattern matched in forming the n-gram/phrase will be returned as an accompanying vector.
lemmatize: If TRUE, then n-grams are constructed out of lemmatized tokens. Defaults to FALSE.
lowercase: If TRUE, all n-grams are lowercased before being returned. Defaults to FALSE.
parallel: Logical: should documents be processed in parallel? Defaults to FALSE.
cores: Number of cores to be used if parallel = TRUE. Defaults to 2.
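As an illustration of the directory-based workflow described above, a call might look like the following sketch (both directory paths are hypothetical placeholders):

# Extract n-grams from CoreNLP_Output_x.Rdata files saved by corenlp_blocked(),
# processing blocks in parallel; extractions for each block are saved to
# output_directory as described above.
ngrams(tokenized_documents_directory = "~/corenlp_output",
       output_directory = "~/ngram_output",
       ngram_lengths = c(1, 2, 3),
       remove_punctuation = TRUE,
       remove_numeric = TRUE,
       parallel = TRUE,
       cores = 4)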
Returns a list of lists (one list per document) with entries for n-grams of each size specified in the ngram_lengths argument. May also return metadata if return_tag_patterns = TRUE.
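For example, the structure of the returned object can be inspected with str() after running ngrams() on the bundled Processed_Text data used in the example below; the exact entry names may vary, so treat this as a sketch:

# One list element per input document, with entries for each requested n-gram length.
data("Processed_Text")
res <- ngrams(tokenized_documents = Processed_Text,
              ngram_lengths = c(1, 2))
length(res)                   # number of documents
str(res[[1]], max.level = 1)  # entries for each n-gram length in the first document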
# NOT RUN {
data("Processed_Text")
NGrams <- ngrams(tokenized_documents = Processed_Text,
                 ngram_lengths = c(1, 2, 3),
                 remove_punctuation = TRUE,
                 remove_numeric = TRUE,
                 lowercase = TRUE,
                 parallel = FALSE,
                 cores = 1)
# }
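A sketch of Justeson and Katz filtering with tag patterns returned, as referenced in the JK_filtering and return_tag_patterns descriptions above (argument values are illustrative):

NGrams_JK <- ngrams(tokenized_documents = Processed_Text,
                    ngram_lengths = c(2, 3),
                    JK_filtering = TRUE,
                    return_tag_patterns = TRUE,
                    lowercase = TRUE)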