Extracts n-grams and phrases from a collection of documents that has been preprocessed by the corenlp() function.
ngrams(tokenized_documents = NULL, tokenized_documents_directory = NULL,
output_directory = NULL, file_list = NULL, ngram_lengths = c(1, 2, 3),
remove_punctuation = TRUE, remove_numeric = TRUE, JK_filtering = FALSE,
verb_filtering = FALSE, phrase_extraction = FALSE,
return_tag_patterns = FALSE, lemmatize = FALSE, lowercase = FALSE,
parallel = FALSE, cores = 2)
tokenized_documents: An optional list object output by the corenlp() or corenlp_blocked() functions, containing tokenized document data frames (one per document).
tokenized_documents_directory: An optional path to a directory containing CoreNLP_Output_x.Rdata files output by the corenlp() or corenlp_blocked() functions. Cannot be supplied in addition to the 'tokenized_documents' argument. See the sketch following the argument descriptions.
output_directory: If a tokenized_documents_directory is provided, an alternate output directory may be supplied, in which case n-gram extractions for each block are saved in that directory. If NULL, then output_directory will be set to tokenized_documents_directory.
file_list: An optional list of CoreNLP_Output_x.Rdata files to be used if the tokenized_documents_directory option is specified. Can be useful if the user only wants to process a subset of documents in the directory, such as when the corpus is extremely large.
ngram_lengths: A vector of n-gram lengths (in tokens) to be returned by the function. Defaults to c(1,2,3), which returns all unigrams, bigrams, and trigrams.
remove_punctuation: Removes any n-grams with at least one token containing one or more punctuation characters: [!"#$ ^_`|~]
remove_numeric: Removes any n-grams with at least one token containing one or more numerals (0-9).
JK_filtering: Defaults to FALSE. If TRUE, then bigrams and trigrams will be extracted and filtered according to the tag patterns described in Justeson, John S., and Slava M. Katz. "Technical terminology: some linguistic properties and an algorithm for identification in text." Natural Language Engineering 1.01 (1995): 9-27. Available: https://brenocon.com/JustesonKatz1995.pdf. The POS tag patterns used are: AN, NN, AAN, ANN, NAN, NNN, NPN. See the additional example at the end of this page.
verb_filtering: Defaults to FALSE. If TRUE, then short verb phrases will be extracted in a manner similar to that described in JK_filtering above. The POS tag patterns used are: VN, VAN, VNN, VPN, ANV, VDN.
phrase_extraction: Defaults to FALSE. If TRUE, then full phrases of arbitrary length will be extracted following the procedure described in Denny, O'Connor, and Wallach (2016). This method will produce the most phrases, of the highest quality, but will take significantly longer than other methods. Not currently implemented.
return_tag_patterns: Defaults to FALSE. If TRUE and either JK_filtering = TRUE, verb_filtering = TRUE, or phrase_extraction = TRUE, then the tag pattern matched in forming the n-gram/phrase will be returned as an accompanying vector.
lemmatize: If TRUE, then n-grams are constructed out of lemmatized tokens. Defaults to FALSE.
lowercase: If TRUE, all n-grams are lowercased before being returned. Defaults to FALSE.
parallel: Logical: should documents be processed in parallel? Defaults to FALSE.
cores: Number of cores to be used if parallel = TRUE. Defaults to 2.
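As an illustration of the directory-based workflow described above, a call might look like the following sketch (both directory paths are hypothetical placeholders):

# Extract n-grams from CoreNLP_Output_x.Rdata files saved by corenlp_blocked(),
# processing blocks in parallel; extractions for each block are saved to
# output_directory as described above.
ngrams(tokenized_documents_directory = "~/corenlp_output",
       output_directory = "~/ngram_output",
       ngram_lengths = c(1, 2, 3),
       remove_punctuation = TRUE,
       remove_numeric = TRUE,
       parallel = TRUE,
       cores = 4)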
Returns a list of lists (one list per document) with entries for n-grams of each size specified in the ngram_lengths argument. May also return metadata if return_tag_patterns = TRUE.
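For example, the structure of the returned object can be inspected with str() after running ngrams() on the bundled Processed_Text data used in the example below; the exact entry names may vary, so treat this as a sketch:

# One list element per input document, with entries for each requested n-gram length.
data("Processed_Text")
res <- ngrams(tokenized_documents = Processed_Text,
              ngram_lengths = c(1, 2))
length(res)                   # number of documents
str(res[[1]], max.level = 1)  # entries for each n-gram length in the first document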
# NOT RUN {
data("Processed_Text")
NGrams <- ngrams(tokenized_documents = Processed_Text,
                 ngram_lengths = c(1, 2, 3),
                 remove_punctuation = TRUE,
                 remove_numeric = TRUE,
                 lowercase = TRUE,
                 parallel = FALSE,
                 cores = 1)
# }
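A sketch of Justeson and Katz filtering with tag patterns returned, as referenced in the JK_filtering and return_tag_patterns descriptions above (argument values are illustrative):

NGrams_JK <- ngrams(tokenized_documents = Processed_Text,
                    ngram_lengths = c(2, 3),
                    JK_filtering = TRUE,
                    return_tag_patterns = TRUE,
                    lowercase = TRUE)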