Runs Stanford CoreNLP on a collection of documents
corenlp(documents = NULL, document_directory = NULL, file_list = NULL,
    delete_intermediate_files = TRUE, syntactic_parsing = FALSE,
    coreference_resolution = FALSE, additional_options = "",
    return_raw_output = FALSE, version = "3.5.2", block = 1)
An optional list of character vectors or a vector of strings, with one entry per document. These documents will be run through CoreNLP.
An optional path to a directory containing only .txt files (one per document) to be run through CoreNLP. Cannot be supplied in addition to the 'documents' argument.
An optional list of .txt files to be used if the document_directory option is specified. This can be useful if the user only wants to process a subset of the documents in the directory, such as when the corpus is extremely large.
Logical indicating whether intermediate files produced by CoreNLP should be deleted. Defaults to TRUE; if set to FALSE, the XML output of CoreNLP will be saved.
Logical indicating whether syntactic parsing should be included as an option. Defaults to FALSE. Caution: enabling this option may greatly increase runtime. If TRUE, output will automatically be returned in raw format.
Logical indicating whether coreference resolution should be included as an option. Defaults to FALSE. Caution: enabling this option may greatly increase runtime. If TRUE, output will automatically be returned in raw format.
An optional string specifying additional options for CoreNLP. May cause unexpected behavior, use at your own risk!
Defaults to FALSE. If TRUE, CoreNLP output is not parsed and raw list objects are returned.
The version of CoreNLP to download. Defaults to '3.5.2'. Newer versions of CoreNLP will be made available at a later date.
An internal file list identifier used by corenlp_blocked() to avoid collisions. Should not be set by the user.
Returns a list of data.frame objects, one per document, where each row is a token observation (in order).
# NOT RUN {
directory <- system.file("extdata", package = "SpeedReader")[1]
Tokenized <- corenlp(
    document_directory = directory,
    syntactic_parsing = FALSE,
    coreference_resolution = FALSE)
# }
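The file_list argument can be combined with document_directory to process only some of the files in a large corpus. A minimal sketch, assuming the directory contains .txt files; the number of files selected (100) and the object names are illustrative, not part of the package API:

```r
# NOT RUN {
# Build a subset of .txt file names from the document directory
# (here, hypothetically, the first 100 files in alphabetical order).
directory <- system.file("extdata", package = "SpeedReader")[1]
txt_files <- list.files(directory, pattern = "\\.txt$")
subset_files <- head(txt_files, 100)

# Run CoreNLP only on that subset of documents.
Tokenized_subset <- corenlp(
    document_directory = directory,
    file_list = subset_files)
# }
```

Because corenlp() downloads and calls the CoreNLP Java toolkit, the call itself is wrapped in a NOT RUN block, as in the example above.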