cleanNLP (version 2.3.0)

cleanNLP-package: cleanNLP: A Tidy Data Model for Natural Language Processing

Description

Provides a set of fast tools for converting a textual corpus into a set of normalized tables. The underlying natural language processing pipeline utilizes either the Python module spaCy or the Java-based Stanford CoreNLP library. The Python option is faster and generally easier to install; the Java option has additional annotators that are not available in spaCy.

Details

Once the package is set up, run one of cnlp_init_tokenizers, cnlp_init_spacy, or cnlp_init_corenlp to load the desired NLP backend. With the backend initialized, use cnlp_annotate to run the annotation engine over a corpus of text. Functions are then available to extract data tables from the annotation object: cnlp_get_token, cnlp_get_dependency, cnlp_get_document, cnlp_get_coreference, cnlp_get_entity, cnlp_get_sentence, and cnlp_get_vector; see their documentation for further details. The package vignettes provide more detailed set-up information.

Annotations that have previously been saved to disk can be pulled back into R using cnlp_read_csv. This requires neither Java nor Python, nor an initialized annotation pipeline.
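
A minimal sketch of this round trip is shown below. The directory path is hypothetical, and cnlp_write_csv is assumed here as the saving counterpart to cnlp_read_csv; confirm both in your installed version's help pages.

# save a finished annotation as a set of CSV files
# (cnlp_write_csv is assumed; the path is a placeholder)
cnlp_write_csv(annotation, "path/to/saved/annotation")

# later, in a fresh session with no backend initialized
library(cleanNLP)
annotation <- cnlp_read_csv("path/to/saved/annotation")
token <- cnlp_get_token(annotation)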

Examples

library(cleanNLP)

# load the annotation engine (can also use the spaCy or CoreNLP backends
# via cnlp_init_spacy or cnlp_init_corenlp)
cnlp_init_tokenizers()

# annotate your text
annotation <- cnlp_annotate("path/to/corpus/directory")

# pull off data tables
token <- cnlp_get_token(annotation)
dependency <- cnlp_get_dependency(annotation)
document <- cnlp_get_document(annotation)
coreference <- cnlp_get_coreference(annotation)
entity <- cnlp_get_entity(annotation)
sentence <- cnlp_get_sentence(annotation)
vector <- cnlp_get_vector(annotation)
