SpeedReader (version 0.9.1)

clean_document_text: A function which cleans the raw text of a document provided either as a single string, a vector of strings, or a column of a data.frame.

Description

A function which cleans the raw text of a document provided either as a single string, a vector of strings, or a column of a data.frame.

Usage

clean_document_text(text, regex = "[^a-zA-Z\\s]")

Arguments

text

The raw text of a document the user wishes to clean. Can be supplied as either a single string, a vector of strings, or a column from a data.frame.

regex

A regular expression specifying the characters the user would like to EXCLUDE from the final text string. The function replaces every match with a space and then splits the resulting string on those spaces. Defaults to removing all characters that are not uppercase or lowercase letters or whitespace (as a regex, this is "[^a-zA-Z\s]"; in an R string literal the backslash must be escaped, giving "[^a-zA-Z\\s]").
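
The replace-then-split behaviour described above can be sketched in base R. This is only an illustration under stated assumptions, not the package's source: clean_sketch is a hypothetical name, and the real function may take further steps (for example case normalisation) that this page does not describe.

```r
# Hypothetical helper, NOT the SpeedReader implementation: it only mimics
# the documented replace-with-spaces-then-split behaviour.
clean_sketch <- function(text, regex = "[^a-zA-Z\\s]") {
  # Assumption: a vector of strings is treated as one document
  text <- paste(text, collapse = " ")
  # Replace every excluded character with a space
  # (perl = TRUE so that \s is recognised inside the character class)
  text <- gsub(regex, " ", text, perl = TRUE)
  # Split on runs of whitespace
  tokens <- strsplit(text, "\\s+", perl = TRUE)[[1]]
  # Drop any empty strings left by leading whitespace
  tokens[nzchar(tokens)]
}

tokens <- clean_sketch(c("Hello, world!", "It's 2 p.m."))
print(tokens)
# [1] "Hello" "world" "It"    "s"     "p"     "m"
```

Note that punctuation inside a word splits it: with the default pattern, "It's" becomes the two terms "It" and "s". Supplying a different regex, such as "[^a-zA-Z'\\s]", would keep apostrophes in this sketch.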

Value

A document-term vector: a vector of the terms left after cleaning, in the order in which they appear in the text.