check_text - Uncleaned text may result in errors, warnings, and incorrect results in subsequent analysis. check_text checks text for potential problems and suggests possible fixes. Potential text anomalies that are detected include: factors, missing ending punctuation, empty cells, double punctuation, non-space after comma, no alphabetic characters, non-ASCII, missing value, and potentially misspelled words.
available_checks - Provide a data.frame view of all the available checks in the check_text function.
check_text(x, file = NULL, checks = NULL, n = 10, ...)
available_checks()
x - The text variable.
file - A connection, or a character string naming the file to print to. If NULL, prints to the console. Note that this is assigned as an attribute and passed to print.
checks - A vector of checks to include from which_are. If checks = NULL, all checks from which_are will be used. Note that all meta checks will be conducted (see which_are for details on meta checks).
n - The number of affected elements to print out (the rest are truncated).
... - Ignored.
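The arguments above can be tried out with a short call. This is only a sketch: it assumes the textclean package is installed, the sample vector x is made up for illustration, and the two check names passed to checks are taken from the list of reported faults.

```r
# Made-up sample text containing several of the problems check_text looks for
x <- c("i like", "I am ! that|", "", NA, "I like,to eat", "3:33", "test...")

# Guarded so the script still runs when textclean is not installed (assumption)
if (requireNamespace("textclean", quietly = TRUE)) {
  # data.frame view of every available check
  print(textclean::available_checks())

  # Run only two checks and print at most 5 affected elements per check;
  # file = NULL (the default) prints the report to the console
  print(textclean::check_text(x, checks = c("no_endmark", "empty"), n = 5))
}
```

Leaving checks = NULL runs every available check, which may be slower on long text vectors.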
Returns a list reporting the following potential text faults:
contraction- Text elements that contain contractions
date- Text elements that contain dates
digit- Text elements that contain digits/numbers
email- Text elements that contain email addresses
emoticon- Text elements that contain emoticons
empty- Text elements that contain empty text cells (all white space)
escaped- Text elements that contain escaped (backslashed) characters
hash- Text elements that contain Twitter style hash tags (e.g., #rstats)
html- Text elements that contain HTML markup
incomplete- Text elements that contain incomplete sentences (e.g., uses ending punctuation like ...)
kern- Text elements that contain kerning (e.g., 'The B O M B!')
list_column- Text variable that is a list column
missing_value- Text elements that contain missing values
misspelled- Text elements that contain potentially misspelled words
no_alpha- Text elements that contain no alphabetic (a-z) letters
no_endmark- Text elements that are missing ending punctuation
no_space_after_comma- Text elements that contain commas with no space afterwards
non_ascii- Text elements that contain non-ASCII text
non_character- Text variable that is not a character column (likely factor)
non_split_sentence- Text elements that contain unsplit sentences (more than one sentence per element)
tag- Text elements that contain Twitter style handle tags (e.g., @trinker)
time- Text elements that contain timestamps
url- Text elements that contain URLs
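
To make a few of the faults listed above concrete, the following base-R sketch approximates what empty, no_endmark, no_space_after_comma, and missing_value look for. The regular expressions are illustrative assumptions, not the patterns check_text actually uses, and the sample vector is made up.

```r
# Illustrative only: rough base-R approximations of a few checks,
# not the implementation used by check_text.
x <- c("I like,to eat", "No end mark", "   ", "All good here.", NA)

# missing_value: NA elements
missing_value <- is.na(x)

# empty: non-missing elements that are all white space
is_empty <- !missing_value & grepl("^\\s*$", x)

# no_endmark: non-empty elements that do not end in ., ?, or !
no_endmark <- !missing_value & !is_empty & !grepl("[.?!]\\s*$", x)

# no_space_after_comma: a comma immediately followed by a non-space character
no_space_after_comma <- !missing_value & grepl(",\\S", x)

# Collect the positions of affected elements, one entry per fault
faults <- list(
  empty = which(is_empty),
  no_endmark = which(no_endmark),
  no_space_after_comma = which(no_space_after_comma),
  missing_value = which(missing_value)
)
print(faults)
```

Each entry holds the indices of the affected elements, which is a simplified stand-in for the list that check_text returns.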