The text of the pdf file. This can be supplied directly,
or the pdftools package can be used to read the text from a file path.
To use pdftools, the path argument must be set to TRUE.
path
An optional TRUE/FALSE flag indicating whether a file path to the
pdf is supplied in place of the text. When TRUE, the pdftools package is
used to convert the pdf to text.
split_pdf
TRUE/FALSE indicating whether to split the pdf using white
space. This is most useful with multicolumn pdf files.
When TRUE, the column layout is rebuilt as a single column of text,
starting with the left column and proceeding to the right.
remove_hyphen
TRUE/FALSE indicating whether words hyphenated across a line
break should be rejoined on a single line. Default is TRUE.
token_function
A tokenizing function from the tokenizers package. Default
is the tokenize_words function.
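The two text clean-up options above, split_pdf and remove_hyphen, can be pictured with a rough base R sketch. This is not the package's internal code: the helper names split_columns and join_hyphens, and the rule that three or more consecutive spaces mark a column gap, are assumptions made only for illustration.

```r
# Illustrative sketch only; NOT the package's implementation.

# Hypothetical helper: rebuild a two-column page as a single column.
# Assumption: columns are separated by a run of three or more spaces.
split_columns <- function(lines) {
  has_gap <- grepl(" {3,}", lines)
  # Text before the first wide gap is treated as the left column.
  left <- sub(" {3,}.*$", "", lines)
  # Text after the first wide gap is treated as the right column.
  right <- ifelse(has_gap, sub("^.*? {3,}", "", lines, perl = TRUE), "")
  out <- trimws(c(left, right))
  out[nzchar(out)]
}

# Hypothetical helper: rejoin words hyphenated across a line break.
join_hyphens <- function(lines) {
  text <- paste(lines, collapse = "\n")
  # Drop a hyphen that sits directly before a line break.
  text <- gsub("-\n", "", text, fixed = TRUE)
  strsplit(text, "\n", fixed = TRUE)[[1]]
}

page <- c("Left column line one     Right column line one",
          "left line two            right line two")
single_column <- split_columns(page)

joined <- join_hyphens(c("tokeni-", "zation of text"))
```

A rule this simple also joins genuine compound words that happen to break at a hyphen, so it only shows the idea behind the two options.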
Value
A list of character vectors containing the tokens. More detail can
be found in the documentation of the tokenizers package.
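As a minimal sketch of the shape of this value, the default tokenizer can be called directly on a character vector. This assumes the tokenizers package is installed; the two example strings are made up and stand in for pages of pdf text.

```r
# Minimal sketch, assuming the tokenizers package is installed.
library(tokenizers)

# Made-up strings standing in for two pages of pdf text.
pages <- c("The first page of text.", "A second, shorter page.")

# tokenize_words() returns a list with one character vector per input element.
tokens <- tokenize_words(pages)
str(tokens)
```

Other tokenizers from the same package, such as tokenize_sentences, return the same list structure with different units as the tokens.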