- directory
The directory in which to search for PDF files.
- keyword
The keyword(s) to search for in the text. Multiple
keywords can be supplied as a character vector.
- surround_lines
numeric/FALSE indicating whether the output should
include the lines of text surrounding the matching line.
Default is FALSE; otherwise, supply a number giving how many
additional surrounding lines to extract.
- ignore_case
TRUE/FALSE/vector of TRUE/FALSE, indicating whether the
case of the keyword should be ignored.
Default is FALSE, meaning the keyword is matched case-sensitively.
If a vector, it must be the same length as the keyword vector.
- token_results
TRUE/FALSE indicating whether the returned result text
should be split into tokens. See the tokenizers package and
convert_tokens for more details. Defaults to TRUE.
- split_pdf
TRUE/FALSE indicating whether to split the PDF using white
space. This is most useful with multicolumn PDF files.
The split_pdf function attempts to recreate the column layout of the text
as a single column, starting with the left column and proceeding to the
right.
- remove_hyphen
TRUE/FALSE indicating whether words hyphenated across a
line break should be rejoined on a single line. Default is TRUE.
- convert_sentence
TRUE/FALSE indicating whether individual lines of the PDF file
should be collapsed into a single large paragraph before keyword
searching. Default is TRUE.
- remove_equations
TRUE/FALSE indicating whether equations should be removed.
The default behavior is to search for an opening parenthesis,
followed by at least one digit, followed by a closing parenthesis at
the end of the text line. This will not detect other patterns, nor will
it detect the entire equation if the equation spans multiple rows.
- split_pattern
Regular expression pattern used to split multicolumn
PDF files using stringi::stri_split_regex. The default pattern
splits on three or more consecutive white space characters.
- full_names
TRUE/FALSE indicating whether the full file path should be used.
Default is TRUE; see list.files for more details.
- file_pattern
An optional regular expression to select specific file
names. Only files that match the regular expression will be searched.
Defaults to all PDFs, i.e. ".pdf". See list.files for more details.
- recursive
TRUE/FALSE indicating whether subdirectories should be searched
as well. Default is FALSE; see list.files for more details.
- max_search
An optional numeric value giving the maximum number
of PDFs to search. Only the first n files are searched.
- ...
token_function to pass to the convert_tokens function.
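
The file-selection and splitting arguments above can be illustrated with a short R sketch. It assumes these arguments belong to keyword_directory from the pdfsearch package (the list itself does not name the function). The demo directory, the placeholder file names, and the equation regular expression are hypothetical stand-ins, not the package's internals, and base R's strsplit is swapped in for stringi::stri_split_regex to keep the sketch free of dependencies.

```r
# --- File selection: what file_pattern, full_names, recursive, max_search do ---
# Hypothetical demo directory with empty placeholder files; only names matter.
demo_dir <- file.path(tempdir(), "pdf_demo")
dir.create(file.path(demo_dir, "sub"), recursive = TRUE, showWarnings = FALSE)
invisible(file.create(file.path(demo_dir, c("a.pdf", "b.pdf", "notes.txt"))))
invisible(file.create(file.path(demo_dir, "sub", "c.pdf")))

# file_pattern = ".pdf", full_names = FALSE, recursive = FALSE
top_level <- list.files(demo_dir, pattern = ".pdf",
                        full.names = FALSE, recursive = FALSE)

# recursive = TRUE also descends into subdirectories
all_levels <- list.files(demo_dir, pattern = ".pdf",
                         full.names = FALSE, recursive = TRUE)

# max_search = 2: keep only the first two files found
first_two <- head(all_levels, 2)

# --- split_pattern: three or more consecutive white space characters ---
two_column_line <- "left column text     right column text"
columns <- strsplit(two_column_line, "\\s{3,}", perl = TRUE)[[1]]

# --- remove_equations: an approximation of the described default rule ---
# (opening parenthesis, one or more digits, closing parenthesis at line end)
text_lines <- c("y = a + b x   (12)", "See Table (2) for details")
is_equation <- grepl("\\([0-9]+\\)$", text_lines)

# --- Assumed call, run only if pdfsearch is installed ---
if (requireNamespace("pdfsearch", quietly = TRUE)) {
  # Directory of example PDFs assumed to ship with the package
  directory <- system.file("pdf", package = "pdfsearch")
  results <- pdfsearch::keyword_directory(
    directory,
    keyword = c("repeated measures", "measurement error"),
    surround_lines = 1,   # also extract one line of surrounding text
    ignore_case = TRUE,   # match keywords regardless of case
    full_names = TRUE,    # full file paths
    recursive = FALSE     # do not search subdirectories
  )
  print(head(results))
}
```

Because file_pattern is a regular expression, ".pdf" matches any character followed by "pdf". A stricter pattern such as "\\.pdf$" may be preferable when other file names in the directory contain "pdf".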