- pdfs
String. A list of paths to the PDF files to be analyzed.
- out
String. Directory in which the analysis results are saved. Default: ".".
- filter.words
List of strings. The list of filter words. If not NA or "", a hit is counted
every time a word from the list is detected in the article. Default: "".
- regex.fw
Logical. If TRUE, filter words follow the regex rules
(see https://github.com/erikstricker/PDE/blob/master/inst/examples/cheatsheets/regex.pdf).
Default: TRUE.
- ignore.case.fw
Logical. Are the filter words case-sensitive (does
capitalization matter)? Default: FALSE.
- filter.word.times
Numeric or string. The minimum number of hits for filter.words required for a
paper to be further analyzed. Can be expressed either as an absolute number or
as a percentage of the total number of words (by adding the "%" sign).
Default: "0.2%".
- table.heading.words
List of strings. Table headings to be detected in addition to the standard
ones (TABLE, TAB or table plus number). Regex rules apply (see also
https://github.com/erikstricker/PDE/blob/master/inst/examples/cheatsheets/regex.pdf).
Default: "".
- ignore.case.th
Logical. Are the additional table headings (see table.heading.words)
case-sensitive (does capitalization matter)? Default: FALSE.
- search.words
List of strings. List of search words. To extract all tables from the PDF
file, leave search.words = "".
- search.word.categories
List of strings. List of categories with the same length as the list of
search words. Each search word can thereby be assigned to a category, whose
word counts are summarized in the PDE_analyzer_word_stats.csv file. If
search.word.categories has a different length than search.words, the
parameter is ignored. Default: NULL.
- save.tab.by.category
Logical. Can only be used with search.word.categories.
If set to TRUE, tables that carry search words will be saved in sub-folders
according to the search word category of the detected search word.
Default: FALSE.
- regex.sw
Logical. If TRUE, search words follow the regex rules
(see https://github.com/erikstricker/PDE/blob/master/inst/examples/cheatsheets/regex.pdf).
Default: TRUE.
- ignore.case.sw
Logical. Are the search words case-sensitive (does
capitalization matter)? Default: FALSE.
- eval.abbrevs
Logical. Should abbreviations for the search words be
automatically detected and then replaced with the search word + "$*"?
Default: TRUE.
- out.table.format
String. Output file format: either a comma-separated file (.csv) or a
tab-separated file (.tsv). The encoding indicated in parentheses should be
selected according to the operating system in which the exported tables are
opened, i.e., Windows: "(WINDOWS-1252)"; Mac: "(macintosh)"; Linux: "(UTF-8)".
Default: ".csv" with the encoding depending on the operating system.
- dev_x
Numeric. For a table, the size of the indentation that is still considered
part of the same column. Default: 20.
- dev_y
Numeric. For a table, the vertical distance that is still considered part of
the same row. Can be either a number or set to dynamic detection (9999), in
which case the font size is used to detect which words are in the same row.
Default: 9999.
- write.table.locations
Logical. If TRUE, a separate file is generated that lists the headings of all
tables, their relative location in the generated html and txt files, and
whether search words were found. Default: FALSE.
- exp.nondetc.tabs
Logical. If TRUE and a table was detected in a PDF file but is an image or
cannot be read, the page with the table will be exported as a png.
Default: TRUE.
- write.tab.doc.file
Logical. If TRUE, search words are used for table detection, and no search
words were found in the tables of a PDF file, a no.table.w.search.words file
is written. Default: TRUE.
- delete
Logical. If TRUE, the intermediate txt, keeplayouttxt and html copies of the
PDF file will be deleted. Default: TRUE.
- cpy_mv
String. Either "nocpymv", "cpy", or "mv". If filter words are used in the
analyses, the processed PDF files will either be copied ("cpy") or moved ("mv") into the
/pdf/ subfolder of the output folder. Default: "nocpymv".
- verbose
Logical. Indicates whether messages will be printed in the console. Default: TRUE.
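
Two of the rules above (how filter.word.times is read, and when search.word.categories is ignored) can be illustrated with a minimal base R sketch. The helper names passes_filter and effective_categories are hypothetical and not part of the package; the inclusive comparison (hits >= threshold) is an assumption, and the package's actual implementation may differ.

```r
# Hypothetical helper (not part of the documented package): decide whether a
# paper passes the filter, given its filter-word hits and total word count.
passes_filter <- function(hits, total.words, filter.word.times = "0.2%") {
  value <- as.character(filter.word.times)
  if (grepl("%$", value)) {
    # Percentage form, e.g. "0.2%": threshold is relative to the word count.
    pct <- as.numeric(sub("%$", "", value))
    threshold <- total.words * pct / 100
  } else {
    # Absolute form, e.g. 5 or "5".
    threshold <- as.numeric(value)
  }
  # Assumption: the threshold is inclusive.
  hits >= threshold
}

# Hypothetical helper: mirrors the documented rule that categories are
# ignored when their length differs from that of the search words.
effective_categories <- function(search.words, search.word.categories = NULL) {
  if (is.null(search.word.categories) ||
      length(search.word.categories) != length(search.words)) {
    return(NULL)
  }
  search.word.categories
}

# 0.2% of 5000 words is 10 hits, so 12 hits pass the default filter.
passes_filter(hits = 12, total.words = 5000)
# With an absolute threshold of 5, 3 hits do not pass.
passes_filter(hits = 3, total.words = 5000, filter.word.times = 5)
# Two search words but one category: the categories are dropped.
effective_categories(c("HIV", "cancer"), c("virus"))
```

Under these assumptions, the first call evaluates to TRUE, the second to FALSE, and the third to NULL.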