PDE_pdfs2txt_searchandfilter: Extracting sentences from a PDF (Portable Document Format) file

Description

PDE_pdfs2txt_searchandfilter extracts sentences from a single PDF file according to search and filter words and writes output in the corresponding folder.

Usage

PDE_pdfs2txt_searchandfilter(
  pdfs,
  out = ".",
  filter.words = "",
  regex.fw = TRUE,
  ignore.case.fw = FALSE,
  filter.word.times = "0.2%",
  search.words,
  search.word.categories = NULL,
  regex.sw = TRUE,
  ignore.case.sw = FALSE,
  eval.abbrevs = TRUE,
  out.table.format = ".csv (WINDOWS-1252)",
  context = 0,
  write.txt.doc.file = TRUE,
  delete = TRUE,
  cpy_mv = "nocpymv",
  verbose = TRUE
)

Arguments

pdfs: String. A list of paths to the PDF files to be analyzed.
out: String. Directory chosen to save analysis results in. Default: ".".
filter.words: List of strings. The list of filter words. If not NA or "", a hit will be counted every time a word from the list is detected in the article. Default: "".
regex.fw: Logical. If TRUE, filter words will follow the regex rules (see https://github.com/erikstricker/PDE/blob/master/inst/examples/cheatsheets/regex.pdf). Default: TRUE.
ignore.case.fw: Logical. If TRUE, capitalization is ignored when matching filter words; if FALSE, matching is case-sensitive. Default: FALSE.
filter.word.times: Numeric or string. The minimum number of hits for filter.words required for a paper to be further analyzed. Can be expressed either as an absolute number or as a percentage of the total number of words (by adding the "%" sign). Default: "0.2%".
search.words: List of strings. List of search words.
search.word.categories: List of strings. List of categories with the same length as the list of search words. Accordingly, each search word can be assigned to a category, of which the word counts will be summarized in the PDE_analyzer_word_stats.csv file. If search.word.categories is a different length than search.words the parameter will be ignored. Default: NULL.
regex.sw: Logical. If TRUE, search words will follow the regex rules (see https://github.com/erikstricker/PDE/blob/master/inst/examples/cheatsheets/regex.pdf). Default: TRUE.
ignore.case.sw: Logical. If TRUE, capitalization is ignored when matching search words; if FALSE, matching is case-sensitive. Default: FALSE.
eval.abbrevs: Logical. Should abbreviations for the search words be automatically detected and then replaced with the search word + "$*"? Default: TRUE.
out.table.format: String. Output file format. Either comma-separated file .csv or tab-separated file .tsv. The encoding indicated in parentheses should be selected according to the operating system in which the exported tables are opened, i.e., Windows: "(WINDOWS-1252)"; Mac: "(macintosh)"; Linux: "(UTF-8)". Default: ".csv" and encoding depending on the operating system.
context: Numeric. Number of sentences extracted before and after the sentence with the detected search word. If 0 only the sentence with the search word is extracted. Default: 0.
write.txt.doc.file: Logical. If TRUE, a documentation file is created whenever a PDF file yields no output. If no search words were found in the sentences of a PDF file, the file is named with the PDF filename followed by no.txt.w.search.words. If the PDF file is empty, it is named with the PDF filename followed by no.content.detected. If the filter word threshold is not met, it is named with the PDF filename followed by no.txt.w.filter.words. Default: TRUE.
delete: Logical. If TRUE, the intermediate txt, keeplayouttxt and html copies of the PDF file will be deleted. Default: TRUE.
cpy_mv: String. Either "nocpymv", "cpy", or "mv". If filter words are used in the analyses, the processed PDF files will either be copied ("cpy") or moved ("mv") into the /pdf/ subfolder of the output folder. Default: "nocpymv".
verbose: Logical. Indicates whether messages will be printed in the console. Default: TRUE.
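
The two forms accepted by filter.word.times (an absolute number or a percentage string) can be illustrated with a minimal base R sketch. The helper resolve_threshold and its argument total.words are hypothetical names introduced here for illustration; they are not part of PDE, and the package's internal handling (for example, any rounding) may differ.

```r
# Hypothetical helper (not part of PDE): turn a filter.word.times value
# into a minimum number of filter-word hits for one document.
resolve_threshold <- function(filter.word.times, total.words) {
  value <- as.character(filter.word.times)
  if (grepl("%$", value)) {
    # Percentage of the total word count, e.g. "0.2%" of 5000 words
    pct <- as.numeric(sub("%$", "", value))
    return(total.words * pct / 100)
  }
  # Otherwise treat the value as an absolute number of hits
  as.numeric(value)
}

resolve_threshold("0.2%", total.words = 5000)  # 10 hits required
resolve_threshold(3, total.words = 5000)       # 3 hits required
```

Under this reading, a percentage threshold scales with document length, whereas an absolute threshold is the same for every document.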

Examples

## Running a simple analysis with filter and search words to extract sentences
library(PDE)
if (PDE_check_Xpdf_install() == TRUE) {
  outputtables <- PDE_pdfs2txt_searchandfilter(
    # 'pdfs' is the full argument name given under Usage
    pdfs = paste0(system.file(package = "PDE"),
                  "/examples/Methotrexate/29973177_!.pdf"),
    out = paste0(system.file(package = "PDE"), "/examples/MTX_txt+-0/"),
    filter.words = strsplit("cohort;case-control;group;study population;study participants", ";")[[1]],
    regex.fw = FALSE,
    ignore.case.fw = TRUE,
    search.words = strsplit("(M|m)ethotrexate;(T|t)rexal;(R|r)heumatrex;(O|o)trexup", ";")[[1]],
    regex.sw = TRUE,
    ignore.case.sw = FALSE)
}
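
To see what the search words in the example above match, the following sketch applies one of them to a few made-up sentences with base R's grepl. It only illustrates the regex and case-sensitivity settings; the sentences are invented, and PDE's own matching logic is not reproduced here and may differ.

```r
# Invented sentences for illustration only
sentences <- c(
  "Methotrexate was administered weekly.",
  "Patients on methotrexate were monitored.",
  "METHOTREXATE dosing was not reported.",
  "No relevant drug was mentioned."
)

# Case-sensitive regex, analogous to regex.sw = TRUE, ignore.case.sw = FALSE
hits_regex <- grepl("(M|m)ethotrexate", sentences)
hits_regex   # TRUE TRUE FALSE FALSE

# Case-insensitive matching, analogous to ignore.case.sw = TRUE
hits_nocase <- grepl("methotrexate", sentences, ignore.case = TRUE)
hits_nocase  # TRUE TRUE TRUE FALSE
```

The pattern "(M|m)ethotrexate" covers only the two capitalizations listed, so the all-caps sentence is matched only when case is ignored.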

## Running an advanced analysis with filter and search words to
## extract sentences and obtain documentation files
if (PDE_check_Xpdf_install() == TRUE) {
  outputtables <- PDE_pdfs2txt_searchandfilter(
    # 'pdfs' is the full argument name given under Usage
    pdfs = paste0(system.file(package = "PDE"),
                  "/examples/Methotrexate/29973177_!.pdf"),
    out = paste0(system.file(package = "PDE"), "/examples/MTX_txt+-1/"),
    context = 1,
    filter.words = strsplit("cohort;case-control;group;study population;study participants", ";")[[1]],
    regex.fw = FALSE,
    ignore.case.fw = TRUE,
    filter.word.times = "0.2%",
    search.words = strsplit("(M|m)ethotrexate;(T|t)rexal;(R|r)heumatrex;(O|o)trexup", ";")[[1]],
    regex.sw = TRUE,
    ignore.case.sw = FALSE,
    eval.abbrevs = TRUE,
    out.table.format = ".csv (WINDOWS-1252)",
    write.txt.doc.file = TRUE,
    cpy_mv = "nocpymv",
    delete = TRUE)
}
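
The example above hard-codes the Windows encoding. A script meant to run on several systems could build the out.table.format string from the operating system instead, as in the sketch below. The helper pick_table_format is hypothetical and not part of PDE. Only ".csv (WINDOWS-1252)" appears verbatim in this page; the other strings are assumed to follow the same pattern, based on the encodings named in the out.table.format description.

```r
# Hypothetical helper (not part of PDE): build an out.table.format string
# from the operating system name reported by Sys.info().
pick_table_format <- function(sysname = Sys.info()[["sysname"]], ext = ".csv") {
  encoding <- switch(sysname,
    Windows = "WINDOWS-1252",
    Darwin  = "macintosh",  # macOS reports "Darwin"
    "UTF-8"                 # Linux and any other system
  )
  paste0(ext, " (", encoding, ")")
}

pick_table_format("Windows")  # ".csv (WINDOWS-1252)"
pick_table_format("Linux")    # ".csv (UTF-8)"
```

The result could then be passed as out.table.format = pick_table_format() in the call above.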
