text_file_parser: text file parser

Description

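Extracts subsets of text, delimited by a start query and an end query, from a text file or from a character vector of lines.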

Usage

text_file_parser(input_path_file = NULL, output_path_file = "",
  start_query = NULL, end_query = NULL, min_lines = 1,
  trimmed_line = FALSE, verbose = FALSE)

Arguments

input_path_file

either a path to an input file or a vector of character strings (normally the latter would represent the ordered lines of a text file as a character vector)

output_path_file

either an empty character string ("") or a character string specifying a path to an output file (applies only if the input_path_file parameter is a valid path to a file)

start_query

a character string or a vector of character strings. If start_query is a single character string, it is the first word of the subset of the data and should appear frequently at the beginning of a line in the text file.

end_query

a character string or a vector of character strings. If end_query is a single character string, it is the last word of the subset of the data and should appear frequently at the end of a line in the text file.

min_lines

a numeric value specifying the minimum number of lines (applies only if input_path_file is a valid path to a file). For instance, if min_lines = 2, then only subsets of text with more than 1 line will be pre-processed.

trimmed_line

either TRUE or FALSE. Indicates whether the lines are already trimmed. If FALSE, each line of the text file will be trimmed on both sides before the start_query and end_query are applied.

verbose

either TRUE or FALSE. If TRUE, information will be printed to the console.

Details

The text file should have a structure (such as an xml-structure), so that subsets can be extracted using the start_query and end_query parameters. The same applies when the input is a vector of character strings.
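
To make the idea of a start/end structure concrete, here is a minimal base-R sketch of delimiter-based extraction. It is not the implementation used by text_file_parser. The helper extract_subsets, the sample lines and the `<doc>` tags are hypothetical, and counting the marker lines toward min_lines is an assumption made only for this illustration.

```r
# Hypothetical input: ordered lines of an xml-like text file.
lines <- c("<doc>", "  first document, line 1", "  first document, line 2", "</doc>",
           "unrelated line",
           "<doc>", "  second document", "</doc>")

# Hypothetical helper (NOT part of textTinyR): collect every block that runs
# from a line starting with `start` to the next line ending with `end`.
extract_subsets <- function(lines, start, end, min_lines = 1) {
  trimmed <- trimws(lines, which = "both")      # trim both sides of each line
  starts <- which(startsWith(trimmed, start))   # lines that open a subset
  ends <- which(endsWith(trimmed, end))         # lines that close a subset
  subsets <- list()
  for (s in starts) {
    e <- ends[ends >= s][1]                     # first end marker at or after s
    if (is.na(e)) next                          # unmatched start marker: skip
    block <- trimmed[s:e]
    # assumption: marker lines count toward min_lines
    if (length(block) >= min_lines) subsets[[length(subsets) + 1]] <- block
  }
  subsets
}

subsets <- extract_subsets(lines, start = "<doc>", end = "</doc>")
```

In this sketch the line "unrelated line" falls outside every start/end pair and is dropped, which is why the input needs a regular structure for the queries to pick out meaningful subsets.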

Examples


# NOT RUN {
library(textTinyR)

# In case that the 'input_path_file' is a valid path
#---------------------------------------------------
 
# fp = text_file_parser(input_path_file = '/folder/input_data.txt',
#                       output_path_file = '/folder/output_data.txt',
#                       start_query = 'word_a', end_query = 'word_w',
#                       min_lines = 1, trimmed_line = FALSE)
                     
                     
# In case that the 'input_path_file' is a character vector of strings
#--------------------------------------------------------------------

#  PATH_url = "https://FILE.xml"
#  con = url(PATH_url, method = "libcurl")
#  tmp_dat = read.delim(con, quote = "\"", comment.char = "", stringsAsFactors = FALSE)
#  vec_docs = unlist(lapply(1:length(as.vector(tmp_dat[, 1])), function(x)
#                    trimws(tmp_dat[x, 1], which = "both")))
#  parse_data = text_file_parser(input_path_file = vec_docs,
#                                start_query = c("<query1>", "<query2>", "<query3>"),
#                                end_query = c("</query1>", "</query2>", "</query3>"),
#                                min_lines = 1, trimmed_line = TRUE)
# }
