Modifies a data file line by line, i.e. reads a file line by line, converts each line, then writes it to the modified file. This method is especially useful when modifying large datasets, where reading an entire file at once may be time-consuming and require a large amount of memory.
lineByLine(infile, outfile, linefunc = identity, choose.lines = NULL,
choose.columns = NULL, col.sep = " ", ask = TRUE,
blank.lines.skip = TRUE, verbose = TRUE, ...)
lineByLine returns the number of lines read, invisibly. The main purpose of the function is to produce the modified file.
A character string giving the name and path of the file to be modified.
A character string giving the name of the modified file. The name of the file is relative to the current working directory, unless the file name contains a definite path.
lineByLine modifies each line using linefunc. The default is the identity function. Users may define their own line-modifying functions; see Details for a thorough description.
A numeric vector of lines to be selected or dropped from infile. Positive values refer to lines to be chosen, whereas negative values refer to lines to be skipped. The vector cannot include both positive and negative values at the same time. If NULL (default), all lines are selected.
A numeric vector of columns to be selected (positive values) or skipped (negative values) from infile. The vector cannot include both positive and negative values at the same time. By default, all columns are selected in their original order. Otherwise, the selected columns are written to the modified file in the order in which they are listed, so listing columns out of order reorders them and listing a column more than once duplicates it.
Specifies the separator that splits the columns in infile. By default, col.sep = " " (space). To split at all types of spaces or blank characters, set col.sep = "[[:space:]]" or col.sep = "[[:blank:]]".
Logical. Default is TRUE. If set to FALSE, an already existing outfile will be overwritten without asking.
Logical. If TRUE (default), lineByLine ignores blank lines in the input.
Logical. Default is TRUE, which means that the line number is displayed for each iteration, in addition to output from linefunc. If choose.columns contains invalid column numbers, this will also be displayed.
Further arguments to be passed to linefunc.
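The selection rules for choose.columns and the effect of col.sep can be illustrated with plain base R indexing and strsplit. This is only a sketch of the semantics described above; that lineByLine applies them internally in exactly this way is an assumption, and the example line is made up.

```r
## A made-up input line mixing a tab and a space as separators
line <- "fam1\tid1 0"

## Splitting at a single space leaves the tab inside the first field
strsplit(line, split = " ")[[1]]

## Splitting at any whitespace character separates all three fields
x <- strsplit(line, split = "[[:space:]]")[[1]]

## Positive indices select columns in the order listed,
## so this both reorders and duplicates
reordered <- x[c(3, 1, 1)]

## Negative indices drop columns and keep the remaining order
dropped <- x[-c(1, 2)]

## Mixing positive and negative indices is an error in base R indexing
mixed <- tryCatch(x[c(1, -2)], error = function(e) e)
inherits(mixed, "error")
```
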
Miriam Gjerdevik,
with Hakon K. Gjessing
Professor of Biostatistics
Division of Epidemiology
Norwegian Institute of Public Health
When reading large data files, functions such as read.table can use a large amount of memory and be extremely time consuming. Instead of reading the entire file at once, lineByLine reads one line at a time, modifies the line using linefunc, and then writes the line to outfile.
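The read, modify, write cycle described here can be sketched in base R using connections. The helper processLines below is hypothetical and is not Haplin's implementation; it omits line and column selection, blank-line handling and the overwrite check, and it always joins fields with a single space.

```r
## Hypothetical helper, not part of Haplin: a bare-bones line-by-line loop
processLines <- function(infile, outfile, linefunc = identity, col.sep = " ") {
  incon <- file(infile, open = "r")
  on.exit(close(incon), add = TRUE)
  outcon <- file(outfile, open = "w")
  on.exit(close(outcon), add = TRUE)

  n <- 0
  repeat {
    ## Only one line is held in memory at a time
    line <- readLines(incon, n = 1)
    if (length(line) == 0) break  # end of file reached
    n <- n + 1
    x <- strsplit(line, split = col.sep)[[1]]
    ## Apply the line function and write the result immediately
    writeLines(paste(linefunc(x), collapse = " "), con = outcon)
  }
  invisible(n)
}

## Usage on a small temporary file: reverse the fields of each line
infile <- tempfile()
outfile <- tempfile()
writeLines(c("a b c", "d e f"), infile)
n <- processLines(infile, outfile, linefunc = rev)
result <- readLines(outfile)
```

Because each line is written as soon as it has been converted, memory use stays roughly constant regardless of the number of lines in the file.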
Users may specify their own line-converting function. This function must take the argument x, a character vector representing a single line of the file, split at spaces; additional arguments may be included. If verbose equals TRUE, output should be displayed. The function must return the modified vector.
The framework of the line-modifying function may look something like this:
lineModify <- function(x){
  .xnew <- x
  ## Define any modifications, for instance recoding missing values in a dataset from NA to 0:
  .xnew[is.na(.xnew)] <- 0
  ## Just to monitor progress, display, for instance, 10 first elements, without newline:
  cat(paste(.xnew[1:min(10, length(.xnew))], collapse = " "))
  ## Return converted vector
  return(.xnew)
}
See Haplin:::lineConvert for an additional example of a line-modifying function.
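As a further illustration, a line function may take extra arguments, which would be supplied through the ... argument of lineByLine. The function recodeMissing and its arguments missing.code and replacement are hypothetical names introduced here, and the file names are placeholders. Since x is a character vector, the sketch also matches the literal missing-value string rather than relying on is.na alone.

```r
## Hypothetical line function: recode a missing-value code within one line.
## x is a character vector holding the fields of a single line.
recodeMissing <- function(x, missing.code = "NA", replacement = "0") {
  xnew <- x
  ## Catch both true NA values and the literal missing-value string
  xnew[is.na(xnew) | xnew == missing.code] <- replacement
  xnew
}

## The function can be checked on its own before use on a large file
res1 <- recodeMissing(c("1", "NA", "2"))
res2 <- recodeMissing(c("1", "-9", "2"), missing.code = "-9")

if (FALSE) {
  ## Not run: requires Haplin; missing.code is passed on to linefunc via ...
  lineByLine(infile = "myfile.txt", outfile = "myfile_recoded.txt",
             linefunc = recodeMissing, missing.code = "-9")
}
```
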
Web Site: https://haplin.bitbucket.io
convertPed
if (FALSE) {
## Extract the first ten columns from "myfile.txt",
## without reordering
lineByLine(infile = "myfile.txt", outfile = "myfile_modified.txt",
choose.columns = c(1:10))
}