lineByLine: Line-by-line modification of files

Description

Modifies a data file line by line: reads the file one line at a time, converts each line, and writes the result to the modified file. This approach is especially useful for large datasets, where reading the entire file at once may be time-consuming and require a large amount of memory.

Usage

lineByLine(infile, outfile, linefunc = identity, choose.lines = NULL,
choose.columns = NULL, col.sep = " ", ask = TRUE, 
blank.lines.skip = TRUE, verbose = TRUE, ...)

Value

lineByLine invisibly returns the number of lines read. Its main product is the modified file written to outfile.

Arguments

infile: A character string giving the name and path of the file to be modified.
outfile: A character string giving the name of the modified file. The name of the file is relative to the current working directory, unless the file name contains a definite path.
linefunc: The function lineByLine uses to modify each line. The default is the identity function. Users may define their own line-modifying functions; see Details for a thorough description.
choose.lines: A numeric vector of lines to be selected or dropped from infile. Positive values refer to lines to be chosen, whereas negative values refer to lines to be skipped. The vector cannot include both positive and negative values at the same time. If "NULL" (default), all lines are selected.
choose.columns: A numeric vector of columns to be selected (positive values) or skipped (negative values) from infile. The vector cannot include both positive and negative values at the same time. By default, all columns are selected without reordering among the columns. Duplication and reordering among the selected columns will occur in the modified file corresponding to the order in which the columns are listed.
col.sep: Specifies the separator that splits the columns in infile. By default, col.sep = " " (space). To split at all types of spaces or blank characters, set col.sep = "[[:space:]]" or col.sep = "[[:blank:]]".
ask: Logical. Default is "TRUE". If set to "FALSE", an already existing outfile will be overwritten without asking.
blank.lines.skip: Logical. If "TRUE" (default), lineByLine ignores blank lines in the input.
verbose: Logical. Default is "TRUE", which means that the line number is displayed for each iteration, in addition to output from linefunc. If choose.columns contains invalid column numbers, this will also be displayed.
...: Further arguments to be passed to linefunc.
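
The selection rules for choose.columns and the splitting rule for col.sep can be illustrated with ordinary R vector indexing and strsplit. This is only a sketch under the assumption that lineByLine behaves like base R in these respects; the variable names are made up for illustration and nothing here calls Haplin.

```r
## Illustration only: base-R indexing on one split line,
## assumed (not confirmed) to mirror choose.columns and col.sep.
line <- "id1 0 1 2 NA"
x <- strsplit(line, split = " ")[[1]]

## Positive values select columns; the listed order reorders and duplicates
keep.reordered <- x[c(3, 1, 1)]

## Negative values skip columns and keep the rest in their original order
dropped <- x[-c(1, 2)]

## A character class splits at any whitespace, here a tab and a space
y <- strsplit("id1\t0 1", split = "[[:space:]]")[[1]]
```

As in base R, positive and negative values cannot be mixed in one index vector, which matches the restriction stated for choose.lines and choose.columns.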

Author

Miriam Gjerdevik,
with Hakon K. Gjessing
Professor of Biostatistics
Division of Epidemiology
Norwegian Institute of Public Health

hakon.gjessing@uib.no

Details

When reading large data files, functions such as read.table can use a large amount of memory and be extremely time-consuming. Instead of reading the entire file at once, lineByLine reads one line at a time, modifies the line using linefunc, and then writes the line to outfile.
Users may supply their own line-converting function. Its first argument must be x, a character vector representing a single line of the file, split at spaces. Additional arguments may be included; these are passed on through the ... argument of lineByLine. If verbose is "TRUE", the function should display some output. The function must return the modified vector.
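
The read, modify, write loop described above can be sketched in base R as follows. This is not Haplin's implementation: processLines is a hypothetical helper written for illustration, and it leaves out line and column selection, overwrite checks and progress output.

```r
## Hypothetical helper (not part of Haplin): a minimal read-modify-write loop
processLines <- function(infile, outfile, linefunc = identity, col.sep = " ") {
  con.in <- file(infile, open = "r")
  on.exit(close(con.in), add = TRUE)
  con.out <- file(outfile, open = "w")
  on.exit(close(con.out), add = TRUE)
  n <- 0
  repeat {
    ## Read a single line; a zero-length result signals end of file
    line <- readLines(con.in, n = 1)
    if (length(line) == 0) break
    n <- n + 1
    ## Split into columns, convert, and write the converted line
    x <- strsplit(line, split = col.sep)[[1]]
    writeLines(paste(linefunc(x), collapse = col.sep), con = con.out)
  }
  invisible(n)
}

## Small demonstration on temporary files
infile <- tempfile(fileext = ".txt")
outfile <- tempfile(fileext = ".txt")
writeLines(c("a b c", "d e f"), infile)
n.read <- processLines(infile, outfile, linefunc = rev)
result <- readLines(outfile)
```

Only one line is held in memory at a time, which is the point of the approach. Note that the sketch reuses col.sep when pasting the columns back together, so it only makes sense for a literal separator such as a single space.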
The framework of the line-modifying function may look something like this:


lineModify <- function(x){
  .xnew <- x
  ## Define any modifications, for instance recoding missing values in a dataset from NA to 0.
  ## x is a character vector, so a missing value may appear as the string "NA":
  .xnew[is.na(.xnew) | .xnew == "NA"] <- "0"
  ## Just to monitor progress, display, for instance, 10 first elements, without newline:
  cat(paste(.xnew[seq_len(min(10, length(.xnew)))], collapse = " "))
  ## Return converted vector
  return(.xnew)
}

See Haplin:::lineConvert for an additional example of a line-modifying function.
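
A line-modifying function with additional arguments might look like the sketch below. recodeLine and its arguments from and to are hypothetical names chosen for illustration, and the progress display is left out for brevity.

```r
## Hypothetical line-modifying function with extra arguments
recodeLine <- function(x, from = "NA", to = "0") {
  .xnew <- x
  ## Replace every element matching one of the codes in 'from' by 'to'
  .xnew[.xnew %in% from] <- to
  return(.xnew)
}

## Applied directly to one split line
recoded <- recodeLine(c("id1", "NA", "1", "-9"), from = "-9", to = "NA")

if (FALSE) {
## Extra arguments are handed to linefunc through the ... argument
lineByLine(infile = "myfile.txt", outfile = "myfile_modified.txt",
linefunc = recodeLine, from = "-9", to = "NA")
}
```

Testing the function on a single character vector first, as done here, is a cheap way to check the conversion before running it over a large file.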

References

Web Site: https://haplin.bitbucket.io

Examples

if (FALSE) {

## Extract the first ten columns from "myfile.txt", 
## without reordering
lineByLine(infile = "myfile.txt", outfile = "myfile_modified.txt", 
choose.columns = c(1:10))

}
