read.csv.ddmatrix: A Simple Parallel CSV Reader

Description

Read in a table from a CSV file in parallel as a distributed matrix.

Usage

read.csv.ddmatrix(file, sep = ",", nrows, ncols, header = FALSE, bldim = 4, num.rdrs = 1, ICTXT = 0, exact.linecount = TRUE)

Arguments

file

csv file name.

sep

separator character.

nrows, ncols

dimensions of the csv file. Either may be omitted from the function call.

header

logical indicating whether the file has a character header row.

bldim

the blocking dimension for block-cyclically distributing the matrix across the process grid.

num.rdrs

number of processes to be used to read in the table.

ICTXT

BLACS context number for the returned distributed matrix.

exact.linecount

logical. If nrows is missing, this determines whether the exact number of rows should be determined (which requires a file read) or an estimate should be used. Default is TRUE, meaning that the file will be scanned.

Value

Returns a distributed matrix.

Details

The function reads in data from a csv file into a distributed matrix. This function sits somewhere between scan() and read.csv(), but for parallel reads into a distributed matrix.
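To make the bldim argument concrete, here is a minimal base R sketch of a one-dimensional block-cyclic assignment. The helper name block_cyclic_owner is hypothetical and is not part of any pbd package. It assumes processes are numbered from 0 and that the first block goes to process 0. A two-dimensional block-cyclic layout applies the same rule to rows and to columns independently.

```r
# Hypothetical helper, for illustration only (not a pbdDEMO/pbdDMAT function).
# Returns the 0-based owning process of each of n global indices when
# consecutive blocks of length `bldim` are dealt out cyclically to
# `nprocs` processes, starting at process 0 (an assumption of this sketch).
block_cyclic_owner <- function(n, bldim, nprocs) {
  block_id <- (seq_len(n) - 1L) %/% bldim  # which block each index falls in
  block_id %% nprocs                       # blocks are assigned round-robin
}

# 10 rows, blocks of 4, 2 processes:
# rows 1-4 go to process 0, rows 5-8 to process 1, rows 9-10 back to process 0.
owners <- block_cyclic_owner(n = 10, bldim = 4, nprocs = 2)
print(owners)
```

A smaller bldim spreads rows more evenly across processes at the cost of more, smaller blocks.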

The arguments nrows= and ncols= are optional. If they are left blank, they will be determined from the file. However, note that doing so is costly, so knowing the dimensions beforehand can greatly improve performance.

Frankly, though, the performance-minded should not be using csv files in the first place. Consider using the pbdNCDF4 package for managing data.
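As a rough sketch of a call that supplies the dimensions up front, the script below reads a small csv file with two reader processes. The file name x.csv, its 10 x 10 shape, and the script name demo.r are assumptions made for illustration. The script also assumes that all ranks share the same working directory, and that init.grid(), comm.rank(), barrier() and finalize() are available once pbdDEMO and its dependencies are loaded.

```r
# Save as "demo.r" (hypothetical name) and run under MPI with two processes,
# for example: mpiexec -np 2 Rscript demo.r
library(pbdDEMO, quietly = TRUE)
init.grid()

# Hypothetical input: rank 0 writes a 10 x 10 numeric csv with no header row
# into the working directory; the other ranks wait at the barrier until it exists.
csv_path <- "x.csv"
if (comm.rank() == 0) {
  x <- matrix(seq_len(100), nrow = 10, ncol = 10)
  write.table(x, file = csv_path, sep = ",",
              row.names = FALSE, col.names = FALSE)
}
barrier()

# nrows and ncols are passed explicitly, so they do not have to be
# determined from the file.
dx <- read.csv.ddmatrix(csv_path, sep = ",", nrows = 10, ncols = 10,
                        header = FALSE, bldim = 4, num.rdrs = 2, ICTXT = 0)
print(dx)

finalize()
```

Omitting nrows and ncols from the same call should also work, at the cost of the extra work described above.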