dstrfw: Split fixed width input into a dataframe

Description

dstrfw takes a raw or character vector and splits it into a dataframe according to a vector of fixed widths.

Usage

dstrfw(x, col_types, widths, nsep = NA, strict=TRUE, skip=0L, nrows=-1L)

Value

If nsep is specified then all characters up to (but excluding) the occurrence of nsep are treated as the index name. The remaining characters are split using the widths vector into fields (columns). dstrfw will fail with an error if any line does not contain enough characters to fill all expected columns, unless strict is FALSE. When strict is FALSE, excess columns are ignored. Lines may contain fewer columns (but not partial ones unless strict is FALSE), in which case the missing columns are set to NA.

dstrfw returns a data.frame with as many rows as there are lines in the input and as many columns as there are non-NA values in col_types, plus an additional column if nsep is specified. The colnames (other than the row index) are set to 'V' concatenated with the column number unless col_types is a named vector, in which case the names are inherited.
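The splitting rule described above can be sketched in base R. This is only an illustration of the rule, not how dstrfw is implemented: split_fw is a hypothetical helper, it handles a single line, it performs no type conversion, and it assumes nsep actually occurs in the line.

```r
# Hypothetical helper (not part of iotools): shows how one line is cut up.
split_fw <- function(line, widths, nsep = NA) {
  index <- NA_character_
  if (!is.na(nsep)) {
    # Position of the first occurrence of the separator
    pos <- as.integer(regexpr(nsep, line, fixed = TRUE))
    index <- substr(line, 1L, pos - 1L)          # characters before nsep
    line <- substr(line, pos + 1L, nchar(line))  # remainder holds the fields
  }
  ends <- cumsum(widths)            # last character of each field
  starts <- ends - widths + 1L      # first character of each field
  list(index = index, fields = substring(line, starts, ends))
}

res <- split_fw("bear\t22.7horse+3", widths = c(4L, 5L, 2L), nsep = "\t")
res$index   # "bear"
res$fields  # "22.7" "horse" "+3"
```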

Arguments

x

character vector (each element is treated as a row) or a raw vector (newlines separate rows)

col_types

required; a character vector or a list giving the classes to be assumed for the columns of the output dataframe. If it is a list, class(x)[1] of each element will be used to determine the class of that column. It will not be recycled, and must be at least as long as the longest row if strict is TRUE.

Possible values are "NULL" (the column is skipped), one of the six atomic vector types ('character', 'numeric', 'logical', 'integer', 'complex', 'raw'), or 'POSIXct'. 'POSIXct' will parse dates of the form "YYYY-MM-DD hh:mm:ss.sss", assuming the GMT time zone. The separators between digits can be any non-digit characters and only the date part is mandatory. See also fasttime::asPOSIXct for details.

widths

a vector of widths of the columns. Must be the same length as col_types.
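As a hedged illustration of how col_types and widths line up position by position, the sketch below uses a named col_types vector with a "NULL" entry to drop the separator. The input strings and column names are made up for the example, and the call is guarded so that it does nothing when iotools is not installed.

```r
# Made-up input: 4-char name, 1-char tab, 4-char number (9 characters per line)
input <- c("bear\t22.7", "pear\t 3.4")

# One width per col_types entry; "NULL" skips the separator column
types  <- c(name = "character", sep = "NULL", weight = "numeric")
widths <- c(4L, 1L, 4L)
stopifnot(length(types) == length(widths))

if (requireNamespace("iotools", quietly = TRUE)) {
  z <- iotools::dstrfw(x = input, col_types = types, widths = widths)
  print(z)
}
```

Because col_types is named here, the kept columns should inherit those names rather than the default 'V' prefix.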

nsep

index name separator (single character) or NA if no index names are included

strict

logical; if FALSE then dstrfw will not fail on parsing errors, otherwise input not matching the format (e.g. more columns than expected) will cause an error.

skip

integer: the number of lines of the data file to skip before beginning to read data.

nrows

integer: the maximum number of rows to read in. Negative and other invalid values are ignored.
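A small sketch of skip and nrows used together, assuming they behave as described above (lines are skipped first, then at most nrows rows are read). The input, including its header line, is made up for the example, and the call is guarded so that it does nothing when iotools is not installed.

```r
# Made-up input: one header line followed by three 9-character records
input <- c("name wght", "bear 22.7", "pear  3.4", "dogs 14.8")

if (requireNamespace("iotools", quietly = TRUE)) {
  # Skip the header, then read at most two of the remaining rows
  z <- iotools::dstrfw(x = input,
                       col_types = c("character", "numeric"),
                       widths = c(5L, 4L),
                       skip = 1L, nrows = 2L)
  print(z)
}
```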

Author

Taylor Arnold and Simon Urbanek

Details

If nsep is specified, the output of dstrfw contains an extra column called 'rowindex' containing the row index. This is used instead of the rownames to allow for duplicated indices (which are checked for and not allowed in a dataframe, unlike the case with a matrix).
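The difference between matrix and dataframe row names mentioned above can be seen in base R alone. The objects below are made up for the illustration and do not come from dstrfw.

```r
# A matrix accepts duplicated row names
m <- matrix(1:4, nrow = 2, dimnames = list(c("a", "a"), NULL))
rownames(m)

# A data.frame rejects them; capture the error message instead of stopping
dup_err <- tryCatch(
  data.frame(value = 1:2, row.names = c("a", "a")),
  error = function(e) conditionMessage(e)
)
dup_err

# Keeping the index in an ordinary column avoids the restriction
df <- data.frame(rowindex = c("a", "a"), value = 1:2)
df
```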

Examples


# Three records: index name, tab separator, then fixed-width fields
input = c("bear\t22.7horse+3", "pear\t 3.4mouse-3", "dogs\t14.8prime-8")
z = dstrfw(x = input, col_types = c("numeric", "character", "integer"),
      widths=c(4L,5L,2L), nsep="\t")
z

# Now without row names (treat separator as a 1 char width column with type NULL)
z = dstrfw(x = input,
    col_types = c("character", "NULL", "numeric", "character", "integer"),
    widths=c(4L,1L,4L,5L,2L))
z
