fsubset: Fast Subsetting Matrix-Like Objects

Description

fsubset returns subsets of vectors, matrices or data frames which meet conditions. It is programmed very efficiently and uses C source code from the data.table package. Especially for data frames, it is significantly (4-5 times) faster than subset or dplyr::filter. The methods also provide enhanced functionality compared to subset. The function ss provides an (internal generic) programmer's alternative to [ that does not drop dimensions and is significantly faster than [ for data frames.

Usage

fsubset(x, …)
sbt(x, …)     # Shortcut for fsubset
# S3 method for default
fsubset(x, subset, …)
# S3 method for matrix
fsubset(x, subset, …, drop = FALSE)
# S3 method for data.frame
fsubset(x, subset, …)
# Fast subsetting (replaces `[` with drop = FALSE, programmer's choice)
ss(x, i, j)

Arguments

x

object to be subsetted.

subset

logical expression indicating elements or rows to keep; missing values are taken as FALSE. The default and matrix methods only support logical vectors or row-indices (the matrix method also accepts a character vector of rownames if the matrix has rownames). The data frame method also supports logical vectors or row-indices in place of an expression.

…

For the matrix or data frame method: multiple comma-separated expressions indicating columns to select. Otherwise: further arguments to be passed to or from other methods.

drop

passed on to [ indexing operator. Only available for the matrix method.

i

positive or negative row-indices or a logical vector to subset the rows of x.

j

a vector of column names, positive or negative indices or a suitable logical vector to subset the columns of x. Note: Negative indices are converted to positive ones using j <- seq_along(x)[j].
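The conversion of negative column indices mentioned in the note can be reproduced in base R. This is a small illustration of the expression j <- seq_along(x)[j] on the built-in airquality data, not the internal code of ss; the variable name j_pos is introduced here for illustration only.

```r
# Base R only: map a negative column index to the positive positions it keeps
x <- airquality            # columns: Ozone, Solar.R, Wind, Temp, Month, Day
j <- -(1:2)                # "drop the first two columns"
j_pos <- seq_along(x)[j]   # positions of the columns that remain
j_pos                      # 3 4 5 6
names(x)[j_pos]            # "Wind" "Temp" "Month" "Day"
```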

Value

An object similar to x containing just the selected elements (for a vector), rows and columns (for a matrix or data frame).

Details

fsubset is a generic function, with methods supplied for vectors, matrices, and data frames (including lists). It represents an improvement in both speed and functionality over subset. The function ss is an improvement of [ for subsetting vectors, matrices and data frames without dropping dimensions. It is significantly faster than [.data.frame. For subsetting columns alone, please see selecting and replacing columns.

For ordinary vectors, the result is .Call(C_subsetVector, x, subset), where C_subsetVector is an internal function in the data.table package. The subset can be integer or logical. Appropriate errors are delivered for wrong use.
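A minimal sketch of the vector case, assuming the collapse package is installed. The vector v is made up for illustration, and arguments are passed by position.

```r
library(collapse)

v <- c(3, 8, 1, 9, 6)

# Logical subset: keeps the same elements as v[v > 5]
big <- fsubset(v, v > 5)

# Integer subset: keeps the same elements as v[c(1L, 3L)]
picked <- fsubset(v, c(1L, 3L))

big
picked
```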

For matrices the implementation is all base-R but slightly more efficient and more versatile than subset.matrix. Thus it is possible to subset matrix rows using logical or integer vectors, or character vectors matching rownames. The drop argument is passed on to the indexing method for matrices.
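The three kinds of row subset described for matrices might look as follows. This is a sketch that assumes the collapse package is installed; the matrix m and its dimnames are invented for the example.

```r
library(collapse)

m <- matrix(1:12, nrow = 4,
            dimnames = list(c("a", "b", "c", "d"), c("x", "y", "z")))

by_lgl  <- fsubset(m, m[, "x"] > 2)    # logical vector over the rows
by_int  <- fsubset(m, c(1L, 4L))       # integer row-indices
by_name <- fsubset(m, c("a", "d"))     # character vector matching rownames

# With the default drop = FALSE, a single row is still a matrix
one_row <- fsubset(m, 1L)
dim(one_row)

# drop = TRUE is passed on to `[` and yields a plain vector here
as_vec <- fsubset(m, 1L, drop = TRUE)
is.null(dim(as_vec))
```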

For both matrices and data frames, the … argument can be used to subset columns, and is evaluated in a non-standard way. Thus it can support vectors of column names, indices or logical vectors, but also multiple comma-separated column names passed without quotes, each of which may also be replaced by a sequence of columns, i.e. col1:coln, and new column names may be assigned, e.g. fsubset(data, col1 > 20, newname = col2, col3:col6) (see examples).

For data frames, the subset argument is also evaluated in a non-standard way. Thus, besides vectors of row-indices or logical vectors, it supports logical expressions of the form col2 > 5 & col2 < col3 etc. (see examples). The data frame method uses C_subsetDT, an internal C function from the data.table package, to subset data frames; hence it is significantly faster than subset.data.frame. If fast data frame subsetting is required but non-standard evaluation is not, the function ss is slightly simpler and faster.
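To contrast the two styles, the sketch below selects the same rows of airquality once with a non-standard expression in fsubset and once with precomputed row-indices in ss. It assumes the collapse package is installed; the condition itself is arbitrary and only mirrors the col2 > 5 & col2 < col3 form.

```r
library(collapse)

# Non-standard evaluation: column names are looked up inside the data frame
a <- fsubset(airquality, Wind > 5 & Wind < Day)

# Standard evaluation: compute the row-indices first, then pass them to ss()
rows <- which(airquality$Wind > 5 & airquality$Wind < airquality$Day)
b <- ss(airquality, rows)

# Both approaches should select the same rows
nrow(a) == nrow(b)
```

In code that builds conditions programmatically, the ss form avoids having to construct expressions, which is presumably why it is described as the programmer's choice.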

Factors may have empty levels after subsetting; unused levels are not automatically removed. See fdroplevels for a way to drop all unused levels from a data frame.
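The behaviour with factor levels can be checked on a small made-up data frame. This sketch assumes the collapse package is installed; d, s and the column names g and v are illustrative only.

```r
library(collapse)

d <- data.frame(g = factor(c("a", "b", "c", "a")), v = 1:4)

# Remove all rows with level "c"; the level itself is retained
s <- fsubset(d, g != "c")
levels(s$g)

# fdroplevels removes levels that no longer occur in the data
levels(fdroplevels(s)$g)
```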

Examples

# Subset rows with a logical expression and select columns by name
fsubset(airquality, Temp > 90, Ozone, Temp)
fsubset(airquality, Temp > 90, OZ = Ozone, Temp) # With renaming
# Exclude columns with a minus sign, or select sequences of columns
fsubset(airquality, Day == 1, -Temp)
fsubset(airquality, Day == 1, -(Day:Temp))
fsubset(airquality, Day == 1, Ozone:Wind)
fsubset(airquality, Day == 1 & !is.na(Ozone), Ozone:Wind, Month)

# Standard evaluation with row and column indices
ss(airquality, 1:10, 2:3)         # Significantly faster than airquality[1:10, 2:3]
fsubset(airquality, 1:10, 2:3)    # This is possible but not advised
