setorder: Fast row reordering of a data.table by reference

Description

In data.table parlance, all set* functions change their input by reference. That is, no copy is made at all, other than temporary working memory, which is as large as one column. The only other data.table operator that modifies input by reference is :=. See the See Also section below for the other set* functions data.table provides.

setorder (and setorderv) reorders the rows of a data.table based on the columns (and column order) provided. It reorders the table by reference and is therefore very memory efficient.
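As a minimal sketch of what "by reference" means here (the table DT and its columns id and val are made up for illustration), the object keeps its memory address while its rows are rearranged in place:

```r
library(data.table)

# Illustrative table; the names are assumptions, not from the package docs
DT <- data.table(id = c(3L, 1L, 2L), val = c("c", "a", "b"))
addr_before <- address(DT)   # address of the data.table object

setorder(DT, id)             # reorders DT itself; no assignment is needed

same_object <- identical(address(DT), addr_before)
print(DT)
print(same_object)
```

No new table is created, so the result does not need to be assigned back to DT.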

Also x[order(.)] is now optimised internally to use data.table's fast order by default. data.table always reorders in C-locale. To sort by session locale, use x[base::order(.)] instead.
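The locale difference is easiest to see with mixed-case strings. In this sketch (the column s is illustrative) the C-locale result puts uppercase letters before lowercase ones, while the base::order result depends on the session locale and is therefore not shown:

```r
library(data.table)

DT <- data.table(s = c("b", "A", "a", "B"))

# Optimised to data.table's fast order: C-locale, uppercase sorts first
c_locale <- DT[order(s)]$s
print(c_locale)              # "A" "B" "a" "b"

# Falls back to base R ordering; the result depends on the session locale
session_locale <- DT[base::order(s)]$s
print(session_locale)
```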

bit64::integer64 type is also supported for reordering rows of a data.table.

Usage

setorder(x, ..., na.last=FALSE)
setorderv(x, cols, order=1L, na.last=FALSE)
# optimised to use data.table's internal fast order
# x[order(., na.last=TRUE)]

Arguments

x

A data.table.

...

The columns to sort by. Do not quote column names. If ... is missing (e.g., setorder(x)), x is rearranged based on all columns in ascending order by default. To sort by a column in descending order, prefix it with "-", e.g., setorder(x, a, -b, c). The -b works when b is of type character as well.

cols

A character vector of column names of x by which to order. Do not add "-" here. Use the order argument instead.

order

An integer vector with only possible values of 1 and -1, corresponding to ascending and descending order. The length of order must be either 1 or equal to that of cols. If length(order) == 1, it's recycled to length(cols).
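A short sketch of how order is recycled (the table and column names are illustrative): a single value applies to every column in cols, whereas a vector of the same length as cols sets the direction per column.

```r
library(data.table)

DT <- data.table(a = c(1L, 2L, 1L, 2L), b = c("x", "y", "y", "x"))

# length(order) == 1: recycled, so both a and b are sorted descending
setorderv(DT, c("a", "b"), order = -1L)
all_desc <- copy(DT)
print(all_desc)

# one entry per column: a ascending, b descending
setorderv(DT, c("a", "b"), order = c(1L, -1L))
print(DT)
```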

na.last

Logical. If TRUE, missing values in the data are placed last; if FALSE, they are placed first; if NA, they are removed. na.last=NA is valid only for x[order(., na.last)], and its default is TRUE. setorder and setorderv only accept TRUE/FALSE, with default FALSE.
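The different defaults can be compared side by side. This is a small sketch on a made-up one-column table:

```r
library(data.table)

DT <- data.table(v = c(2L, NA, 1L))

# x[order(.)]: na.last defaults to TRUE, so NA comes last
sub_default <- DT[order(v)]$v                 # 1 2 NA

# na.last = NA drops rows with missing values (subset operation only)
sub_dropped <- DT[order(v, na.last = NA)]$v   # 1 2

# setorder: na.last defaults to FALSE, so NA comes first
setorder(DT, v)
by_ref_default <- DT$v                        # NA 1 2

# setorder accepts TRUE as well
setorder(DT, v, na.last = TRUE)
by_ref_last <- DT$v                           # 1 2 NA
```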

Value

The input is modified by reference, and returned (invisibly) so it can be used in compound statements; e.g., setorder(DT,a,-b)[, cumsum(c), by=list(a,b)]. If you require a copy, take a copy first (using DT2 = copy(DT)). See ?copy.
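For instance, when the original row order must be preserved, one possible pattern is to copy first and chain on the copy. The table, columns and the aggregation are illustrative only:

```r
library(data.table)

DT <- data.table(a = c(2L, 1L, 2L), b = c(1L, 5L, 3L))

DT2 <- copy(DT)                      # deep copy; DT is left untouched
# setorder returns DT2 (invisibly), so a [ ] query can be chained onto it
res <- setorder(DT2, a, -b)[, .(total = sum(b)), by = a]

print(DT$a)    # 2 1 2  (original order kept)
print(DT2$a)   # 1 2 2  (copy was reordered)
print(res)
```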

Details

data.table implements fast radix based ordering. In versions <= 1.9.2, it was only capable of increasing order (ascending). From 1.9.4 on, the functionality has been extended to decreasing order (descending) as well.

setorder accepts unquoted column names (preceded by a - sign for descending order) and reorders data.table rows by reference, e.g., setorder(x, a, -b, c). Note that -b also works with columns of type character, unlike base::order, which requires -xtfrm(b) instead (which is slow). setorderv in turn accepts a character vector of column names and an integer vector of column order separately.
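A brief comparison on a made-up table: base::order needs xtfrm to negate a character column, while setorder takes -b directly. The letters used here sort the same way in common locales, so both approaches should agree.

```r
library(data.table)

DT <- data.table(a = c(1L, 1L, 2L), b = c("x", "y", "z"))

# Base R: a character column cannot be negated directly, so use xtfrm()
base_idx <- base::order(DT$a, -xtfrm(DT$b))
base_b <- DT$b[base_idx]

# data.table: -b works on the character column as written
setorder(DT, a, -b)

print(DT$b)      # "y" "x" "z"
print(base_b)
```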

Note that setkey still requires ascending order and will always sort in ascending order only. It differs from setorder in that it additionally sets the sorted attribute.

The na.last argument defaults to FALSE for setorder and setorderv, to be consistent with data.table's setkey, and to TRUE for x[order(.)], to be consistent with base::order. Only x[order(.)] can have na.last = NA, as it is a subset operation, as opposed to setorder or setorderv, which reorder the data.table by reference.

If setorder results in reordering of the rows of a keyed data.table, then its key will be set to NULL.
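The interaction between keys and setorder can be checked with key(). In this sketch (the table and its columns are illustrative), setkey sorts ascending and records the key, and a later descending setorder moves rows and drops it:

```r
library(data.table)

DT <- data.table(a = c(2L, 1L, 3L), b = c("p", "q", "r"))

setkey(DT, a)              # sorts ascending by a and marks DT as keyed
key_after_setkey <- key(DT)
print(key_after_setkey)    # "a"

setorder(DT, -a)           # rows are rearranged, so the key no longer holds
key_after_setorder <- key(DT)
print(key_after_setorder)  # NULL
```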

Examples

library(data.table)

set.seed(45L)
DT = data.table(A=sample(3, 10, TRUE),
         B=sample(letters[1:3], 10, TRUE), C=sample(10))

# setorder: A ascending, then B descending within each value of A
setorder(DT, A, -B)

# same as above, but using setorderv
setorderv(DT, c("A", "B"), c(1, -1))