setkey: Create key on a data table

Description

In data.table parlance, all set* functions change their input by reference. That is, no copy is made at all, other than temporary working memory, which is as large as one column. The only other data.table operator that modifies input by reference is :=. Check out the See Also section below for the other set* functions that data.table provides.

setkey() sorts a data.table and marks it as sorted (with an attribute sorted). The sorted columns are the key. The key can be any columns in any order. The columns are always sorted in ascending order. The table is changed by reference and is therefore very memory efficient.
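
As a minimal sketch of the behaviour described above (the table DT and its columns id and val are made up for illustration), the following shows the rows being reordered and the key being recorded in the sorted attribute:

```r
library(data.table)

# Illustrative table; the column names are arbitrary.
DT <- data.table(id = c(3L, 1L, 2L), val = c("c", "a", "b"))

setkey(DT, id)       # sorts the rows of DT by id, by reference

DT$id                # 1 2 3: rows are now in ascending order of id
DT$val               # "a" "b" "c": every column is reordered together
key(DT)              # "id"
attr(DT, "sorted")   # "id": the key is stored in the "sorted" attribute
```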

setindex() creates an index (or indices) on the provided columns. An index is simply the ordering of the dataset's rows according to the provided columns. This ordering is stored as a data.table attribute, and the dataset retains its original order in memory. See the Secondary indices and auto indexing vignette for more details.
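
For contrast with setkey(), here is a small sketch (table and column names are illustrative only) showing that setindex() leaves the rows where they are and does not set a key:

```r
library(data.table)

DT <- data.table(id = c(3L, 1L, 2L), val = c("c", "a", "b"))

setindex(DT, id)     # stores an ordering by id as an attribute

DT$id                # 3 1 2: the row order in memory is unchanged
indices(DT)          # "id"
key(DT)              # NULL: an index is not a key
```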

key() returns the data.table's key if it exists, and NULL if none exists.

haskey() returns a logical TRUE/FALSE depending on whether the data.table has a key (or not).
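
A short sketch of key() and haskey() before a key is set, after it is set, and after it is removed with NULL (the table here is hypothetical):

```r
library(data.table)

DT <- data.table(a = 3:1, b = c("z", "y", "x"))

key(DT)              # NULL: no key yet
haskey(DT)           # FALSE

setkey(DT, a, b)     # two-column key
key(DT)              # "a" "b"
haskey(DT)           # TRUE

setkey(DT, NULL)     # NULL removes the key
haskey(DT)           # FALSE
```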

Usage

setkey(x, ..., verbose=getOption("datatable.verbose"), physical = TRUE)
setkeyv(x, cols, verbose=getOption("datatable.verbose"), physical = TRUE)
setindex(...)
setindexv(x, cols, verbose=getOption("datatable.verbose"))
key(x)
indices(x, vectors = FALSE)
haskey(x)
key(x) <- value   #  DEPRECATED, please use setkey or setkeyv instead.

Arguments

x

A data.table.

…

The columns to sort by. Do not quote the column names. If … is missing (i.e. setkey(DT)), all the columns are used. NULL removes the key.

cols

A character vector of column names. For setindexv, this can be a list of character vectors, in which case each element will be applied as an index.

value

In (deprecated) key<-, a character vector (only) of column names.

verbose

Output status and information.

physical

TRUE changes the order of the data in RAM. FALSE adds a secondary key a.k.a. index.

vectors

Logical scalar, default FALSE. When TRUE, a list of character vectors is returned, each vector referring to one index.
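
To illustrate the cols and vectors arguments together, the sketch below passes a list of character vectors to setindexv() and then inspects the result with indices(). The table and column names are invented for the example.

```r
library(data.table)

DT <- data.table(a = c(2L, 1L, 3L), b = c("y", "x", "z"))

# Each list element becomes one index: one on a, one on (a, b).
setindexv(DT, list("a", c("a", "b")))

indices(DT)                        # index names as a character vector
idx <- indices(DT, vectors = TRUE) # one character vector per index
idx
```

With vectors = TRUE, a multi-column index is returned as its component column names rather than as a single joined name.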

Value

The input is modified by reference, and returned (invisibly) so it can be used in compound statements; e.g., setkey(DT,a)[J("foo")]. If you require a copy, take a copy first (using DT2=copy(DT)). copy() may also sometimes be useful before := is used to subassign to a column by reference. See ?copy.
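
The following sketch shows the compound form mentioned above, and the use of copy() when the original table must be left untouched. The data are illustrative.

```r
library(data.table)

DT <- data.table(a = c("foo", "bar", "foo"), v = 1:3)

# setkey() returns DT invisibly, so it can be subset immediately.
res <- setkey(DT, a)[J("foo")]
res$v                # 1 3: the rows where a == "foo"

# Take an explicit copy first if DT itself must not change.
DT2 <- copy(DT)
setkey(DT2, v)
key(DT)              # "a": unaffected by the change to DT2
key(DT2)             # "v"
```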

Details

setkey reorders (or sorts) the rows of a data.table by the columns provided. In versions 1.9+, for integer columns, a modified version of base's counting sort is implemented, which allows negative values as well. It is extremely fast, but is limited by the range of integer values being <= 1e5. If that fails, it falls back to a (fast) 4-pass radix sort for integers, implemented based on Pierre Terdiman's and Michael Herf's code (see links below). Similarly, a very fast 6-pass radix order for columns of type double is also implemented. This gives a speed-up of about 5-8x compared to 1.8.10 on setkey and all internal order/sort operations. Fast radix sorting is also implemented for character and bit64::integer64 types.

The sort is stable; i.e., the order of ties (if any) is preserved, in both versions - <=1.8.10 and >= 1.9.0.
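
Stability can be checked with a small example: rows that tie on the key column keep their original relative order. The pos column below is a hypothetical marker recording each row's starting position.

```r
library(data.table)

DT <- data.table(g = c(2L, 1L, 2L, 1L), pos = 1:4)

setkey(DT, g)

DT$g                 # 1 1 2 2
DT$pos               # 2 4 1 3: within each g, original order is kept
```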

In data.table versions <= 1.8.10, for columns of type integer, the sort is attempted with the very fast "radix" method in sort.list. If that fails, the sort reverts to the default method in order. For character vectors, data.table takes advantage of R's internal global string cache and implements a very efficient order, also exported as chorder.

In v1.7.8, the key<- syntax was deprecated. The <- method copies the whole table and we know of no way to avoid that copy without a change in R itself. Please use the set* functions instead, which make no copy at all. setkey accepts unquoted column names for convenience, whilst setkeyv accepts one vector of column names.

The problem (for data.table) with the copy by key<- (other than being slower) is that R doesn't maintain the over allocated truelength, but it looks as though it has. Adding a column by reference using := after a key<- was therefore a memory overwrite and eventually a segfault; the over allocated memory wasn't really there after key<-'s copy. data.tables now have an attribute .internal.selfref to catch and warn about such copies. This attribute has been implemented in a way that is friendly with identical() and object.size().

For the same reason, please use the other set* functions which modify objects by reference, rather than using the <- operator which results in copying the entire object.

It isn't good programming practice, in general, to use column numbers rather than names. This is why setkey and setkeyv only accept column names. If you use column numbers then bugs (possibly silent) can more easily creep into your code as time progresses if changes are made elsewhere in your code; e.g., if you add, remove or reorder columns in a few months' time, a setkey by column number will then refer to a different column, possibly returning incorrect results with no warning. (A similar concept exists in SQL, where "select * from ..." is considered poor programming style when a robust, maintainable system is required.) If you really wish to use column numbers, it's possible but deliberately a little harder; e.g., setkeyv(DT,colnames(DT)[1:2]).
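
A brief sketch of the workaround mentioned above, next to the preferred by-name form. The table is illustrative.

```r
library(data.table)

DT <- data.table(A = 3:1, B = c("c", "b", "a"), C = c(0.1, 0.2, 0.3))

# By position: resolves to whatever the first two columns are right now.
setkeyv(DT, colnames(DT)[1:2])
key(DT)              # "A" "B"

# By name: keeps meaning the same columns if the layout changes later.
setkeyv(DT, c("A", "B"))
key(DT)              # "A" "B"
```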

References

http://en.wikipedia.org/wiki/Radix_sort
http://en.wikipedia.org/wiki/Counting_sort
http://cran.at.r-project.org/web/packages/bit/index.html
http://stereopsis.com/radix.html

Examples

# NOT RUN {
# Type 'example(setkey)' to run these at prompt and browse output

DT = data.table(A=5:1,B=letters[5:1])
DT # before
setkey(DT,B)          # re-orders table and marks it sorted.
DT # after
tables()              # KEY column reports the key'd columns
key(DT)
keycols = c("A","B")
setkeyv(DT,keycols)   # rather than key(DT)<-keycols (which copies entire table)

DT = data.table(A=5:1,B=letters[5:1])
DT2 = DT              # does not copy
setkey(DT2,B)         # does not copy-on-write to DT2
identical(DT,DT2)     # TRUE. DT and DT2 are two names for the same keyed table

DT = data.table(A=5:1,B=letters[5:1])
DT2 = copy(DT)        # explicit copy() needed to copy a data.table
setkey(DT2,B)         # now just changes DT2
identical(DT,DT2)     # FALSE. DT and DT2 are now different tables

DT = data.table(A=5:1,B=letters[5:1])
setindex(DT)          # set indices
setindex(DT, A)
setindex(DT, B)
indices(DT)           # get indices single vector
indices(DT, vectors = TRUE) # get indices list
# }
