setkey sorts a data.table and marks it as sorted with an attribute "sorted". The sorted columns are the key. The key can be any number of columns. The data is always sorted in ascending order with NAs (if any) always first. The table is changed by reference and there is no memory used for the key (other than marking which columns the data is sorted by).
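As a minimal sketch of the behaviour described above (the table and its column names id and n are hypothetical, chosen only for illustration), setting a two-column key reorders the rows in place and records the key columns in the "sorted" attribute:

```r
library(data.table)

# Hypothetical example table; column names are illustrative only.
DT <- data.table(id = c("b", "a", "c", "a"), n = c(2L, 2L, 1L, 1L))

# Sorts DT by id, then n, in ascending order. DT is modified by
# reference, so the result does not need to be assigned.
setkey(DT, id, n)

attr(DT, "sorted")   # "id" "n"
DT$id                # "a" "a" "b" "c"
DT$n                 # 1 2 2 1
```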
There are three reasons setkey is desirable:
binary search and joins are faster when they detect they can use an existing key;
grouping by a leading subset of the key columns is faster because the groups are already gathered contiguously in RAM;
simpler, shorter syntax; e.g. DT["id",] finds the group "id" in the first column of DT's key using binary search. It may be helpful to think of a key as super-charged rownames: multi-column and multi-type.
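The lookups and grouping mentioned in this list can be sketched as follows. The table, its columns (id, grp, v) and the values are assumptions made up for the example; it shows the syntax only and does not measure speed.

```r
library(data.table)

# Hypothetical data; names and values are illustrative only.
DT <- data.table(
  id  = c("b", "a", "b", "c", "a"),
  grp = c(2L, 1L, 1L, 1L, 2L),
  v   = c(10, 20, 30, 40, 50)
)
setkey(DT, id, grp)

# A character i is matched against the first key column.
rows_b <- DT["b"]
rows_b$v              # 30 10

# .() supplies one value per key column, in key order.
row_a2 <- DT[.("a", 2L)]
row_a2$v              # 50

# Grouping by a leading subset of the key (here just id).
totals <- DT[, sum(v), by = id]
totals$V1             # 70 40 40
```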
NAs are always first because:
Integer NA is internally INT_MIN (a large negative number) in R. Keys and indexes are always in increasing order, so if NAs are first, no special treatment or branch is needed in many data.table internals involving binary search. Placing NAs last is deliberately not offered as an option, for speed, simplicity and robustness of internals at C level.
If any NAs are present then we believe it is better to display them up front (rather than hiding them at the end) to reduce the risk of not realizing NAs are present.
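A small illustration of the NA placement, using a hypothetical one-column table. Base R's sort() is shown only for contrast, since it drops NA by default and puts it last when na.last = TRUE:

```r
library(data.table)

# Hypothetical integer column containing one missing value.
DT <- data.table(x = c(3L, NA, 1L, 2L))
setkey(DT, x)

DT$x                                    # NA 1 2 3

# Base R for comparison: NA goes last only when asked for.
sort(c(3L, NA, 1L, 2L), na.last = TRUE) # 1 2 3 NA
```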
In data.table parlance, all set* functions change their input by reference. That is, no copy is made at all other than for temporary working memory, which is as large as one column. The only other data.table operator that modifies input by reference is :=. Check out the See Also section below for other set* functions data.table provides.
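One way to observe the by-reference behaviour is to give the same table a second name and compare it with an explicit copy(). This is a sketch under assumed names (alias, snapshot and the columns id, v, w are all hypothetical):

```r
library(data.table)

DT <- data.table(id = c("b", "a", "c"), v = 1:3)

alias    <- DT        # a second name for the same object, not a copy
snapshot <- copy(DT)  # an explicit, independent copy

# Keying through the alias also keys DT, because both names refer
# to the same table; the copy is left untouched.
setkey(alias, id)
key(DT)               # "id"
key(snapshot)         # NULL

# := likewise adds the column to DT in place.
DT[, w := v * 2L]
names(DT)             # "id" "v" "w"
names(snapshot)       # "id" "v"
```

Using copy() before a set* call is the usual way to keep the original table unchanged.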
setindex creates an index for the provided columns. This index is simply an ordering vector of the dataset's rows according to the provided columns. This order vector is stored as an attribute of the data.table and the dataset retains the original order of rows in memory. See vignette("datatable-secondary-indices-and-auto-indexing") for more details.
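The contrast with setkey can be sketched on a hypothetical table: after setindex the rows stay where they were and no key is set, while indices() reports the indexed column.

```r
library(data.table)

# Hypothetical data; column names are illustrative only.
DT <- data.table(id = c("b", "a", "c"), v = c(3L, 1L, 2L))

setindex(DT, v)

indices(DT)     # "v"
key(DT)         # NULL: an index is not a key
DT$id           # "b" "a" "c", original row order retained

# An ordinary subset on the indexed column.
hit <- DT[v == 2L]
hit$id          # "c"
```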
key returns the data.table's key if it exists; NULL if none exists.
haskey returns TRUE if the data.table has a key and FALSE otherwise.
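For instance, on a hypothetical two-column table, the two accessors report the state before and after a key is set:

```r
library(data.table)

DT <- data.table(id = c("b", "a"), v = 1:2)

before_has <- haskey(DT)   # FALSE
before_key <- key(DT)      # NULL

setkey(DT, id)

after_has <- haskey(DT)    # TRUE
after_key <- key(DT)       # "id"
```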