Learn R Programming

data.table (version 1.10.0)

data.table-package: Enhanced data.frame

Description

data.table inherits from data.frame. It offers fast and nemory efficient: file reader and writer, aggregations, updates, equi, non-equi, rolling, range and interval joins, in a short and flexible syntax, for faster development.

It is inspired by A[B] syntax in R where A is a matrix and B is a 2-column matrix. Since a data.table is a data.frame, it is compatible with R functions and packages that accept only data.frames. Type vignette(package="data.table") to get started. The Introduction to data.table vignette introduces data.table's x[i, j, by] syntax and is a good place to start. If you have read the vignettes and the help page below, please feel free to ask questions on Stack Overflow data.table tag or on datatable-help mailing list. To report a bug please type: bug.report(package = "data.table"). Please check the homepage for up to the minute live NEWS. Tip: one of the quickest ways to learn the features is to type example(data.table) and study the output at the prompt.

Usage

data.table(..., keep.rownames=FALSE, check.names=FALSE, key=NULL, stringsAsFactors=FALSE)
"["(x, i, j, by, keyby, with = TRUE, nomatch = getOption("datatable.nomatch"), # default: NA_integer_ mult = "all", roll = FALSE, rollends = if (roll=="nearest") c(TRUE,TRUE) else if (roll>=0) c(FALSE,TRUE) else c(TRUE,FALSE), which = FALSE, .SDcols, verbose = getOption("datatable.verbose"), # default: FALSE allow.cartesian = getOption("datatable.allow.cartesian"), # default: FALSE drop = NULL, on = NULL)

Arguments

...
Just as ... in data.frame. Usual recycling rules are applied to vectors of different lengths to create a list of equal length vectors.
keep.rownames
If ... is a matrix or data.frame, TRUE will retain the rownames of that object in a column named rn.
check.names
Just as check.names in data.frame.
key
Character vector of one or more column names which is passed to setkey. It may be a single comma separated string such as key="x,y,z", or a vector of names such as key=c("x","y","z").
stringsAsFactors
Logical (default is FALSE). Convert all character columns to factors?
x
A data.table.
i
Integer, logical or character vector, single column numeric matrix, expression of column names, list, data.frame or data.table.

integer and logical vectors work the same way they do in [.data.frame except logical NAs are treated as FALSE.

expression is evaluated within the frame of the data.table (i.e. it sees column names as if they are variables) and can evaluate to any of the other types.

character, list and data.frame input to i is converted into a data.table internally using as.data.table.

If i is a data.table, the columns in i to be matched against x can be specified using one of these ways:

  • on argument (see below). It allows for both equi- and the newly implemented non-equi joins.

  • If not, x must be keyed. Key can be set using setkey. If i is also keyed, then first key column of i is matched against first key column of x, second against second, etc..
  • If i is not keyed, then first column of i is matched against first key column of x, second column of i against second key column of x, etc...

    This is summarised in code as min(length(key(x)), if (haskey(i)) length(key(i)) else ncol(i)).

    Using on= is recommended (even during keyed joins) as it helps understand the code better and also allows for non-equi joins. When the binary operator == alone is used, an equi join is performed. In SQL terms, x[i] then performs a right join by default. i prefixed with ! signals a not-join or not-select.

    Support for non-equi join was recently implemented, which allows for other binary operators >=, >, <= and="" <<="" code="">.

    See Keys and fast binary search based subset and Secondary indices and auto indexing.

    Advanced: When i is a single variable name, it is not considered an expression of column names and is instead evaluated in calling scope.

    j
    When with=TRUE (default), j is evaluated within the frame of the data.table; i.e., it sees column names as if they are variables. This allows to not just select columns in j, but also compute on them e.g., x[, a] and x[, sum(a)] returns x$a and sum(x$a) as a vector respectively. x[, .(a, b)] and x[, .(sa=sum(a), sb=sum(b))] returns a two column data.table each, the first simply selecting columns a, b and the second computing their sums.

    The expression `.()` is a shorthand alias to list(); they both mean the same. As long as j returns a list, each element of the list becomes a column in the resulting data.table. This is the default enhanced mode.

    When with=FALSE, j can only be a vector of column names or positions to select (as in data.frame).

    Advanced: j also allows the use of special read-only symbols: .SD, .N, .I, .GRP, .BY.

    Advanced: When i is a data.table, the columns of i can be referred to in j by using the prefix i., e.g., X[Y, .(val, i.val)]. Here val refers to X's column and i.val Y's.

    Advanced: Columns of x can now be referred to using the prefix x. and is particularly useful during joining to refer to x's join columns as they are otherwise masked by i's. For example, X[Y, .(x.a-i.a, b), on="a"].

    See Introduction to data.table vignette and examples.

    by
    Column names are seen as if they are variables (as in j when with=TRUE). The data.table is then grouped by the by and j is evaluated within each group. The order of the rows within each group is preserved, as is the order of the groups. by accepts:

    • A single unquoted column name: e.g., DT[, .(sa=sum(a)), by=x]

  • a list() of expressions of column names: e.g., DT[, .(sa=sum(a)), by=.(x=x>0, y)]
  • a single character string containing comma separated column names (where spaces are significant since column names may contain spaces even at the start or end): e.g., DT[, sum(a), by="x,y,z"]
  • a character vector of column names: e.g., DT[, sum(a), by=c("x", "y")]
  • or of the form startcol:endcol: e.g., DT[, sum(a), by=x:z]
  • Advanced: When i is a list (or data.frame or data.table), DT[i, j, by=.EACHI] evaluates j for the groups in `DT` that each row in i joins to. That is, you can join (in i) and aggregate (in j) simultaneously. We call this grouping by each i. See this StackOverflow answer for a more detailed explanation until we roll out vignettes.

    Advanced: In the X[Y, j] form of grouping, the j expression sees variables in X first, then Y. We call this join inherited scope. If the variable is not in X or Y then the calling frame is searched, its calling frame, and so on in the usual way up to and including the global environment.

    keyby
    Same as by, but with an additional setkey() run on the by columns of the result, for convenience. It is common practice to use `keyby=` routinely when you wish the result to be sorted.
    with
    By default with=TRUE and j is evaluated within the frame of x; column names can be used as variables.

    When with=FALSE j is a character vector of column names, a numeric vector of column positions to select or of the form startcol:endcol, and the value returned is always a data.table. with=FALSE is often useful in data.table to select columns dynamically. Note that x[, cols, with=FALSE] is equivalent to x[, .SD, .SDcols=cols].

    nomatch
    Same as nomatch in match. When a row in i has no match to x, nomatch=NA (default) means NA is returned. 0 means no rows will be returned for that row of i. Use options(datatable.nomatch=0) to change the default value (used when nomatch is not supplied).
    mult
    When i is a list (or data.frame or data.table) and multiple rows in x match to the row in i, mult controls which are returned: "all" (default), "first" or "last".
    roll
    When i is a data.table and its row matches to all but the last x join column, and its value in the last i join column falls in a gap (including after the last observation in x for that group), then:

    • +Inf (or TRUE) rolls the prevailing value in x forward. It is also known as last observation carried forward (LOCF).
    • -Inf rolls backwards instead; i.e., next observation carried backward (NOCB).
    • finite positive or negative number limits how far values are carried forward or backward.
    • "nearest" rolls the nearest value instead.

    Rolling joins apply to the last join column, generally a date but can be any variable. It is particularly fast using a modified binary search. A common idiom is to select a contemporaneous regular time series (dts) across a set of identifiers (ids): DT[CJ(ids,dts),roll=TRUE] where DT has a 2-column key (id,date) and CJ stands for cross join.

    rollends
    A logical vector length 2 (a single logical is recycled) indicating whether values falling before the first value or after the last value for a group should be rolled as well.
    • If rollends[2]=TRUE, it will roll the last value forward. TRUE by default for LOCF and FALSE for NOCB rolls.
    • If rollends[1]=TRUE, it will roll the first value backward. TRUE by default for NOCB and FALSE for LOCF rolls.

    When roll is a finite number, that limit is also applied when rolling the ends.

    which
    TRUE returns the row numbers of x that i matches to. If NA, returns the row numbers of i that have no match in x. By default FALSE and the rows in x that match are returned.
    .SDcols
    Specifies the columns of x to be included in the special symbol .SD which stands for Subset of data.table. May be character column names or numeric positions. This is useful for speed when applying a function through a subset of (possible very many) columns; e.g., DT[, lapply(.SD, sum), by="x,y", .SDcols=301:350]. For convenient interactive use, the form startcol:endcol is also allowed (as in by), e.g., DT[, lapply(.SD, sum), by=x:y, .SDcols=a:f]
    verbose
    TRUE turns on status and information messages to the console. Turn this on by default using options(datatable.verbose=TRUE). The quantity and types of verbosity may be expanded in future.

    allow.cartesian
    FALSE prevents joins that would result in more than nrow(x)+nrow(i) rows. This is usually caused by duplicate values in i's join columns, each of which join to the same group in `x` over and over again: a misspecified join. Usually this was not intended and the join needs to be changed. The word 'cartesian' is used loosely in this context. The traditional cartesian join is (deliberately) difficult to achieve in data.table: where every row in i joins to every row in x (a nrow(x)*nrow(i) row result). 'cartesian' is just meant in a 'large multiplicative' sense.
    drop
    Never used by data.table. Do not use. It needs to be here because data.table inherits from data.frame. See datatable-faq.
    on
    Indicate which columns in i should be joined with columns in x along with the type of binary operator to join with. When specified, this overrides the keys set on x and i. There are multiple ways of specifying on argument:
    • As a character vector, e.g., X[Y, on=c("a", "b")]. This assumes both these columns are present in X and Y.
    • As a named character vector, e.g., X[Y, on=c(x="a", y="b")]. This is useful when column names to join by are different between the two tables.

      NB: X[Y, on=c("a", y="b")] is also possible if column "a" is common between the two tables.

    • For convenience during interactive scenarios, it is also possible to use .() syntax as X[Y, on=.(a, b)].
    • From v1.9.8, (non-equi) joins using binary operators >=, >, <=, <<="" code=""> are also possible, e.g., X[Y, on=c("x>=a", "y<=b")]< code="">, or for interactive use as X[Y, on=.(x>=a, y<=b)]< code="">.

    See examples as well as Secondary indices and auto indexing.

    Details

    data.table builds on base R functionality to reduce 2 types of time:

    1. programming time (easier to write, read, debug and maintain), and
    2. compute time (fast and memory efficient).

    The general form of data.table syntax is:

        DT[ i,  j,  by ] # + extra arguments
            |   |   |
            |   |    -------> grouped by what?
            |    -------> what to do?
             ---> on which rows?
    

    The way to read this out loud is: "Take DT, subset rows by i, then compute j grouped by by. Here are some basic usage examples expanding on this definition. See the vignette (and examples) for working examples.

        X[, a]                      # return col 'a' from X as vector. If not found, search in parent frame.
        X[, .(a)]                   # same as above, but return as a data.table.
        X[, sum(a)]                 # return sum(a) as a vector (with same scoping rules as above)
        X[, .(sum(a)), by=c]        # get sum(a) grouped by 'c'.
        X[, sum(a), by=c]           # same as above, .() can be ommitted in by on single expression for convenience
        X[, sum(a), by=c:f]         # get sum(a) grouped by all columns in between 'c' and 'f' (both inclusive)

    X[, sum(a), keyby=b] # get sum(a) grouped by 'b', and sort that result by the grouping column 'b' X[, sum(a), by=b][order(b)] # same order as above, but by chaining compound expressions X[c>1, sum(a), by=c] # get rows where c>1 is TRUE, and on those rows, get sum(a) grouped by 'c' X[Y, .(a, b), on="c"] # get rows where Y$c == X$c, and select columns 'X$a' and 'X$b' for those rows X[Y, .(a, i.a), on="c"] # get rows where Y$c == X$c, and then select 'X$a' and 'Y$a' (=i.a) X[Y, sum(a*i.a), on="c" by=.EACHI] # for *each* 'Y$c', get sum(a*i.a) on matching rows in 'X$c'

    X[, plot(a, b), by=c] # j accepts any expression, generates plot for each group and returns no data # see ?assign to add/update/delete columns by reference using the same consistent interface

    A data.table is a list of vectors, just like a data.frame. However :

    1. it never has or uses rownames. Rownames based indexing can be done by setting a key of one or more columns or done ad-hoc using the on argument (now preferred).
    2. it has enhanced functionality in [.data.table for fast joins of keyed tables, fast aggregation, fast last observation carried forward (LOCF) and fast add/modify/delete of columns by reference with no copy at all.

    See the see also section for the several other methods that are available for operating on data.tables efficiently.

    References

    https://github.com/Rdatatable/data.table/wiki (data.table homepage) http://crantastic.org/packages/data-table (User reviews) http://en.wikipedia.org/wiki/Binary_search

    See Also

    special-symbols, data.frame, [.data.frame, as.data.table, setkey, setorder, setDT, setDF, J, SJ, CJ, merge.data.table, tables, test.data.table, IDateTime, unique.data.table, copy, :=, alloc.col, truelength, rbindlist, setNumericRounding, datatable-optimize, fsetdiff, funion, fintersect, fsetequal, anyDuplicated, uniqueN, rowid, rleid, na.omit, frank

    Examples

    Run this code
    ## Not run: 
    # example(data.table)  # to run these examples at the prompt## End(Not run)
    
    DF = data.frame(x=rep(c("b","a","c"),each=3), y=c(1,3,6), v=1:9)
    DT = data.table(x=rep(c("b","a","c"),each=3), y=c(1,3,6), v=1:9)
    DF
    DT
    identical(dim(DT), dim(DF))    # TRUE
    identical(DF$a, DT$a)          # TRUE
    is.list(DF)                    # TRUE
    is.list(DT)                    # TRUE
    
    is.data.frame(DT)              # TRUE
    
    tables()
    
    # basic row subset operations
    DT[2]                          # 2nd row
    DT[3:2]                        # 3rd and 2nd row
    DT[order(x)]                   # no need for order(DT$x)
    DT[order(x), ]                 # same as above. The ',' is optional
    DT[y>2]                        # all rows where DT$y > 2
    DT[y>2 & v>5]                  # compound logical expressions
    DT[!2:4]                       # all rows other than 2:4
    DT[-(2:4)]                     # same
    
    # select|compute columns data.table way
    DT[, v]                        # v column (as vector)
    DT[, list(v)]                  # v column (as data.table)
    DT[, .(v)]                     # same as above, .() is a shorthand alias to list()
    DT[, sum(v)]                   # sum of column v, returned as vector
    DT[, .(sum(v))]                # same, but return data.table (column autonamed V1)
    DT[, .(sv=sum(v))]             # same, but column named "sv"
    DT[, .(v, v*2)]                # return two column data.table, v and v*2
    
    # subset rows and select|compute data.table way
    DT[2:3, sum(v)]                # sum(v) over rows 2 and 3, return vector
    DT[2:3, .(sum(v))]             # same, but return data.table with column V1
    DT[2:3, .(sv=sum(v))]          # same, but return data.table with column sv 
    DT[2:5, cat(v, "\n")]          # just for j's side effect
    
    # select columns the data.frame way
    DT[, 2, with=FALSE]            # 2nd column, returns a data.table always
    colNum = 2
    DT[, colNum, with=FALSE]       # same, equivalent to DT[, .SD, .SDcols=colNum]
    DT[["v"]]                      # same as DT[, v] but much faster
    
    # grouping operations - j and by
    DT[, sum(v), by=x]             # ad hoc by, order of groups preserved in result
    DT[, sum(v), keyby=x]          # same, but order the result on by cols
    DT[, sum(v), by=x][order(x)]   # same but by chaining expressions together
    
    # fast ad hoc row subsets (subsets as joins)
    DT["a", on="x"]                # same as x == "a" but uses binary search (fast)
    DT["a", on=.(x)]               # same, for convenience, no need to quote every column
    DT[.("a"), on="x"]             # same
    DT[x=="a"]                     # same, single "==" internally optimised to use binary search (fast)
    DT[x!="b" | y!=3]              # not yet optimized, currently vector scan subset
    DT[.("b", 3), on=c("x", "y")]  # join on columns x,y of DT; uses binary search (fast)
    DT[.("b", 3), on=.(x, y)]      # same, but using on=.()
    DT[.("b", 1:2), on=c("x", "y")]             # no match returns NA
    DT[.("b", 1:2), on=.(x, y), nomatch=0]      # no match row is not returned
    DT[.("b", 1:2), on=c("x", "y"), roll=Inf]   # locf, nomatch row gets rolled by previous row
    DT[.("b", 1:2), on=.(x, y), roll=-Inf]      # nocb, nomatch row gets rolled by next row
    DT["b", sum(v*y), on="x"]                   # on rows where DT$x=="b", calculate sum(v*y)
    
    # all together now
    DT[x!="a", sum(v), by=x]                    # get sum(v) by "x" for each i != "a"
    DT[!"a", sum(v), by=.EACHI, on="x"]         # same, but using subsets-as-joins
    DT[c("b","c"), sum(v), by=.EACHI, on="x"]   # same
    DT[c("b","c"), sum(v), by=.EACHI, on=.(x)]  # same, using on=.()
    
    # joins as subsets
    X = data.table(x=c("c","b"), v=8:7, foo=c(4,2))
    X
    
    DT[X, on="x"]                         # right join
    X[DT, on="x"]                         # left join
    DT[X, on="x", nomatch=0]              # inner join
    DT[!X, on="x"]                        # not join
    DT[X, on=.(y<=foo)]                   # NEW non-equi join (v1.9.8+)
    DT[X, on="y<=foo"]                    # same as above
    DT[X, on=c("y<=foo")]                 # same as above
    DT[X, on=.(y>=foo)]                   # NEW non-equi join (v1.9.8+)
    DT[X, on=.(x, y<=foo)]                # NEW non-equi join (v1.9.8+)
    DT[X, .(x,y,x.y,v), on=.(x, y>=foo)]  # Select x's join columns as well
    
    DT[X, on="x", mult="first"]           # first row of each group
    DT[X, on="x", mult="last"]            # last row of each group
    DT[X, sum(v), by=.EACHI, on="x"]      # join and eval j for each row in i
    DT[X, sum(v)*foo, by=.EACHI, on="x"]  # join inherited scope
    DT[X, sum(v)*i.v, by=.EACHI, on="x"]  # 'i,v' refers to X's v column
    DT[X, on=.(x, v>=v), sum(y)*foo, by=.EACHI] # NEW non-equi join with by=.EACHI (v1.9.8+)
    
    # setting keys
    kDT = copy(DT)                        # (deep) copy DT to kDT to work with it.
    setkey(kDT,x)                         # set a 1-column key. No quotes, for convenience.
    setkeyv(kDT,"x")                      # same (v in setkeyv stands for vector)
    v="x"
    setkeyv(kDT,v)                        # same
    # key(kDT)<-"x"                       # copies whole table, please use set* functions instead
    haskey(kDT)                           # TRUE
    key(kDT)                              # "x"
    
    # fast *keyed* subsets
    kDT["a"]                              # subset-as-join on *key* column 'x'
    kDT["a", on="x"]                      # same, being explicit using 'on=' (preferred)
    
    # all together
    kDT[!"a", sum(v), by=.EACHI]          # get sum(v) for each i != "a"
    
    # multi-column key
    setkey(kDT,x,y)                       # 2-column key
    setkeyv(kDT,c("x","y"))               # same
    
    # fast *keyed* subsets on multi-column key
    kDT["a"]                              # join to 1st column of key
    kDT["a", on="x"]                      # on= is optional, but is preferred
    kDT[.("a")]                           # same, .() is an alias for list()
    kDT[list("a")]                        # same
    kDT[.("a", 3)]                        # join to 2 columns
    kDT[.("a", 3:6)]                      # join 4 rows (2 missing)
    kDT[.("a", 3:6), nomatch=0]           # remove missing
    kDT[.("a", 3:6), roll=TRUE]           # locf rolling join
    kDT[.("a", 3:6), roll=Inf]            # same as above
    kDT[.("a", 3:6), roll=-Inf]           # nocb rolling join
    kDT[!.("a")]                          # not join
    kDT[!"a"]                             # same
    
    # more on special symbols, see also ?"special-symbols"
    DT[.N]                                # last row
    DT[, .N]                              # total number of rows in DT
    DT[, .N, by=x]                        # number of rows in each group
    DT[, .SD, .SDcols=x:y]                # select columns 'x' and 'y'
    DT[, .SD[1]]                          # first row of all columns
    DT[, .SD[1], by=x]                    # first row of 'y' and 'v' for each group in 'x'
    DT[, c(.N, lapply(.SD, sum)), by=x]   # get rows *and* sum columns 'v' and 'y' by group
    DT[, .I[1], by=x]                     # row number in DT corresponding to each group
    DT[, grp := .GRP, by=x]               # add a group counter column
    X[, DT[.BY, y, on="x"], by=x]         # join within each group
    
    # add/update/delete by reference (see ?assign)
    print(DT[, z:=42L])                   # add new column by reference
    print(DT[, z:=NULL])                  # remove column by reference
    print(DT["a", v:=42L, on="x"])        # subassign to existing v column by reference
    print(DT["b", v2:=84L, on="x"])       # subassign to new column by reference (NA padded)
    
    DT[, m:=mean(v), by=x][]              # add new column by reference by group
                                          # NB: postfix [] is shortcut to print()
    # advanced usage
    DT = data.table(x=rep(c("b","a","c"),each=3), v=c(1,1,1,2,2,1,1,2,2), y=c(1,3,6), a=1:9, b=9:1)
    
    DT[, sum(v), by=.(y%%2)]              # expressions in by
    DT[, sum(v), by=.(bool = y%%2)]       # same, using a named list to change by column name
    DT[, .SD[2], by=x]                    # get 2nd row of each group
    DT[, tail(.SD,2), by=x]               # last 2 rows of each group
    DT[, lapply(.SD, sum), by=x]          # sum of all (other) columns for each group
    DT[, .SD[which.min(v)], by=x]         # nested query by group
    
    DT[, list(MySum=sum(v),
              MyMin=min(v),
              MyMax=max(v)),
        by=.(x, y%%2)]                    # by 2 expressions
    
    DT[, .(a = .(a), b = .(b)), by=x]     # list columns
    DT[, .(seq = min(a):max(b)), by=x]    # j is not limited to just aggregations
    DT[, sum(v), by=x][V1<20]             # compound query
    DT[, sum(v), by=x][order(-V1)]        # ordering results
    DT[, c(.N, lapply(.SD,sum)), by=x]    # get number of observations and sum per group
    DT[, {tmp <- mean(y); 
          .(a = a-tmp, b = b-tmp)
          }, by=x]                        # anonymous lambdain 'j', j accepts any valid 
                                          # expression. TO REMEMBER: every element of 
                                          # the list becomes a column in result.
    pdf("new.pdf")
    DT[, plot(a,b), by=x]                 # can also plot in 'j'
    dev.off()
    
    # using rleid, get max(y) and min of all cols in .SDcols for each consecutive run of 'v'
    DT[, c(.(y=max(y)), lapply(.SD, min)), by=rleid(v), .SDcols=v:b]
    
    # Follow r-help posting guide, SUPPORT is here (*not* r-help) :
    # http://stackoverflow.com/questions/tagged/data.table
    # or
    # datatable-help@lists.r-forge.r-project.org
    
    ## Not run: 
    # vignette("datatable-intro")
    # vignette("datatable-reference-semantics")
    # vignette("datatable-keys-fast-subset")
    # vignette("datatable-secondary-indices-and-auto-indexing")
    # vignette("datatable-reshape")
    # vignette("datatable-faq")
    # 
    # 
    # test.data.table()          # over 5700 low level tests
    # 
    # # keep up to date with latest stable version on CRAN
    # update.packages()
    # # get the latest devel (needs Rtools for windows, xcode for mac)
    # install.packages("data.table", repos = "https://Rdatatable.github.io/data.table", type = "source")
    # 
    # ## End(Not run)

    Run the code above in your browser using DataLab