merge: Merge Two Data Tables

Description

Relatively quick merge of two data.tables based on common key columns (by default).

This merge method for data.table is meant to act very similarly to the merge method for data.frame, with the major exception being that the default columns used to merge two data.table inputs are the shared key columns rather than the shared columns with the same names. For a more data.table-centric (and faster) way of merging two data.tables, see [.data.table; e.g., x[y, ...]. In recent versions, however, merge() is much closer to the speed of x[y, ...]. See FAQ 1.12 for a detailed comparison of merge and x[y, ...]. Note that merge is a generic function in base R. It dispatches to either the merge.data.frame method or merge.data.table method depending on the class of its first argument. Typing ?merge at the prompt should present a choice of two links: the help pages for each of these merge methods. You don't need to use the full name of the method although you may if you wish; i.e., merge(DT1,DT2) is idiomatic R but you can bypass method dispatch by going direct if you wish: merge.data.table(DT1,DT2).

Usage

## S3 method for class 'data.table':
merge(x, y, by = NULL, all = FALSE, all.x = all, all.y = all, suffixes = c(".x", ".y"), 
allow.cartesian=getOption("datatable.allow.cartesian"),  # default FALSE
...)

Arguments

x, y

data tables. y is coerced to a data.table if it isn't one already.

A vector of shared column names in x and y to merge on. This defaults to the shared key columns between the two tables. If y has no key columns, this defaults to the key of x.

all

logical; all = TRUE is shorthand to save setting both all.x = TRUE and all.y = TRUE.

all.x

logical; if TRUE, then extra rows will be added to the output, one for each row in x that has no matching row in y. These rows will have 'NA's in those columns that are usually filled with values from

all.y

logical; analogous to all.x above.

suffixes

A character(2) specifying the suffixes to be used for making non-by column names unique. The suffix behavior works in a similar fashion as the merge.data.frame me

allow.cartesian

See allow.cartesian in [.data.table.

...

Not used at this time.

Value

A new data.table based on the merged data tables, sorted by the columns set (or inferred for) the by argument.

Details

Note that if the specified columns in by is not the key (or head of the key) of x or y, then a copy is first rekeyed prior to performing the merge. This might make this function perform slower than you are expecting. When secondary keys are implemented in future we expect performance in this case to improve.

Examples

Run this code

(dt1 <- data.table(A = letters[1:10], X = 1:10, key = "A"))
    (dt2 <- data.table(A = letters[5:14], Y = 1:10, key = "A"))
    merge(dt1, dt2)
    merge(dt1, dt2, all = TRUE)

    (dt1 <- data.table(A = letters[rep(1:3, 2)], X = 1:6, key = "A"))
    (dt2 <- data.table(A = letters[rep(2:4, 2)], Y = 6:1, key = "A"))
    merge(dt1, dt2, allow.cartesian=TRUE)

    (dt1 <- data.table(A = c(rep(1L, 5), 2L), B = letters[rep(1:3, 2)], X = 1:6, key = "A,B"))
    (dt2 <- data.table(A = c(rep(1L, 5), 2L), B = letters[rep(2:4, 2)], Y = 6:1, key = "A,B"))
    merge(dt1, dt2)
    merge(dt1, dt2, by="B", allow.cartesian=TRUE)

    # test it more:
    d1 <- data.table(a=rep(1:2,each=3), b=1:6, key="a,b")
    d2 <- data.table(a=0:1, bb=10:11, key="a")
    d3 <- data.table(a=0:1, key="a")
    d4 <- data.table(a=0:1, b=0:1, key="a,b")

    merge(d1, d2)
    merge(d2, d1)
    merge(d1, d2, all=TRUE)
    merge(d2, d1, all=TRUE)

    merge(d3, d1)
    merge(d1, d3)
    merge(d1, d3, all=TRUE)
    merge(d3, d1, all=TRUE)

    merge(d1, d4)
    merge(d1, d4, by="a", suffixes=c(".d1", ".d4"))
    merge(d4, d1)
    merge(d1, d4, all=TRUE)
    merge(d4, d1, all=TRUE)

Run the code above in your browser using DataLab