Laurae (version 0.0.0.9001)

DTrbind: data.table row binding (nearly) without copy

Description

This function attempts to rbind two data.tables without making copies. Compared to rbind, this can be up to 3X more memory efficient. With frequent garbage collection, a 2X memory efficiency is the minimum to expect.

Usage

DTrbind(dt1, dt2, low_mem = FALSE, collect = 0, silent = TRUE)

Arguments

dt1
Type: data.table. The data.table to combine on.
dt2
Type: data.table. The data.table to "copy" onto dt1.
low_mem
Type: boolean. When set to TRUE, prevents dt1 and dt2 from residing twice in memory by deleting dt1 and dt2 (WARNING: this empties your dt2), which saves memory. When set to FALSE, dt1 and dt2 are allowed to reside twice in memory, so memory usage increases. Defaults to FALSE.
collect
Type: integer. Forces a garbage collection every collect iterations to free memory. Setting this to 1 along with low_mem = TRUE leads to the lowest possible memory usage when binding two data.tables. It also prints verbose information about the process every time it garbage collects. Setting this to 0 disables garbage collection. Lower non-zero values increase the time required to bind the data.tables. Defaults to 0.
silent
Type: boolean. Forces silence during garbage collection iterations at no speed cost. Defaults to TRUE.
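
The interaction between low_mem and collect can be illustrated with a minimal sketch. It is not the Laurae implementation: bind_by_column is a hypothetical helper that assumes both tables share the same column names and hold atomic columns, and that one "iteration" corresponds to one column. It builds the result column by column, drops each source column by reference once it has been combined, and triggers gc() every collect columns.

```r
library(data.table)

# Hypothetical helper (an assumption for illustration, not Laurae's code).
bind_by_column <- function(dt1, dt2, low_mem = FALSE, collect = 0) {
  # copy() the names: names(dt1) can change as columns are dropped by reference
  cols <- copy(names(dt1))
  out <- vector("list", length(cols))
  names(out) <- cols
  for (i in seq_along(cols)) {
    col <- cols[i]
    # Combine one column at a time so only one extra column is alive at once
    out[[i]] <- c(dt1[[col]], dt2[[col]])
    if (low_mem) {
      # Drop the source columns by reference; this empties dt1 and dt2
      set(dt1, j = col, value = NULL)
      set(dt2, j = col, value = NULL)
    }
    # Optional periodic garbage collection, every `collect` columns
    if (collect > 0 && i %% collect == 0) gc(verbose = FALSE)
  }
  setDT(out)  # convert the list to a data.table by reference
  out
}

a <- data.table(x = 1:2, y = c("a", "b"))
b <- data.table(x = 3:4, y = c("c", "d"))
res <- bind_by_column(a, b, low_mem = TRUE, collect = 1)
```

In this sketch, low_mem = TRUE leaves a and b with no columns after the call, which mirrors the warning above about dt2 being emptied.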

Value

A data.table based on dt1.

Details

Warning: dt1 and dt2 are treated as pointers even though you pass the objects themselves to this function. This is how memory efficiency is achieved. dt1 and dt2 get overwritten on the fly.
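
The pointer behaviour is a general property of data.table rather than something specific to this function. As a small illustration, a function that modifies a data.table with set() changes the caller's object without any reassignment. The helper add_flag below is hypothetical and exists only for this example.

```r
library(data.table)

dt <- data.table(a = 1:3)

# Hypothetical helper: adds a column by reference with data.table::set()
add_flag <- function(x) {
  set(x, j = "flag", value = TRUE)
  invisible(x)
}

add_flag(dt)        # no assignment back to dt
cols_after <- names(dt)
print(cols_after)   # dt itself now carries the new column
```

If the original tables are still needed after a by-reference operation, one option is to pass copy(dt) instead, at the cost of the extra memory.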

Examples

library(data.table)
library(Laurae) # provides DTrbind
df1 <- data.frame(matrix(nrow = 50000, ncol = 1000))
df2 <- data.frame(matrix(nrow = 50000, ncol = 1000))
setDT(df1)
setDT(df2)
df1[is.na(df1)] <- 1
gc()
df2[is.na(df2)] <- 2
gc() # check memory usage
# open a task manager to check current RAM usage
df1 <- DTrbind(df1, df2, low_mem = TRUE, collect = 20, silent = FALSE)
# check RAM usage in a task manager: it is identical to what we had previously!
gc() # gives no gain
df3 <- data.frame(matrix(nrow = 50000, ncol = 1000))
setDT(df3)
# check the current RAM usage in a task manager
#df1 <- rbind(df1, df3) # RAM usage explodes!