split: Split data.table into chunks in a list

Description

Split method for data.table. Faster and more flexible than the data.frame method. Be aware that processing a list of data.tables is generally much slower than manipulating a single data.table by group using the by argument; read more on data.table.
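
As a rough illustration of the point above, the following sketch computes the same per-group sum in two ways: once by group inside a single data.table, and once over a list produced by split. The table `dt` and its columns `g` and `v` are made up for this example. No timing is shown here, only that both routes give the same result.

```r
library(data.table)

# Hypothetical toy data: two groups, four rows.
dt <- data.table(g = c("a", "b", "a", "b"), v = c(1, 2, 3, 4))

# Grouped computation in a single data.table (the recommended route).
by_group <- dt[, .(total = sum(v)), by = g]

# Same computation over a list of data.tables produced by split().
chunks  <- split(dt, by = "g")
by_list <- rbindlist(lapply(chunks, function(d) d[, .(g = g[1L], total = sum(v))]))

# Both approaches yield the same table.
all.equal(by_group, by_list)
```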

Usage

# S3 method for data.table
split(x, f, drop = FALSE, by, sorted = FALSE, keep.by = TRUE, flatten = TRUE, ..., verbose = getOption("datatable.verbose"))

Arguments

x

data.table

f

factor or list of factors, same as in split.data.frame. Use the by argument instead; f is provided only for consistency with the data.frame method.

drop

logical. The default FALSE does not drop empty list elements caused by factor levels that do not occur in those factors. Also works with the new arguments of the data.table split method.

by

character vector. Names of the columns on which to split. For length(by) > 1L and flatten FALSE, the result is a nested list with data.tables as leaves.

sorted

logical. With the default FALSE, the order of the groups being split on is retained. With TRUE, sorted list(s) are returned. Has no effect when the f argument is used.

keep.by

logical, default TRUE. Keep the column(s) provided to the by argument.

flatten

logical, default TRUE. Unlists nested lists of data.tables. When using f, results are always flattened to a list of data.tables.

...

passed on to the data.frame way of processing when the f argument is used.

verbose

logical, default getOption("datatable.verbose") (FALSE unless changed). When TRUE, prints to the console the data.table query used to split the data.
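
The effect of the sorted and keep.by arguments described above can be sketched on a small table. The data (`dt` with columns `g` and `v`) is invented for illustration only.

```r
library(data.table)

# Hypothetical toy data; groups first appear in the order "b", "a", "c".
dt <- data.table(g = c("b", "a", "b", "c"), v = 1:4)

# Default: list elements follow the order in which groups appear in the data.
s_default <- split(dt, by = "g")
names(s_default)

# sorted = TRUE: list elements are ordered by the values of `g`.
s_sorted <- split(dt, by = "g", sorted = TRUE)
names(s_sorted)

# keep.by = FALSE: the `g` column is removed from each chunk.
s_nokeep <- split(dt, by = "g", keep.by = FALSE)
names(s_nokeep[["b"]])
```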

Value

List of data.tables. If flatten is FALSE and length(by) > 1L, a recursively nested list with data.tables as leaves, grouped according to the by argument.
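
To make the two return shapes concrete, here is a minimal sketch on an invented table (`dt`, with hypothetical columns `x1`, `x2` and `y`) that shows how elements are addressed in the flat and in the nested result.

```r
library(data.table)

# Hypothetical toy data: every combination of x1 and x2 occurs once.
dt <- data.table(x1 = c("a", "a", "b", "b"),
                 x2 = c("c", "d", "c", "d"),
                 y  = 1:4)

# Flat result: one list, names combine the values of the by columns.
flat <- split(dt, by = c("x1", "x2"))
names(flat)

# Nested result: outer list by x1, inner lists by x2, data.tables as leaves.
nested <- split(dt, by = c("x1", "x2"), flatten = FALSE)
names(nested)
names(nested[["a"]])

# A leaf is reached by indexing one level per by column.
nested[["a"]][["d"]]
```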

Details

The f argument exists only for consistency with the data.frame method. Using the by argument is recommended instead: it is faster, more flexible, and by default preserves the order in which groups appear in the data.
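
The difference in ordering between f and by can be seen in a short sketch. The table `dt` and its columns are made up for illustration; the comments describe what base R's factor coercion does with a character vector passed as f.

```r
library(data.table)

# Hypothetical toy data; group "b" appears before group "a".
dt <- data.table(g = c("b", "a", "b"), v = 1:3)

# Using f: the character vector is treated as a factor, so the list
# follows the sorted factor levels, as with split.data.frame.
via_f <- split(dt, dt$g)
names(via_f)

# Using by: the list follows the order of appearance in the data.
via_by <- split(dt, by = "g")
names(via_by)
```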

Examples

library(data.table)
set.seed(123)
dt = data.table(x1 = rep(letters[1:2], 6), 
                x2 = rep(letters[3:5], 4), 
                x3 = rep(letters[5:8], 3), 
                y = rnorm(12))
dt = dt[sample(.N)]
df = as.data.frame(dt)

# split consistency with data.frame: `x, f, drop`
all.equal(
    split(dt, list(dt$x1, dt$x2)),
    lapply(split(df, list(df$x1, df$x2)), setDT)
)

# nested list using `flatten` arguments
split(dt, by=c("x1", "x2"))
split(dt, by=c("x1", "x2"), flatten=FALSE)

# dealing with factors
fdt = dt[, c(lapply(.SD, as.factor), list(y=y)), .SDcols=x1:x3]
fdf = as.data.frame(fdt)
sdf = split(fdf, list(fdf$x1, fdf$x2))
all.equal(
    split(fdt, by=c("x1", "x2"), sorted=TRUE),
    lapply(sdf[sort(names(sdf))], setDT)
)

# factors having unused levels, drop FALSE, TRUE
fdt = dt[, .(x1 = as.factor(c(as.character(x1), "c"))[-13L],
             x2 = as.factor(c("a", as.character(x2)))[-1L],
             x3 = as.factor(c("a", as.character(x3), "z"))[c(-1L,-14L)],
             y = y)]
fdf = as.data.frame(fdt)
sdf = split(fdf, list(fdf$x1, fdf$x2))
all.equal(
    split(fdt, by=c("x1", "x2"), sorted=TRUE),
    lapply(sdf[sort(names(sdf))], setDT)
)
sdf = split(fdf, list(fdf$x1, fdf$x2), drop=TRUE)
all.equal(
    split(fdt, by=c("x1", "x2"), sorted=TRUE, drop=TRUE),
    lapply(sdf[sort(names(sdf))], setDT)
)