Learn R Programming

dwtools (version 0.8.3.9)

build_hierarchy: Build hierarchy

Description

Detect hierarchies in the dataset and normalize to star schema, optionally deploy new model to db.

Usage

build_hierarchy(x, factname = "fact", dimnames = "auto", setkey = TRUE, deploy = FALSE, db.conn.name, .db.conns = getOption("dwtools.conns"), timing = getOption("dwtools.timing"), verbose = getOption("dwtools.verbose"))

Arguments

x
data.table source dataset.
factname
character, default fact.
dimnames
character currently only auto supported, default. Dimension names will be created based on the common words in the fields which forms the dimension.
setkey
logical if return tables with keys.
deploy
logical, TRUE will overwrite the data in the target tables in connection db.conn.name.
db.conn.name
character deploy db connection name.
.db.conns
list of connections uniquely named. See db function.
timing
logical measure timing for vectorized usage, read timing, for single row timing summary use timing(build_hierarchy(...)).
verbose
integer, if greater than 0 then print debugging messages.

Value

List of: tables (multiple normalized R data.tables) cardinality matrix represents groupings between all columns (computionally extensive for big datasets) like length(unq(col1))/nrow(unique(data.table(col1,col2))). lists of parents (including same entity attrs with any cardinality) and list of childs (including same entity attrs with exact cardinality) for each field.

Details

Only basic star schema normalization will be created. All numeric fields is considered as measures, others as dimensions (including integer fields). Allocation of columns to dimension is performed based on the groupings count unique of all variable pairs. See cardinality element of the result for cardinality matrix. Due to extensive computional processing of the function be aware it can take some time to return the results. You can use timing argument to register sub processes time or verbose to display processing messages.

See Also

joinbyv, db, timing

Examples

Run this code
suppressPackageStartupMessages(library(dwtools))

## simple 2 dimension case

X = dw.populate(N=1e5, scenario="star")
x <- joinbyv(X$SALES, join=list(X$CURRENCY, X$GEOGRAPHY))
dw <- build_hierarchy(x, factname="fact_sales")
sapply(dw$tables,ncol)
sapply(dw$tables,nrow)

## 5 dimensions

x <- dw.populate(N=1e5, scenario="denormalize")
names(x)
system.time(dw <- build_hierarchy(x,factname="fact_sales"))
# cardinality matrix, top 10x10
dw$cardinality[1:10,1:10]
# normalized tables into star schema
lapply(dw$tables,head,3)
# relation defintion, processing meta data also available
names(dw)

# better timing, setup ?db connection to automatically store logs in db
dw <- build_hierarchy(x,factname="fact_sales",timing=TRUE)
# print timing expressions and the rest of log entry
get.timing(FALSE,last=2L)[,{cat(paste0("\n# ",tag,"\n",expr),"\n",sep=""); .SD}][,.SD,.SDcols=-c("expr")]

## shiny app to browse model, will use following vars in Global Env:
# x - denormalized table
# dw - normalization results
if(interactive()) shiny::runApp(system.file("shinyBI", package="dwtools"))

## deploy to db

library(RSQLite)
sqlite1 = list(drvName="SQLite",dbname="sqlite1.db") # setup connection to db
sqlite1$conn = dbConnect(SQLite(), dbname=sqlite1$dbname)
options("dwtools.db.conns"=list(sqlite1=sqlite1))
x <- dw.populate(N=1e5, scenario="denormalize")
dw <- build_hierarchy(x,factname="fact_sales",deploy=TRUE,db.conn.name="sqlite1",timing=TRUE,verbose=1L)
get.timing()
db(c("dim_geog","fact_sales"))
dbDisconnect(sqlite1$conn); file.remove(sqlite1$dbname); options("dwtools.db.conns"=NULL); purge.timing()

## different data quality use cases tests:

# reorder columns
#setcolorder(x,sample(names(x),length(x),FALSE)) # random column order

# alter column name to disable common_words feature
#setnames(x,c("curr_code","curr_type"),c("curr1_code","curr2_type")) # no common words within fields in dimension

# broken hierarchy
#`:=`

# NAs
#`:=`

Run the code above in your browser using DataLab