
etl (version 0.4.1)

etl_cleanup: ETL functions for working with medium-sized data

Description

These generic functions provide a systematic approach for performing ETL (extract-transform-load) operations on medium-sized data.

Usage

etl_cleanup(obj, ...)

# S3 method for default
etl_cleanup(obj, delete_raw = FALSE, delete_load = FALSE,
  pattern = "\\.(csv|zip)$", ...)

etl_create(obj, ...)

# S3 method for default
etl_create(obj, ...)

etl_update(obj, ...)

# S3 method for default
etl_update(obj, ...)

etl_extract(obj, ...)

# S3 method for default
etl_extract(obj, ...)

# S3 method for etl_mtcars
etl_extract(obj, ...)

# S3 method for etl_cities
etl_extract(obj, ...)

etl_load(obj, ...)

# S3 method for default
etl_load(obj, ...)

etl_transform(obj, ...)

# S3 method for default
etl_transform(obj, ...)

# S3 method for etl_cities
etl_transform(obj, ...)

Value

Each one of these functions returns an etl object, invisibly.

Arguments

obj

an etl object

...

arguments passed to methods

delete_raw

should files be deleted from the raw_dir?

delete_load

should files be deleted from the load_dir?

pattern

regular expression matching file names to be deleted. By default, this matches filenames ending in .csv and .zip.
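
For example, the sketch below (assuming the bundled etl_mtcars source and the default local SQLite backend) deletes the intermediate CSV files from the raw directory while leaving the load directory untouched:

library(etl)

cars <- etl("mtcars") %>% etl_create()

# remove raw .csv files only; files in load_dir are kept
cars %>% etl_cleanup(delete_raw = TRUE, delete_load = FALSE, pattern = "\\.csv$")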

Details

The purpose of these functions is to download data from a particular source on the Internet, process it, and load it into a SQL database server.

There are five primary functions:

etl_init

Initialize the database schema.

etl_extract

Download data from the Internet and store it locally in its raw form.

etl_transform

Manipulate the raw data such that it can be loaded into a database table. Usually, this means converting the raw data to (a series of) CSV files, which are also stored locally.

etl_load

Load the transformed data into the database.

etl_cleanup

Perform housekeeping, such as deleting unnecessary raw data files.

Additionally, two convenience functions chain these operations together:

etl_create

Run all five functions in succession. This is useful when you want to create the database from scratch.

etl_update

Run etl_extract, etl_transform, and etl_load in succession. This is useful when the database already exists but you want to insert new data.
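
Conceptually, the convenience functions simply chain the primary verbs. The sketch below (again using the bundled etl_mtcars source and the default local SQLite backend; a rough equivalence rather than the exact method bodies) illustrates the idea:

library(etl)

cars <- etl("mtcars")

# roughly what etl_update(cars) does
cars %>% etl_extract() %>% etl_transform() %>% etl_load()

# roughly what etl_create(cars) does
cars %>% etl_init() %>% etl_update() %>% etl_cleanup()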

See Also

etl, etl_init

Examples


if (FALSE) {
if (require(RPostgreSQL)) {
  db <- src_postgres(dbname = "mtcars", user = "postgres", host = "localhost")
  cars <- etl("mtcars", db)
}
if (require(RMySQL) && mysqlHasDefault()) {
  db <- src_mysql(dbname = "mtcars", user = "r-user",
                  host = "localhost", password = "mypass")
  cars <- etl("mtcars", db)
}
}
cars <- etl("mtcars")
cars %>%
 etl_extract() %>%
 etl_transform() %>%
 etl_load() %>%
 etl_cleanup()
cars

cars %>%
 tbl(from = "mtcars") %>%
 group_by(cyl) %>%
 summarise(N = n(), mean_mpg = mean(mpg))
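
# optional: use dplyr's collect() to bring the grouped summary back into R
# as an ordinary local data frame
cars %>%
 tbl(from = "mtcars") %>%
 group_by(cyl) %>%
 summarise(N = n(), mean_mpg = mean(mpg)) %>%
 collect()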

# do it all in one step, and peek at the SQL creation script
cars %>%
  etl_create(echo = TRUE)

# specify a directory for the data
if (FALSE) {
  cars <- etl("mtcars", dir = "~/dumps/mtcars/")
  str(cars)
}
cars <- etl("mtcars")
# Do it step-by-step
cars %>%
  etl_extract() %>%
  etl_transform() %>%
  etl_load()

# Note the somewhat imprecise data types for the columns. These are the defaults.
tbl(cars, "mtcars")

# But you can also specify your own schema if you want
schema <- system.file("sql", "init.sqlite", package = "etl")
cars %>%
  etl_init(schema) %>%
  etl_load()
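
# inspect the table again; the custom schema should now control the column types
tbl(cars, "mtcars")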
