correlate.tbl_dbi: Compute the correlation coefficient between two numerical data

Description

The correlate() compute Pearson's the correlation coefficient of the numerical(INTEGER, NUMBER, etc.) column of the DBMS table through tbl_dbi.

Usage

# S3 method for tbl_dbi
correlate(
  .data,
  ...,
  in_database = FALSE,
  collect_size = Inf,
  method = c("pearson", "kendall", "spearman")
)

Arguments

.data

a tbl_dbi.

...

one or more unquoted expressions separated by commas. You can treat variable names like they are positions. Positive values select variables; negative values to drop variables. If the first expression is negative, correlate() will automatically start with all variables. These arguments are automatically quoted and evaluated in a context where column names represent column positions. They support unquoting and splicing.

in_database

Specifies whether to perform in-database operations. If TRUE, most operations are performed in the DBMS. if FALSE, table data is taken in R and operated in-memory. Not yet supported in_database = TRUE.

collect_size

a integer. The number of data samples from the DBMS to R. Applies only if in_database = FALSE.

method

a character string indicating which correlation coefficient (or covariance) is to be computed. One of "pearson" (default), "kendall", or "spearman": can be abbreviated.

See vignette("EDA") for an introduction to these concepts.

Correlation coefficient information

The information derived from the numerical data compute is as follows.

var1 : names of numerical variable
var2 : name of the corresponding numeric variable
coef_corr : Pearson's correlation coefficient

Details

This function is useful when used with the group_by() function of the dplyr package. If you want to compute by level of the categorical data you are interested in, rather than the whole observation, you can use grouped_df as the group_by() function. This function is computed stats::cor() function by use = "pairwise.complete.obs" option.

Examples

Run this code

# NOT RUN {
library(dplyr)

# connect DBMS
con_sqlite <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

# copy heartfailure to the DBMS with a table named TB_HEARTFAILURE
copy_to(con_sqlite, heartfailure, name = "TB_HEARTFAILURE", overwrite = TRUE)

# Using pipes ---------------------------------
# Correlation coefficients of all numerical variables
con_sqlite %>% 
  tbl("TB_HEARTFAILURE") %>% 
  correlate()
 
# Positive values select variables
con_sqlite %>% 
  tbl("TB_HEARTFAILURE") %>% 
  correlate(platelets, sodium)
 
# Negative values to drop variables, and In-memory mode and collect size is 200
con_sqlite %>% 
  tbl("TB_HEARTFAILURE") %>% 
  correlate(-platelets, -sodium, collect_size = 200)
 
# Positions values select variables
con_sqlite %>% 
  tbl("TB_HEARTFAILURE") %>% 
  correlate(1)
 
# Positions values select variables
con_sqlite %>% 
  tbl("TB_HEARTFAILURE") %>% 
  correlate(-1, -2, -3, -5, -6)
 
# ---------------------------------------------
# Correlation coefficient
# that eliminates redundant combination of variables
con_sqlite %>% 
  tbl("TB_HEARTFAILURE") %>% 
  correlate() %>%
  filter(as.integer(var1) > as.integer(var2))

con_sqlite %>% 
  tbl("TB_HEARTFAILURE") %>% 
  correlate(platelets, sodium) %>%
  filter(as.integer(var1) > as.integer(var2))

# Using pipes & dplyr -------------------------
# Compute the correlation coefficient of creatinine variable by 'hblood_pressure'
# and 'death_event' variables. And extract only those with absolute
# value of correlation coefficient is greater than 0.2
con_sqlite %>% 
  tbl("TB_HEARTFAILURE") %>% 
  group_by(hblood_pressure, death_event) %>%
  correlate(creatinine) %>%
  filter(abs(coef_corr) >= 0.2)

# extract only those with 'hblood_pressure' variable level is "Yes",
# and compute the correlation coefficient of 'creatinine' variable
# by 'sex' and 'death_event' variables.
# And the correlation coefficient is negative and smaller than -0.3
con_sqlite %>% 
  tbl("TB_HEARTFAILURE") %>% 
  filter(hblood_pressure == "Yes") %>%
  group_by(sex, death_event) %>%
  correlate(creatinine) %>%
  filter(coef_corr < 0) %>%
  filter(abs(coef_corr) > 0.3)
 
# Disconnect DBMS   
DBI::dbDisconnect(con_sqlite)
 
# }