corr: Correlation table

Description

This function correlates every column of a dataframe, applying One Hot Smart Encoding (ohse()) to transform non-numerical features. Note that it automatically drops columns with fewer than 3 non-missing values and warns the user.

Usage

corr(
  df,
  method = "pearson",
  use = "pairwise.complete.obs",
  pvalue = FALSE,
  padjust = NULL,
  half = FALSE,
  dec = 6,
  ignore = NULL,
  dummy = TRUE,
  redundant = NULL,
  logs = FALSE,
  limit = 10,
  top = NA,
  ...
)

Value

data.frame. Square dimensions (N x N), holding the correlation between every pair of columns/variables in df. Note that when ohse() is applied, you may get more dimensions than df has columns.
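
Why the result can have more dimensions than df has columns can be illustrated with a minimal base R sketch. It uses model.matrix() as a stand-in for ohse() (an assumption made for illustration; it is not the package's implementation), and the toy data frame and its column names are hypothetical.

```r
# Hypothetical toy data: one numeric column and one 3-level factor.
toy <- data.frame(
  age = c(22, 38, 26, 35, 54, 2),
  class = factor(c("a", "b", "c", "a", "b", "c"))
)

# Stand-in for one hot encoding (NOT ohse() itself):
# one indicator column per level of `class`, no intercept.
dummies <- model.matrix(~ class - 1, data = toy)
encoded <- cbind(age = toy$age, dummies)

# Two input columns become four encoded columns, so the table is 4 x 4.
res <- as.data.frame(cor(encoded))
dim(res)
```
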

Arguments

df: Dataframe. Non-numerical columns are allowed: they will be filtered.
method: Character. Any of: c("pearson", "kendall", "spearman").
use: Character. Method for computing covariances in the presence of missing values. Check stats::cor for options.
pvalue: Boolean. If TRUE, returns a list with the correlations and the statistical significance (p-value) of each one.
padjust: Character. NULL to skip, or any of p.adjust.methods to calculate adjusted p-values for multiple comparisons using p.adjust().
half: Boolean. Return only half of the matrix? The redundant symmetrical correlations will be NA.
dec: Integer. Number of decimals to round correlations and p-values.
ignore: Character vector. Which columns should be ignored?
dummy: Boolean. Should One Hot (Smart) Encoding (ohse()) be applied to categorical columns?
redundant: Boolean. Should we keep redundant columns? i.e. if a column has only two different values, should we keep both new columns? If set to NULL, only binary variables will drop redundant columns.
logs: Boolean. Calculate log(x)+1 for numerical columns?
limit: Integer. Limit one hot encoding to the n most frequent values of each column. Set to NA to ignore argument.
top: Integer. Select top N most relevant variables? Filtered and sorted by mean of each variable's correlations.
...: Additional parameters passed to ohse, corr, and/or cor.test.
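
How method, use, dec, and half interact can be sketched with stats::cor alone. This is only an approximation under stated assumptions: the toy data is hypothetical, and blanking the upper triangle is an illustrative choice, since the description above does not say which half corr() sets to NA.

```r
# Hypothetical toy data with a few missing values.
toy <- data.frame(
  x = c(1, 2, 3, 4, NA, 6),
  y = c(2, 1, 4, 3, 6, NA),
  z = c(6, 5, 4, 3, 2, 1)
)

# `method` and `use` are forwarded to stats::cor; with
# "pairwise.complete.obs" each pair uses the rows where both values exist.
m <- cor(toy, method = "spearman", use = "pairwise.complete.obs")

# `dec`: round the correlations.
m <- round(m, 2)

# `half`: keep one triangle and set the symmetrical duplicates to NA
# (upper triangle chosen here as an assumption).
half <- m
half[upper.tri(half)] <- NA
half_df <- as.data.frame(half)
half_df
```
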

Examples


library(lares)
data(dft) # Titanic dataset
df <- dft[, 2:5]

# Correlation matrix (without redundancy)
corr(df, half = TRUE)

# Ignore specific column
corr(df, ignore = "Pclass")

# Calculate p-values as well
corr(df, pvalue = TRUE, limit = 1)

# Add a column with fewer than 3 non-missing values (dropped with a warning)
df$trash <- c(1, rep(NA, nrow(df) - 1))
# and use another method
corr(df, method = "spearman")
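
What pvalue and padjust compute can be approximated in base R with cor.test() and p.adjust(). The sketch below is a rough illustration under stated assumptions, not the function's internals: the toy data, the "~" pair labels, and the choice of "holm" are all hypothetical.

```r
# Hypothetical toy data with three numeric columns.
toy <- data.frame(
  x = c(1, 2, 3, 4, 5, 6, 7, 8),
  y = c(2, 1, 4, 3, 6, 5, 8, 7),
  z = c(5, 3, 8, 1, 7, 2, 6, 4)
)

# Every unordered pair of columns.
pairs <- combn(names(toy), 2, simplify = FALSE)

# One p-value per pair from a Pearson correlation test.
pvals <- vapply(
  pairs,
  function(p) cor.test(toy[[p[1]]], toy[[p[2]]])$p.value,
  numeric(1)
)
names(pvals) <- vapply(pairs, paste, character(1), collapse = "~")

# Adjust for multiple comparisons; any of p.adjust.methods works here.
adjusted <- p.adjust(pvals, method = "holm")
adjusted
```
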
