get_data: Extraction of metadata from dataframes

Description

get_data extracts descriptive metadata from the dataframe including information on missing data

Usage

get_data(X, matrixplot_sort = TRUE, plot_transform = TRUE)

Arguments

X

Original dataframe with samples in rows and variables in columns. Can also be the object returned by the clean function

matrixplot_sort

Boolean with default TRUE. If TRUE, the matrix plot will be sorted by missing/non-missing status. If FALSE, the original order of rows will be retained

plot_transform

Boolean with default TRUE. If TRUE, the matrix plot will plot all variables scaled (mean = 0, SD = 1). If FALSE, the matrix plot will show the variables on their original scale
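The scaling described for plot_transform can be illustrated with base R. This is only a sketch of standardizing columns to mean 0 and SD 1; the toy data frame is made up, and using scale() here is an assumption rather than a statement about how get_data does it internally.

```r
# Toy data frame with one missing value (made up for this sketch)
toy <- data.frame(age = c(30, 40, 50, NA), weight = c(60, 70, 80, 90))

# scale() centers each column to mean 0 and rescales it to SD 1;
# missing values are left out of the computation and stay NA
scaled <- as.data.frame(scale(toy))

colMeans(scaled, na.rm = TRUE)     # approximately 0 for every column
sapply(scaled, sd, na.rm = TRUE)   # 1 for every column
```

Putting variables on a common scale keeps one wide-ranging variable from dominating the color range of the plot.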

Value

Complete_cases

Number of complete cases (samples with no missing data in any columns)

Rows

Total number of rows (samples) in the dataframe

Columns

Total number of columns (variables) in the dataframe

Corr_matrix

Correlation matrix of all variables. The correlation matrix contains Pearson correlation coefficients based on pairwise correlations between variable pairs
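As an illustration of a pairwise Pearson correlation matrix on data with missing values, here is a minimal base R sketch. The toy data and the choice of use = "pairwise.complete.obs" are assumptions made for this example; the exact call used inside get_data is not shown in this page.

```r
# Toy data with scattered missing values (made up for this sketch)
toy <- data.frame(
  a = c(1, 2, 3, 4, NA),
  b = c(2, 4, 6, NA, 10),
  c = c(5, 3, NA, 1, 0)
)

# Each coefficient uses only the rows where both variables of the pair are observed
corr <- cor(toy, use = "pairwise.complete.obs", method = "pearson")
corr
```

Because each pair is computed on a different subset of rows, the coefficients in such a matrix are not all based on the same sample size.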

Fraction_missingness

Total fraction of missingness expressed as a number between 0 and 1, where 1 means 100% of data is missing and 0 means there are no missing values

Fraction_missingness_per_variable

Fraction of missingness per variable. A (named) numeric vector whose length equals the number of columns. Each variable's missingness is expressed as a number between 0 and 1, where 1 means 100% of data is missing and 0 means there are no missing values

Total_NA

Total number of missing values in the dataframe

NA_per_variable

Number of missing values per variable in the dataframe. A (named) numeric vector whose length equals the number of columns
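The counts and fractions listed so far (Complete_cases, Rows, Columns, Fraction_missingness, Fraction_missingness_per_variable, Total_NA, NA_per_variable) can be reproduced by hand in base R. The following is a sketch on a made-up data frame, with variable names chosen for this example only; it is not the function's own code.

```r
# Toy data frame: 4 rows, 3 columns, 3 missing values (made up for this sketch)
toy <- data.frame(
  x = c(1, NA, 3, 4),
  y = c(NA, NA, 3, 4),
  z = c(1, 2, 3, 4)
)

is_missing <- is.na(toy)                        # logical matrix, TRUE where a value is NA

complete_cases        <- sum(complete.cases(toy))  # rows with no missing value: 2
n_rows                <- nrow(toy)                 # 4
n_cols                <- ncol(toy)                 # 3
total_na              <- sum(is_missing)           # 3
na_per_variable       <- colSums(is_missing)       # x = 1, y = 2, z = 0
fraction_missingness  <- mean(is_missing)          # 3 / 12 = 0.25
fraction_per_variable <- colMeans(is_missing)      # x = 0.25, y = 0.5, z = 0
```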

MD_Pattern

Missing data pattern calculated using mice::md.pattern (see md.pattern in the mice package)

NA_Correlations

Correlation matrix of variables vs. missingness indicators (each variable converted to a boolean based on missingness status, yes/no). Point-biserial correlation coefficients are obtained for each variable pair using the complete observations in that pair. Higher correlation coefficients can indicate a MAR (missing at random) missingness pattern
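A point-biserial coefficient is the Pearson correlation between a continuous variable and a 0/1 indicator, so one cell of such a matrix can be sketched in base R as follows. The toy data and variable names are made up for illustration, and this is not necessarily how get_data computes NA_Correlations.

```r
# Toy data: 'marker' tends to be missing at higher ages (made up for this sketch)
toy <- data.frame(
  age    = c(25, 32, 47, 51, 62, 70),
  marker = c(1.2, 0.9, NA, 1.4, NA, NA)
)

# Convert missingness status of 'marker' to a 0/1 indicator
marker_missing <- as.numeric(is.na(toy$marker))

# Pearson correlation with a binary indicator equals the point-biserial coefficient;
# only rows where both inputs are observed are used
r_pb <- cor(toy$age, marker_missing, use = "complete.obs")
r_pb   # positive here: missingness in 'marker' increases with 'age'
```

A clearly non-zero coefficient means missingness in one variable is related to observed values of another, which is the situation MAR describes.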

NA_Correlation_plot

Plot based on NA_Correlations

min_PDM_thresholds

Small dataframe offering guidance on setting the min_PDM threshold in later steps of the pipeline. The first column lists candidate min_PDM thresholds; the second column gives the percentage of rows that would be retained at each threshold. The percentages are relative to the rows that have at least one missing value (complete observations are excluded), so a value of e.g. 80% means that the most common missingness patterns retained at that threshold cover 80% of the rows with missing data in the simulation step
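To make the idea of a threshold table concrete, the sketch below tabulates missingness patterns in base R and reports what share of incomplete rows survives at each threshold. The toy data, the threshold values and the rule "keep patterns observed at least min_PDM times" are all assumptions made for this example; the thresholds and the exact rule used by get_data may differ.

```r
# Toy data: 8 rows, of which 6 contain at least one missing value (made up)
toy <- data.frame(
  x = c(NA, NA, NA, 1, 2, 3, 4, 5),
  y = c(1, 2, 3, NA, NA, 6, 7, 8),
  z = c(1, 2, 3, 4, 5, NA, 7, 8)
)

# Encode each row's missingness pattern as a string, e.g. "100" = only x missing
pattern <- apply(is.na(toy), 1, function(r) paste(as.integer(r), collapse = ""))

# Drop complete rows and count how often each remaining pattern occurs
incomplete     <- pattern[grepl("1", pattern)]
pattern_counts <- table(incomplete)

# Hypothetical thresholds; assumed rule: keep patterns seen at least t times
thresholds <- c(1, 2, 3)
retained <- sapply(thresholds, function(t) {
  100 * sum(pattern_counts[pattern_counts >= t]) / length(incomplete)
})

data.frame(min_PDM = thresholds, retained_percent = retained)
```

In this toy case the three patterns occur 3, 2 and 1 times, so raising the threshold from 1 to 3 reduces the retained share of incomplete rows from 100% to 50%.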

Vars_above_half

Character vector of variable names with missingness higher than 50%
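A comparable vector can be built in one line of base R. This sketch uses a made-up data frame and assumes a strict "greater than 0.5" comparison, as the wording "higher than 50%" suggests.

```r
# Toy data: x is 75% missing, y is exactly 50% missing, z is complete (made up)
toy <- data.frame(
  x = c(NA, NA, NA, 4),
  y = c(1, NA, NA, 4),
  z = c(1, 2, 3, 4)
)

# Names of the columns whose fraction of missing values is strictly above 0.5
vars_above_half <- names(which(colMeans(is.na(toy)) > 0.5))
vars_above_half   # only "x"; y sits exactly at 50% and is not included
```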

Matrix_plot

Matrix plot where missing values are colored gray and available values are colored based on value range

Cluster_plot

Cluster plot of co-missingness. Variables with shared missingness patterns will branch closer to the bottom of the plot, while variables with no shared pattern will be joined by branches high in the plot
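One way to picture co-missingness clustering is to run hierarchical clustering on 0/1 missingness indicators, with variables as the objects being clustered. The sketch below does this in base R on made-up data; the binary distance and the default linkage are assumptions for illustration and may not match what get_data uses for Cluster_plot.

```r
# Toy data: a and b are missing in exactly the same rows, c in different rows (made up)
toy <- data.frame(
  a = c(NA, NA, 3, 4, 5, 6),
  b = c(NA, NA, 3, 4, 5, 6),
  c = c(1, 2, 3, NA, NA, NA)
)

# Variables in rows, 0/1 missingness status in columns
na_indicator <- t(is.na(toy)) * 1

# Binary distance: share of rows where exactly one of two variables is missing,
# among rows where at least one of them is missing
co_missing <- hclust(dist(na_indicator, method = "binary"))

co_missing$height   # a and b merge at height 0, c joins them at height 1

# Draw the dendrogram when running interactively
if (interactive()) plot(co_missing, main = "Co-missingness (sketch)")
```

Variables that merge near height 0 share their missingness pattern, which corresponds to branching close to the bottom of the plot.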

Details

This function uses the original dataframe and extracts descriptive metadata including dimensions, missingness fractions overall and by variable, number of missing values overall and by variable, missing data patterns, missing data correlations and missing data visualizations

Examples

# NOT RUN {
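# Clean the example data, treating the value -9 as the code for missing data
cleaned <- clean(clindata_miss, missingness_coding = -9)
# Extract metadata with the default settings
metadata <- get_data(cleaned)
# Same, but keep the original row order in the matrix plot
metadata <- get_data(cleaned, matrixplot_sort = FALSE)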

# }
