get_data
Extracts descriptive metadata from the dataframe, including information on missing data
get_data(X, matrixplot_sort = TRUE, plot_transform = TRUE)
X: Original dataframe with samples in rows and variables in columns. Can also be the object returned by the clean function
matrixplot_sort: Boolean with default TRUE. If TRUE, the matrix plot is sorted by missing/non-missing status. If FALSE, the original order of rows is retained
plot_transform: Boolean with default TRUE. If TRUE, the matrix plot shows all variables scaled (mean = 0, SD = 1). If FALSE, the matrix plot shows the variables on their original scale
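To illustrate what "scaled (mean = 0, SD = 1)" means for data with missing values, here is a minimal base R sketch. The toy dataframe is hypothetical, and the use of scale() is an assumption for illustration, not necessarily what get_data does internally.

```r
# Hypothetical toy data; NA marks a missing value
toy <- data.frame(
  age = c(34, NA, 51, 47, NA, 29),
  sbp = c(120, 142, NA, 135, 128, 118)
)

# scale() centers each column to mean 0 and rescales it to SD 1,
# computing both statistics from the non-missing values only;
# missing entries stay NA
toy_scaled <- as.data.frame(scale(toy))

col_means <- colMeans(toy_scaled, na.rm = TRUE)
col_sds   <- sapply(toy_scaled, sd, na.rm = TRUE)
```

After scaling, variables measured in different units share one color range in a matrix plot, which is presumably why the transformed view is the default.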
Number of complete cases (samples with no missing data in any columns)
Total number of rows (samples) in the dataframe
Total number of columns (variables) in the dataframe
Correlation matrix of all variables. The correlation matrix contains Pearson correlation coefficients calculated for each pair of variables
Total fraction of missingness expressed as a number between 0 and 1, where 1 means 100% of data is missing and 0 means there are no missing values
Fraction of missingness per variable. A (named) numeric vector whose length equals the number of columns. Each variable's missingness is expressed as a number between 0 and 1, where 1 means 100% of data is missing and 0 means there are no missing values
Total number of missing values in the dataframe
Number of missing values per variable in the dataframe. A (named) numeric vector whose length equals the number of columns
Missing data pattern calculated using mice::md.pattern (see md.pattern in the mice package)
Correlation matrix of variables vs. variables converted to boolean based on missingness status (yes/no). Point-biserial correlation coefficients are obtained for each variable pair using the complete observations in that pair. Higher correlation coefficients can indicate a MAR (missing at random) missingness pattern
Plot based on NA_Correlations
Small dataframe offering clues on how to set min_PDM thresholds in the next steps of the pipeline. The first column contains min_PDM thresholds, while the second column contains the percentages that would be retained by setting min_PDM to the respective values. These percentages are relative to the rows with at least one missing value (complete observations are excluded), so a value of e.g. 80% means that the 80% of incomplete rows that follow the most common missing data patterns are represented in the simulation step
Character vector of variable names with missingness higher than 50%
Matrix plot where missing values are colored gray and available values are colored based on value range
Cluster plot of co-missingness. Variables with shared missingness patterns will branch closer to the bottom of the plot, while variables without shared patterns will be joined by branches high in the plot
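Several of the numeric values listed above can be reproduced by hand in base R, which is a handy sanity check on the returned object. The sketch below uses a hypothetical toy dataframe, and the variable names are illustrative rather than the names of the elements returned by get_data.

```r
# Hypothetical toy data; NA marks a missing value
toy <- data.frame(
  age = c(34, NA, 51, 47, NA, 29),
  bmi = c(22.1, 27.4, NA, 30.2, NA, 24.8),
  sbp = c(NA, NA, NA, 135, NA, 118)
)

na_mask <- is.na(toy)                      # logical matrix, TRUE where missing

n_complete   <- sum(complete.cases(toy))   # rows with no missing values: 2
n_rows       <- nrow(toy)                  # 6
n_cols       <- ncol(toy)                  # 3
frac_total   <- mean(na_mask)              # overall fraction missing: 8/18
frac_per_var <- colMeans(na_mask)          # named vector, one entry per variable
total_na     <- sum(na_mask)               # 8
na_per_var   <- colSums(na_mask)           # named vector of counts

# Variables with more than 50% missingness: "sbp"
vars_over_half <- names(frac_per_var)[frac_per_var > 0.5]
```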
This function uses the original dataframe and extracts descriptive metadata including dimensions, missingness fractions overall and by variable, number of missing values overall and by variable, missing data patterns, missing data correlations and missing data visualizations
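The missing data correlations mentioned here can be sketched in base R, since a point-biserial coefficient is simply a Pearson correlation between a numeric variable and a 0/1 indicator. The simulated data, the object names and the choice of use = "pairwise.complete.obs" are assumptions made for illustration; the package may compute these values differently.

```r
set.seed(1)
n <- 200
x <- rnorm(n)
y <- 0.5 * x + rnorm(n)
z <- rnorm(n)

# Make y more likely to be missing when x is large (a MAR-like mechanism),
# and remove 20 values of z completely at random
y[x > 0.5 & runif(n) < 0.8] <- NA
z[sample(n, 20)] <- NA
toy <- data.frame(x, y, z)

# Pearson correlations between variables, using pairwise complete observations
corr_matrix <- cor(toy, use = "pairwise.complete.obs")

# Missingness indicators (1 = missing, 0 = observed) for variables with NAs
na_ind <- 1 * is.na(toy[, colSums(is.na(toy)) > 0, drop = FALSE])
colnames(na_ind) <- paste0(colnames(na_ind), "_is_NA")

# Point-biserial correlations: variables (rows) vs. indicators (columns).
# A variable paired with its own indicator gives NA (the indicator is
# constant wherever the variable is observed), hence suppressWarnings()
na_corr <- suppressWarnings(cor(toy, na_ind, use = "pairwise.complete.obs"))
```

In this simulation the entry for x against y_is_NA is clearly positive, because y was removed preferentially at high x. That kind of signal is what suggests a MAR mechanism.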
# NOT RUN {
cleaned <- clean(clindata_miss, missingness_coding = -9)
metadata <- get_data(cleaned)
metadata <- get_data(cleaned, matrixplot_sort = FALSE)
# }
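The min_PDM thresholds table described among the returned values can be approximated by hand. The sketch below assumes that min_PDM is the minimum number of rows a missing data pattern must have in order to be kept. That reading, the toy data, the thresholds and the column names are all assumptions for illustration.

```r
# Hypothetical toy data; NA marks a missing value
toy <- data.frame(
  a = c(1, NA, 3, NA, 5, NA, 7, 8, NA, 10),
  b = c(1, 2, NA, 4, NA, 6, 7, NA, 9, 10),
  c = c(1, 2, 3, 4, 5, 6, NA, NA, 9, 10)
)

# Keep only rows with at least one missing value
incomplete <- toy[!complete.cases(toy), , drop = FALSE]

# Encode each row's pattern as a string (1 = observed, 0 = missing)
pattern <- apply(!is.na(incomplete), 1,
                 function(r) paste(as.integer(r), collapse = ""))
pattern_counts <- table(pattern)

# Percentage of incomplete rows whose pattern occurs at least `thr` times
thresholds <- c(1, 2, 3, 5)
retained_pct <- sapply(thresholds, function(thr) {
  100 * sum(pattern_counts[pattern_counts >= thr]) / nrow(incomplete)
})

thresholds_table <- data.frame(min_PDM = thresholds,
                               retained_pct = retained_pct)
```

Here 8 rows are incomplete, with pattern counts 4, 2, 1 and 1, so raising the threshold from 1 to 2 to 3 retains 100%, 75% and 50% of the incomplete rows.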