
cellWise (version 2.5.3)

checkDataSet: Clean the dataset

Description

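This function checks the dataset X and sets aside the columns and rows that fail its checks: non-numeric variables, case-number variables, columns and rows with too many NAs, discrete variables, and variables with zero scale. It is used by the DDC and MacroPCA functions, but it can also be used on its own to clean a dataset for a different type of analysis.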

Usage

checkDataSet(X, fracNA = 0.5, numDiscrete = 3, precScale = 1e-12, silent = FALSE,
cleanNAfirst = "automatic")

Value

A list with components:

  • colInAnalysis
    Column indices of the columns used in the analysis.

  • rowInAnalysis
    Row indices of the rows used in the analysis.

  • namesNotNumeric
    Names of the variables which are not numeric.

  • namesCaseNumber
    Names of the variable(s) that contained the case numbers and were therefore removed.

  • namesNAcol
    Names of the columns left out due to too many NA's.

  • namesNArow
    Names of the rows left out due to too many NA's.

  • namesDiscrete
    Names of the discrete variables.

  • namesZeroScale
    Names of the variables with zero scale.

  • remX
    Remaining (cleaned) data after checkDataSet.

Arguments

X

X is the input data, and must be an \(n\) by \(d\) matrix or data frame.

fracNA

Only retain columns and rows with fewer NAs than this fraction. Defaults to \(0.5\).
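
The NA filter described above can be sketched in base R. This is an illustration only, not the cellWise implementation: the toy matrix is made up, columns are checked before rows, and the strict comparison simply follows the wording "fewer NAs than this fraction", so the treatment of a fraction exactly equal to fracNA is an assumption.

```r
# Illustration only: base R, not the cellWise implementation.
fracNA <- 0.5

# Toy data (made up): column a is 60% NA, row 5 is mostly NA.
X <- cbind(a = c(1, NA, NA, NA, 5),
           b = c(1, 2, 3, 4, NA),
           c = c(1, 2, 3, 4, NA),
           d = c(1, 2, 3, 4, 5))

# Step 1: keep columns whose NA fraction is below fracNA.
colNAfrac <- colMeans(is.na(X))
keepCols  <- which(colNAfrac < fracNA)

# Step 2: on the remaining columns, keep rows whose NA fraction is below fracNA.
rowNAfrac <- rowMeans(is.na(X[, keepCols, drop = FALSE]))
keepRows  <- which(rowNAfrac < fracNA)

remX <- X[keepRows, keepCols, drop = FALSE]
```

Here column a and row 5 are dropped, leaving a 4 by 3 matrix.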

numDiscrete

A column that takes on numDiscrete or fewer values will be considered discrete and not retained in the cleaned data. Defaults to \(3\).
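
A minimal base R sketch of this rule is given below. It is not the package's code: the column names and values are invented, and ignoring NAs when counting distinct values is an assumption.

```r
# Illustration only: base R, not the cellWise implementation.
numDiscrete <- 3

# Toy data (made up): two columns with few distinct values, one continuous.
X <- cbind(binary = c(0, 1, 0, 1, 1, 0),
           grade  = c(1, 2, 3, 1, 2, 3),
           cont   = c(0.3, 1.7, 2.2, 3.9, 4.1, 5.6))

# Count distinct non-missing values per column (NA handling is an assumption).
nDistinct <- apply(X, 2, function(col) length(unique(col[!is.na(col)])))

# A column with numDiscrete or fewer distinct values is flagged as discrete.
isDiscrete    <- nDistinct <= numDiscrete
namesDiscrete <- names(which(isDiscrete))
```

With the default of 3, both binary and grade are flagged, while cont is kept.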

precScale

Only consider columns whose scale is larger than precScale. Here scale is measured by the median absolute deviation. Defaults to \(1e-12\).
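
The scale check can be illustrated with base R's mad(). This is a sketch under assumptions, not the package's code: mad() applies its default consistency constant of 1.4826, which may differ from what checkDataSet does internally, and the toy columns are made up.

```r
# Illustration only: base R, not the cellWise implementation.
precScale <- 1e-12

# Toy data (made up).
X <- cbind(constant = rep(7, 6),
           nearly   = c(2, 2, 2, 2, 2, 9),
           varying  = c(1.2, 3.4, 2.2, 5.1, 4.8, 0.7))

# Median absolute deviation of each column.
colScale <- apply(X, 2, mad, na.rm = TRUE)

# Columns whose scale does not exceed precScale are flagged as zero scale.
zeroScale      <- colScale <= precScale
namesZeroScale <- names(which(zeroScale))
```

Note that the column nearly is flagged even though it is not constant: when more than half of the values in a column are identical, its median absolute deviation is exactly zero.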

silent

If TRUE, the function's progress messages are not printed. Defaults to FALSE.

cleanNAfirst

If "columns", first columns then rows are checked for NAs. If "rows", first rows then columns are checked for NAs. "automatic" checks columns first if \(d \geq 5n\) and rows first otherwise. Defaults to "automatic".
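
The "automatic" rule can be written out as a small helper. The function resolveCleanNAfirst below is hypothetical and not part of cellWise; it only restates the rule given above.

```r
# Hypothetical helper, not part of cellWise: restates the documented rule.
resolveCleanNAfirst <- function(n, d, cleanNAfirst = "automatic") {
  cleanNAfirst <- match.arg(cleanNAfirst, c("automatic", "columns", "rows"))
  if (cleanNAfirst == "automatic") {
    # Columns are checked first only when d is at least 5 times n.
    if (d >= 5 * n) "columns" else "rows"
  } else {
    cleanNAfirst
  }
}

orderTall <- resolveCleanNAfirst(n = 100, d = 10)   # 10 < 500, so rows first
orderWide <- resolveCleanNAfirst(n = 20, d = 100)   # 100 >= 100, so columns first
```

For a typical dataset with many more rows than columns, "automatic" therefore checks rows first.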

Author

Rousseeuw P.J., Van den Bossche W.

References

Rousseeuw, P.J., Van den Bossche, W. (2018). Detecting Deviating Data Cells. Technometrics, 60(2), 135-145.

See Also

DDC, MacroPCA, transfo, wrap

Examples

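library(cellWise)  # provides checkDataSet()
library(MASS)      # provides mvrnorm()
set.seed(12345)
n <- 100; d <- 10
# Covariance matrix with unit variances and correlations of 0.9
A <- matrix(0.9, d, d); diag(A) <- 1
x <- mvrnorm(n, rep(0, d), A)
# Set 100 randomly chosen cells to NA
x[sample(1:(n * d), 100, FALSE)] <- NA
# Prepend a column of case numbers
x <- cbind(1:n, x)
checkedx <- checkDataSet(x)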

# For more examples, we refer to the vignette:
if (FALSE) {
vignette("DDC_examples")
}
