delete_MAR_censoring: Create MAR values using a censoring mechanism

Description

Create missing at random (MAR) values in a data frame or a matrix using a censoring mechanism.

Usage

delete_MAR_censoring(
  ds,
  p,
  cols_mis,
  cols_ctrl,
  where = "lower",
  sorting = TRUE,
  miss_cols,
  ctrl_cols
)

Arguments

ds

A data frame or matrix in which missing values will be created.

p

A numeric vector of length one or of the same length as cols_mis; the probability that a value is missing.

cols_mis

A vector of column names or indices of columns in which missing values will be created.

cols_ctrl

A vector of column names or indices of the columns that control the creation of missing values in cols_mis. Must be of the same length as cols_mis.

where

Controls where missing values are created; one of "lower", "upper" or "both" (see details).

sorting

Logical; if TRUE, the rows to censor are selected by sorting cols_ctrl; if FALSE, a quantile is used as a threshold (see details).

miss_cols

Deprecated, use cols_mis instead.

ctrl_cols

Deprecated, use cols_ctrl instead.

Value

An object of the same class as ds with missing values.

Details

This function creates missing at random (MAR) values in the columns specified by the argument cols_mis. The probability of a value being missing is controlled by p. If p is a single number, the overall probability of a missing value is p in every column of cols_mis. (Internally, p is replicated to a vector of the same length as cols_mis, so every p[i] in the following paragraphs equals the given single number p.) Otherwise, p must have the same length as cols_mis, and the overall probability of a missing value is p[i] in the column cols_mis[i]. The positions of the missing values in cols_mis[i] are controlled by cols_ctrl[i]. The following procedure is applied to each pair of cols_ctrl[i] and cols_mis[i] to determine the positions of the missing values:
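The pairing of cols_mis[i], cols_ctrl[i] and p[i] can be pictured with a minimal base R sketch. It is not the package's code: the column names X1, X2, C1 and C2 are made up for illustration, and the row selection uses the default lower-tail sorting rule.

```r
# Hypothetical data: X1 is censored based on C1, X2 based on C2.
ds <- data.frame(X1 = 1:10, X2 = 11:20, C1 = 10:1, C2 = 1:10)
cols_mis <- c("X1", "X2")
cols_ctrl <- c("C1", "C2")
p <- c(0.2, 0.5)

# A single p would be recycled to one value per column in cols_mis.
if (length(p) == 1L) p <- rep(p, length(cols_mis))

for (i in seq_along(cols_mis)) {
  n_mis <- round(nrow(ds) * p[i])                     # rows to censor for pair i
  rows <- order(ds[[cols_ctrl[i]]])[seq_len(n_mis)]   # smallest control values
  ds[rows, cols_mis[i]] <- NA
}

colSums(is.na(ds[cols_mis]))  # number of NAs per censored column
```

Each column in cols_mis receives its own share of missing values, and the rows are chosen only from its paired control column.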

If sorting = TRUE (the default), the rows are ordered by the values in cols_ctrl[i]. With where = "lower" (the default), the round(nrow(ds) * p[i]) rows with the smallest values are selected, and missing values are created in cols_mis[i] in these rows. This effectively censors cols_mis[i] in the proportion p[i] of rows that have the smallest values in cols_ctrl[i].
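As a rough illustration of this rule, the following base R sketch reproduces the selection by hand for a single pair of columns. It only mirrors the description above and should not be read as the package's internal implementation.

```r
ds <- data.frame(X = 1:20, Y = 101:120)
p <- 0.2

n_mis <- round(nrow(ds) * p)          # number of rows to censor
rows <- order(ds$Y)[seq_len(n_mis)]   # rows with the smallest values in Y

ds_mis <- ds
ds_mis$X[rows] <- NA                  # censor X where Y is smallest

sum(is.na(ds_mis$X))                  # equals n_mis by construction
```

Because the rows are picked by rank, the number of missing values always equals round(nrow(ds) * p), regardless of ties in the control column.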

If where = "upper", the rows with the highest values are selected instead of the rows with the smallest values. For where = "both", one half of the round(nrow(ds) * p[i]) rows with missing values are the rows with the smallest values and the other half are the rows with the highest values. So the censored rows are divided between the highest and the smallest values of cols_ctrl[i].
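The two alternatives can be sketched in base R in the same spirit. This is an illustration only; in particular, the way an odd number of rows is split between the two tails (floor for the lower tail here) is an assumption and may differ from what the package does.

```r
ds <- data.frame(X = 1:20, Y = 101:120)
n_mis <- round(nrow(ds) * 0.2)
ord <- order(ds$Y)                 # row indices, from smallest to largest Y

# where = "upper": the n_mis rows with the highest control values
rows_upper <- tail(ord, n_mis)

# where = "both": half from each tail (split of odd counts is an assumption)
n_low <- floor(n_mis / 2)
n_high <- n_mis - n_low
rows_both <- c(head(ord, n_low), tail(ord, n_high))
```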

If sorting = FALSE, the rows of ds will not be sorted. Instead, a quantile will be calculated (using quantile). If where = "lower", the quantile(ds[, cols_ctrl[i]], p[i]) will be calculated and all rows with values in ds[, cols_ctrl[i]] below this quantile will have missing values in cols_mis[i]. For where = "upper", the quantile(ds[, cols_ctrl[i]], 1 - p[i]) will be calculated and all rows with values above this quantile will have missing values. For where = "both", the quantile(ds[, cols_ctrl[i]], p[i] / 2) and quantile(ds[, cols_ctrl[i]], 1 - p[i] / 2) will be calculated. All rows with values in cols_ctrl[i] below the first quantile or above the second quantile will have missing values in cols_mis[i].
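The threshold rule can be sketched as follows in base R. The sketch uses quantile() with its default settings and strict comparisons, which matches the wording "below" and "above" in the text; it is an illustration rather than the package's code.

```r
ds <- data.frame(X = 1:20, Y = 101:120)
p <- 0.2
y <- ds$Y

# Logical vectors marking the rows that would receive NA in cols_mis[i]
lower <- y < quantile(y, p)
upper <- y > quantile(y, 1 - p)
both <- y < quantile(y, p / 2) | y > quantile(y, 1 - p / 2)

c(lower = sum(lower), upper = sum(upper), both = sum(both))
```

With distinct control values, as here, each variant marks about nrow(ds) * p rows. The count is not guaranteed, because it depends on how many values fall strictly beyond the threshold.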

The option sorting = TRUE always creates exactly round(nrow(ds) * p[i]) missing values in cols_mis[i]. For sorting = FALSE, the number of missing values is normally close to nrow(ds) * p[i]. However, if cols_ctrl contains many duplicates, sorting = FALSE can be problematic: only values strictly below quantile(ds[, cols_ctrl[i]], p[i]) are set to NA, so values tied with the threshold are not deleted (see examples). Therefore, sorting = TRUE is recommended in most cases.

References

Santos, M. S., Pereira, R. C., Costa, A. F., Soares, J. P., Santos, J., & Abreu, P. H. (2019). Generating Synthetic Missing Data: A Review by Missing Mechanism. IEEE Access, 7, 11651-11667

Examples


# NOT RUN {
ds <- data.frame(X = 1:20, Y = 101:120)
delete_MAR_censoring(ds, 0.2, "X", "Y")
# many duplicated values can be problematic for sorting = FALSE:
ds_many_dup <- data.frame(X = 1:20, Y = c(rep(0, 10), rep(1, 10)))
delete_MAR_censoring(ds_many_dup, 0.2, "X", "Y") # 4 NAs as expected
quantile(ds_many_dup$Y, 0.2) # 0
# No value is BELOW 0 in ds_many_dup$Y, so no missing values will be created:
delete_MAR_censoring(ds_many_dup, 0.2, "X", "Y", sorting = FALSE) # No NA!
# }
