delete_MAR_censoring: Create MAR values using a censoring mechanism

Description

Create missing at random (MAR) values in a data frame or a matrix using a censoring mechanism.

Usage

delete_MAR_censoring(
  ds,
  p,
  cols_mis,
  cols_ctrl,
  n_mis_stochastic = FALSE,
  where = "lower",
  sorting = TRUE,
  miss_cols,
  ctrl_cols
)

Value

An object of the same class as ds with missing values.

Arguments

ds: A data frame or matrix in which missing values will be created.
p: A numeric vector of length one or of the same length as cols_mis; the probability that a value is missing.
cols_mis: A vector of column names or indices of columns in which missing values will be created.
cols_ctrl: A vector of column names or indices of the columns that control the creation of missing values in cols_mis. Must be of the same length as cols_mis.
n_mis_stochastic: Logical, should the number of missing values be stochastic? If n_mis_stochastic = TRUE, the number of missing values for a column with missing values cols_mis[i] is a random variable with expected value nrow(ds) * p[i]. If n_mis_stochastic = FALSE, the number of missing values will be deterministic. Normally, the number of missing values for a column with missing values cols_mis[i] is round(nrow(ds) * p[i]). Possible deviations from this value, if any exist, are documented in Details.
where: Controls where missing values are created; one of "lower", "upper" or "both" (see Details).
sorting: Logical; should sorting be used (TRUE) or a quantile as a threshold (FALSE)? See Details.
miss_cols: Deprecated, use cols_mis instead.
ctrl_cols: Deprecated, use cols_ctrl instead.

Details

This function creates missing at random (MAR) values in the columns specified by the argument cols_mis. The probability for missing values is controlled by p. If p is a single number, then the overall probability for a value to be missing will be p in all columns of cols_mis. (Internally p will be replicated to a vector of the same length as cols_mis. So, all p[i] in the following sections will be equal to the given single number p.) Otherwise, p must be of the same length as cols_mis. In this case, the overall probability for a value to be missing will be p[i] in the column cols_mis[i]. The position of the missing values in cols_mis[i] is controlled by cols_ctrl[i]. The following procedure is applied for each pair of cols_ctrl[i] and cols_mis[i] to determine the positions of missing values:

The default behavior (sorting = TRUE) of this function is to first sort the rows by the column cols_ctrl[i]. Missing values in cols_mis[i] are then created in the round(nrow(ds) * p[i]) rows with the smallest values of cols_ctrl[i]. In other words, cols_mis[i] is censored in approximately the proportion p[i] of rows with the smallest values in cols_ctrl[i]; hence the name of the function.
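The sorting step for where = "lower" can be illustrated with a minimal base-R sketch. The helper censor_lower_sketch is hypothetical and is not part of the package; in particular, ties are broken here by row position via order(), which is an assumption and may differ from the package's behavior.

```r
# Hypothetical helper (not the package's implementation): censor the
# round(nrow(ds) * p) rows with the smallest values in the control column.
censor_lower_sketch <- function(ds, p, col_mis, col_ctrl) {
  n_mis <- round(nrow(ds) * p)
  # order() returns the row indices sorted by the control column
  rows_by_ctrl <- order(ds[, col_ctrl])
  # the first n_mis indices belong to the smallest control values
  na_rows <- head(rows_by_ctrl, n_mis)
  ds[na_rows, col_mis] <- NA
  ds
}

ds <- data.frame(X = 1:20, Y = 101:120)
res <- censor_lower_sketch(ds, 0.2, "X", "Y")
which(is.na(res$X)) # 1 2 3 4
```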

If where = "upper", the rows with the highest values are selected instead of the rows with the smallest values. For where = "both", one half of the round(nrow(ds) * p[i]) rows with missing values will be the rows with the smallest values and the other half will be the rows with the highest values, so the censored rows are divided between the smallest and highest values of cols_ctrl[i]. If round(nrow(ds) * p[i]) is odd, one more value is set to NA among the smallest values.
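How the rows might be divided for where = "both" can be sketched as follows. The helper rows_both_sketch is hypothetical and only mirrors the description above (the extra row for an odd count goes to the smallest values); it is not taken from the package source.

```r
# Hypothetical helper: indices of the rows that would be censored for
# where = "both", given a control vector ctrl and a probability p.
rows_both_sketch <- function(ctrl, p) {
  n_mis <- round(length(ctrl) * p)
  n_lower <- ceiling(n_mis / 2) # odd counts: one more among the smallest
  n_upper <- n_mis - n_lower
  ord <- order(ctrl)
  c(head(ord, n_lower), tail(ord, n_upper))
}

# 20 rows and p = 0.25 give 5 missing values: 3 lowest and 2 highest
rows_both_sketch(101:120, 0.25) # 1 2 3 19 20
```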

If n_mis_stochastic = TRUE and sorting = TRUE, the procedure is slightly altered. In this case, the floor(nrow(ds) * p[i]) rows with the smallest values (where = "lower") are first set to NA. If nrow(ds) * p[i] > floor(nrow(ds) * p[i]), the row with the next greater value will be set to NA with a probability chosen so that the expected number of missing values is nrow(ds) * p[i]. For where = "upper", this "random" missing value will be the next smallest. For where = "both", this "random" missing value will be the next greatest of the smallest values.
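For the expected number of missing values to equal nrow(ds) * p[i], the extra row has to be set to NA with probability nrow(ds) * p[i] - floor(nrow(ds) * p[i]). The sketch below simulates only this count; n_mis_stochastic_sketch is a hypothetical helper, and the way the package actually draws the random value is an assumption.

```r
# Hypothetical helper: number of missing values when the count is
# stochastic. Always floor(n * p), plus one more with probability equal
# to the fractional part of n * p.
n_mis_stochastic_sketch <- function(n, p) {
  n_base <- floor(n * p)
  frac <- n * p - n_base
  n_base + (runif(1) < frac)
}

set.seed(1)
counts <- replicate(10000, n_mis_stochastic_sketch(10, 0.25))
table(counts) # only 2 and 3 occur
mean(counts)  # close to 10 * 0.25 = 2.5
```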

If sorting = FALSE, the rows of ds will not be sorted. Instead, a quantile will be calculated (using quantile). If where = "lower", the quantile(ds[, cols_ctrl[i]], p[i]) will be calculated and all rows with values in ds[, cols_ctrl[i]] below this quantile will have missing values in cols_mis[i]. For where = "upper", the quantile(ds[, cols_ctrl[i]], 1 - p[i]) will be calculated and all rows with values above this quantile will have missing values. For where = "both", the quantile(ds[, cols_ctrl[i]], p[i] / 2) and quantile(ds[, cols_ctrl[i]], 1 - p[i] / 2) will be calculated. All rows with values in cols_ctrl[i] below the first quantile or above the second quantile will have missing values in cols_mis[i].
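The quantile thresholds described above can be sketched in base R. The helper rows_quantile_sketch is hypothetical; it assumes the default quantile type and strict comparisons ("below" and "above"), as the description suggests.

```r
# Hypothetical helper: indices of the rows that would be censored when a
# quantile of the control vector is used as the threshold.
rows_quantile_sketch <- function(ctrl, p, where = "lower") {
  q <- function(prob) quantile(ctrl, prob, names = FALSE)
  switch(where,
    lower = which(ctrl < q(p)),
    upper = which(ctrl > q(1 - p)),
    both  = which(ctrl < q(p / 2) | ctrl > q(1 - p / 2))
  )
}

ctrl <- 101:120
rows_quantile_sketch(ctrl, 0.2, "lower") # 1 2 3 4
rows_quantile_sketch(ctrl, 0.2, "upper") # 17 18 19 20
rows_quantile_sketch(ctrl, 0.2, "both")  # 1 2 19 20
```

With many duplicated control values the threshold can coincide with the smallest value, in which case no row lies strictly below it and no row is selected.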

For sorting = FALSE only n_mis_stochastic = FALSE is implemented at the moment.

The option sorting = TRUE with n_mis_stochastic = FALSE will always create exactly round(nrow(ds) * p[i]) missing values in cols_mis[i]. With n_mis_stochastic = TRUE, sorting will result in floor(nrow(ds) * p[i]) or ceiling(nrow(ds) * p[i]) missing values in cols_mis[i]. For sorting = FALSE, the number of missing values will normally be close to nrow(ds) * p[i]. However, for cols_ctrl with many duplicates the choice sorting = FALSE can be problematic, because quantile(ds[, cols_ctrl[i]], p[i]) may coincide with a duplicated value, so that too few (or no) values lie strictly below this threshold and are set to NA (see Examples). So, in most cases sorting = TRUE is recommended.

References

Santos, M. S., Pereira, R. C., Costa, A. F., Soares, J. P., Santos, J., & Abreu, P. H. (2019). Generating Synthetic Missing Data: A Review by Missing Mechanism. IEEE Access, 7, 11651-11667

Examples


ds <- data.frame(X = 1:20, Y = 101:120)
delete_MAR_censoring(ds, 0.2, "X", "Y")
# many duplicated values can be problematic for sorting = FALSE:
ds_many_dup <- data.frame(X = 1:20, Y = c(rep(0, 10), rep(1, 10)))
delete_MAR_censoring(ds_many_dup, 0.2, "X", "Y") # 4 NAs as expected
quantile(ds_many_dup$Y, 0.2) # 0
# No value is BELOW 0 in ds_many_dup$Y, so no missing values will be created:
delete_MAR_censoring(ds_many_dup, 0.2, "X", "Y", sorting = FALSE) # No NA!
