delete_MAR_1_to_x: Create MAR values using MAR1:x

Description

Create missing at random (MAR) values using MAR1:x in a data frame or a matrix

Usage

delete_MAR_1_to_x(
  ds,
  p,
  cols_mis,
  cols_ctrl,
  x,
  cutoff_fun = median,
  prop = 0.5,
  use_lpSolve = TRUE,
  ordered_as_unordered = FALSE,
  stochastic = FALSE,
  add_realized_x = FALSE,
  ...,
  miss_cols,
  ctrl_cols
)

Arguments

ds

A data frame or matrix in which missing values will be created.

p

A numeric vector of length one or of the same length as cols_mis; the probability that a value is missing.

cols_mis

A vector of column names or indices of columns in which missing values will be created.

cols_ctrl

A vector of column names or indices of columns that control the creation of missing values in cols_mis. Must be of the same length as cols_mis.

x

Numeric of length one (0 < x < Inf); the odds are 1 to x for the probability of a value to be missing in group 1 against the probability of a value to be missing in group 2 (see details).

cutoff_fun

Function that calculates the cutoff values in the cols_ctrl.

prop

Numeric of length one; (minimum) proportion of rows in group 1 (only used for unordered factors).

use_lpSolve

Logical; should lpSolve be used for the determination of groups, if cols_ctrl[i] is an unordered factor.

ordered_as_unordered

Logical; should ordered factors be treated as unordered factors.

stochastic

Logical; see details.

add_realized_x

Logical; if TRUE the realized odds for cols_mis will be returned (as attribute).

...

Further arguments passed to cutoff_fun.

miss_cols

Deprecated, use cols_mis instead.

ctrl_cols

Deprecated, use cols_ctrl instead.

Value

An object of the same class as ds with missing values.

Treatment of factors

If ds[, cols_ctrl[i]] is an unordered factor, then the concept of a cutoff value is not meaningful and cannot be applied. Instead, a combination of the levels of the unordered factor is searched for that

guarantees that at least a proportion prop of the rows is in group 1
minimizes the difference between prop and the proportion of rows in group 1.

This can be seen as a binary search problem, which is solved by the solver from the package lpSolve if use_lpSolve = TRUE. If use_lpSolve = FALSE, a very simple heuristic is applied instead. The heuristic only guarantees that at least a proportion prop of the rows is in group 1. Setting use_lpSolve = FALSE is not recommended and should only be considered if the lpSolve solver fails.

If ordered_as_unordered = TRUE, then ordered factors will be treated like unordered factors and the same binary search problem will be solved for both types of factors. If ordered_as_unordered = FALSE (the default), then ordered factors will be grouped via cutoff_fun as described in Details.
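The search over level combinations can be illustrated with a small base R sketch. It is not the package's implementation: it replaces the lpSolve solver with an exhaustive enumeration of all level subsets, which is only practical for factors with few levels. The helper name find_group1_levels and its return format are hypothetical, and ties between equally good subsets are broken arbitrarily (the first one found wins).

```r
# Hypothetical helper (not part of the package): brute-force search for the
# set of factor levels that forms group 1.
find_group1_levels <- function(f, prop = 0.5) {
  counts <- as.vector(table(f))          # number of rows per level
  lv <- levels(f)
  n <- length(f)
  best <- NULL
  best_count <- Inf
  for (mask in seq_len(2^length(lv) - 1)) {   # every non-empty subset
    # Decode the subset: bit j of mask says whether level j is picked.
    pick <- (mask %/% 2^(seq_along(lv) - 1)) %% 2 == 1
    count <- sum(counts[pick])
    # Keep the smallest subset that still covers at least prop of the rows.
    if (count >= prop * n && count < best_count) {
      best <- lv[pick]
      best_count <- count
    }
  }
  list(levels = best, share = best_count / n)
}

f <- factor(rep(c("a", "b", "c", "d"), times = c(1, 4, 2, 3)))
res <- find_group1_levels(f, prop = 0.5)
res$levels  # levels assigned to group 1
res$share   # proportion of rows in group 1
```

In this toy factor, several subsets reach exactly half of the rows, so the constraint is met with zero slack. With less convenient level frequencies, the share of group 1 ends up somewhat above prop.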

Details

This function creates missing at random (MAR) values in the columns specified by the argument cols_mis. The probability for missing values is controlled by p. If p is a single number, then the overall probability for a value to be missing will be p in all columns of cols_mis. (Internally p will be replicated to a vector of the same length as cols_mis. So, all p[i] in the following sections will be equal to the given single number p.) Otherwise, p must be of the same length as cols_mis. In this case, the overall probability for a value to be missing will be p[i] in the column cols_mis[i]. The position of the missing values in cols_mis[i] is controlled by cols_ctrl[i]. The following procedure is applied for each pair of cols_ctrl[i] and cols_mis[i] to determine the positions of missing values:

First, the rows of ds are divided into two groups. To this end, cutoff_fun calculates a cutoff value for cols_ctrl[i] (via cutoff_fun(ds[, cols_ctrl[i]], ...)). Group 1 consists of the rows whose values in cols_ctrl[i] are below the calculated cutoff value. If group 1 defined this way is empty, the rows whose value equals the cutoff value are added to it (otherwise, these rows belong to group 2). Group 2 consists of the remaining rows. The probabilities for the rows in the two groups are then set so that the odds of a missing value in cols_mis[i] are 1:x for rows in group 1 compared to rows in group 2. That is, the probability of a missing value in group 1 divided by the probability of a missing value in group 2 equals 1/x. For example, for two equally sized groups, the number of NAs in group 1 divided by the number of NAs in group 2 should ideally equal 1/x. However, some restrictions can lead to deviations from the odds 1:x (see below).
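The grouping and the resulting probabilities can be sketched in a few lines of base R. This is an illustration under stated assumptions, not the package's code: the helper name group_probs is hypothetical, the control column is assumed to be numeric without missing values, and the probabilities follow from requiring n1 * p1 + n2 * p2 = n * p together with p2 = x * p1.

```r
# Hypothetical helper (not part of the package): per-group probabilities
# of a missing value for a numeric control column.
group_probs <- function(ctrl, p, x, cutoff_fun = median, ...) {
  cutoff <- cutoff_fun(ctrl, ...)
  in_g1 <- ctrl < cutoff
  # If no value lies below the cutoff, rows equal to it join group 1.
  if (!any(in_g1)) in_g1 <- ctrl <= cutoff
  n1 <- sum(in_g1)
  n2 <- length(ctrl) - n1
  # Solve n1 * p1 + n2 * p2 = n * p subject to p2 = x * p1.
  p1 <- p * length(ctrl) / (n1 + x * n2)
  list(in_g1 = in_g1, p1 = p1, p2 = x * p1)
}

res <- group_probs(101:120, p = 0.2, x = 3)
res$p1  # probability of NA in group 1: 0.1
res$p2  # probability of NA in group 2: 0.3
```

With two groups of ten rows each, these probabilities average to the requested overall p = 0.2 while keeping the ratio p1 / p2 at 1/3.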

If stochastic = FALSE (the default), then exactly round(nrow(ds) * p[i]) values will be set to NA in column cols_mis[i]. To achieve this, the realized odds may differ from 1:x. The numbers of observations deleted in group 1 and in group 2 are chosen to minimize the absolute difference between the realized odds and 1:x. Furthermore, if round(nrow(ds) * p[i]) == 0, then no missing values will be created in cols_mis[i]. If stochastic = TRUE, the number of missing values in cols_mis[i] is a random variable. This random variable is the sum of two binomially distributed variables (one for group 1 and one for group 2). If p is not too high and x is not too high or too low (see below), then the odds 1:x will be met in expectation. In a single dataset, however, the realized odds will differ from 1:x most of the time.
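The two modes can be contrasted with a short base R sketch. The helper split_deletions is hypothetical, and the distance measure it uses (absolute difference between the realized probability ratio and 1/x) is an assumption about how "closest to 1:x" is defined; the package may measure the deviation differently.

```r
# Hypothetical helper (not part of the package): split k = round(n * p)
# deletions between the two groups so that the realized odds are as close
# as possible to 1:x.
split_deletions <- function(n1, n2, p, x) {
  k <- round((n1 + n2) * p)
  if (k == 0) return(c(group1 = 0, group2 = 0))
  k1 <- max(0, k - n2):min(k, n1)         # feasible counts for group 1
  k2 <- k - k1
  realized_odds <- (k1 / n1) / (k2 / n2)  # Inf when k2 == 0
  best <- which.min(abs(realized_odds - 1 / x))
  c(group1 = k1[best], group2 = k2[best])
}

split_deletions(10, 10, p = 0.4, x = 3)  # 2 in group 1, 6 in group 2
split_deletions(10, 10, p = 0.4, x = 7)  # 1 in group 1, 7 in group 2

# Stochastic counterpart: the number of NAs is the sum of two binomial
# draws, here with p1 = 0.2 and p2 = 0.6 (p = 0.4, x = 3, equal groups).
set.seed(1)
n_na <- rbinom(1, 10, 0.2) + rbinom(1, 10, 0.6)
```

In the deterministic case the total is always 8, and only a handful of splits exist, which is why the realized odds can miss the requested 1:x in small datasets. In the stochastic case the total itself varies from run to run.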

If p is high and x is too high or too low, it is possible that the odds 1:x and the proportion of missing values p cannot be realized together. For example, if p[i] = 0.9, then a maximum of x = 1.25 is possible (assuming that exactly 50 % of the values are below and 50 % of the values are above the cutoff value in cols_ctrl[i]). If a combination of p and x that cannot be realized together is given to delete_MAR_1_to_x, then a warning will be generated and x will be adjusted in such a way that p can be realized as given to the function.
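The bound in this example follows from requiring that neither group probability exceeds 1. A minimal sketch of that arithmetic, assuming the per-group probabilities satisfy n1 * p1 + n2 * p2 = n * p and p2 = x * p1; the helper name feasible_x is hypothetical and is not how the package reports or adjusts x:

```r
# Hypothetical helper (not part of the package): range of x for which both
# p1 <= 1 and p2 <= 1 hold, given the group sizes and the overall p.
feasible_x <- function(n1, n2, p) {
  n_na <- (n1 + n2) * p                  # expected number of missing values
  lower <- max(0, (n_na - n1) / n2)      # from p1 <= 1
  upper <- if (n_na > n2) n1 / (n_na - n2) else Inf  # from p2 <= 1
  c(lower = lower, upper = upper)
}

feasible_x(10, 10, p = 0.9)  # lower = 0.8, upper = 1.25
feasible_x(10, 10, p = 0.2)  # lower = 0, upper = Inf (no restriction)
```

For p = 0.9 and equally sized groups, group 2 can contribute at most 10 of the 18 missing values, so group 1 must supply at least 8, which caps x at 10 / 8 = 1.25.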

The argument add_realized_x controls whether the x of the realized odds is added to the return value. If add_realized_x = TRUE, then the realized x values for all cols_mis will be added as an attribute to the returned object. For stochastic = TRUE, these realized x values will differ from the given x most of the time and will change if the function is rerun without setting a seed. For stochastic = FALSE, the realized odds can also differ from the given odds (see above). However, they will be constant over multiple runs.

References

Santos, M. S., Pereira, R. C., Costa, A. F., Soares, J. P., Santos, J., & Abreu, P. H. (2019). Generating Synthetic Missing Data: A Review by Missing Mechanism. IEEE Access, 7, 11651-11667

Examples

library(missMethods)  # package that provides delete_MAR_1_to_x

ds <- data.frame(X = 1:20, Y = 101:120)
delete_MAR_1_to_x(ds, 0.2, "X", "Y", 3)

# Beware of small datasets and stochastic = FALSE:
attr(delete_MAR_1_to_x(ds, 0.4, "X", "Y", 3, add_realized_x = TRUE), "realized_x")
attr(delete_MAR_1_to_x(ds, 0.4, "X", "Y", 4, add_realized_x = TRUE), "realized_x")
attr(delete_MAR_1_to_x(ds, 0.4, "X", "Y", 5, add_realized_x = TRUE), "realized_x")
attr(delete_MAR_1_to_x(ds, 0.4, "X", "Y", 7, add_realized_x = TRUE), "realized_x")
# p = 0.4 and 20 values -> 8 missing values, possible combinations:
# either 6 above and 2 below the cutoff (x = 3) or
# 7 above and 1 below the cutoff (x = 7)

# Combination of p and x that is too high (a warning is generated and x is adjusted):
delete_MAR_1_to_x(ds, 0.9, "X", "Y", 3)
delete_MAR_1_to_x(ds, 0.9, "X", "Y", 3, stochastic = TRUE)
