missMethods (version 0.2.0)

delete_MAR_one_group: Create MAR values by deleting values in one of two groups

Description

Create missing at random (MAR) values by deleting values in one of two groups in a data frame or a matrix

Usage

delete_MAR_one_group(
  ds,
  p,
  cols_mis,
  cols_ctrl,
  cutoff_fun = median,
  prop = 0.5,
  use_lpSolve = TRUE,
  ordered_as_unordered = FALSE,
  stochastic = FALSE,
  ...,
  miss_cols,
  ctrl_cols
)

Arguments

ds

A data frame or matrix in which missing values will be created.

p

A numeric vector of length one or of the same length as cols_mis; the probability that a value is missing.

cols_mis

A vector of column names or indices of columns in which missing values will be created.

cols_ctrl

A vector of column names or indices of columns that control the creation of missing values in cols_mis. Must be of the same length as cols_mis.

cutoff_fun

Function that calculates the cutoff values in the cols_ctrl.

prop

Numeric of length one; (minimum) proportion of rows in group 1 (only used for unordered factors).

use_lpSolve

Logical; should lpSolve be used to determine the groups if cols_ctrl[i] is an unordered factor?

ordered_as_unordered

Logical; should ordered factors be treated as unordered factors.

stochastic

Logical; see details.

...

Further arguments passed to cutoff_fun.

miss_cols

Deprecated, use cols_mis instead.

ctrl_cols

Deprecated, use cols_ctrl instead.

Value

An object of the same class as ds with missing values.

Treatment of factors

If ds[, cols_ctrl[i]] is an unordered factor, then the concept of a cutoff value is not meaningful and cannot be applied. Instead, the function searches for a combination of the levels of the unordered factor that

  • guarantees that at least a proportion of prop rows are in group 1

  • minimizes the difference between prop and the proportion of rows in group 1.

This can be seen as a binary search problem, which is solved by the solver from the package lpSolve if use_lpSolve = TRUE. If use_lpSolve = FALSE, a very simple heuristic is applied. The heuristic only guarantees that at least a proportion of prop rows are in group 1; it does not minimize the difference to prop. The choice use_lpSolve = FALSE is not recommended and should only be considered if the solver of lpSolve fails.
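The two conditions above can be illustrated with a small brute-force sketch in base R. This is not the lpSolve formulation used by the package; the helper find_level_group is hypothetical and simply enumerates every non-empty combination of levels, which is only feasible for factors with few levels.

```r
# Brute-force illustration (NOT the lpSolve approach of missMethods):
# find the combination of factor levels that forms group 1.
# `find_level_group` is a hypothetical helper written for this sketch.
find_level_group <- function(f, prop = 0.5) {
  f <- droplevels(as.factor(f))
  lev <- levels(f)
  counts <- as.integer(table(f))   # number of rows per level
  target <- prop * length(f)       # minimum number of rows in group 1
  best <- NULL
  best_n <- Inf
  # Enumerate every non-empty subset of levels via the bits of `mask`
  for (mask in seq_len(2^length(lev) - 1)) {
    pick <- (mask %/% 2^(seq_along(lev) - 1)) %% 2 == 1
    n_g1 <- sum(counts[pick])
    # Constraint: at least `prop` of the rows are in group 1
    # Objective: group 1 as close to `prop` as possible
    if (n_g1 >= target && n_g1 < best_n) {
      best <- lev[pick]
      best_n <- n_g1
    }
  }
  list(levels_g1 = best, prop_g1 = best_n / length(f))
}

f <- factor(c(rep("a", 3), rep("b", 4), rep("c", 2), "d"))
res <- find_level_group(f, prop = 0.5)
res$levels_g1  # levels whose rows form group 1
res$prop_g1    # proportion of rows in group 1
```

In this toy factor, several combinations of levels reach the requested proportion exactly; the sketch keeps the first one it finds.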

If ordered_as_unordered = TRUE, then ordered factors will be treated like unordered factors and the same binary search problem will be solved for both types of factors. If ordered_as_unordered = FALSE (the default), then ordered factors will be grouped via cutoff_fun as described in Details.

Details

This function creates missing at random (MAR) values in the columns specified by the argument cols_mis. The probability for missing values is controlled by p. If p is a single number, then the overall probability for a value to be missing will be p in all columns of cols_mis. (Internally p will be replicated to a vector of the same length as cols_mis. So, all p[i] in the following sections will be equal to the given single number p.) Otherwise, p must be of the same length as cols_mis. In this case, the overall probability for a value to be missing will be p[i] in the column cols_mis[i]. The position of the missing values in cols_mis[i] is controlled by cols_ctrl[i]. The following procedure is applied for each pair of cols_ctrl[i] and cols_mis[i] to determine the positions of missing values:

First, the rows of ds are divided into two groups. To do so, cutoff_fun calculates a cutoff value for cols_ctrl[i] (via cutoff_fun(ds[, cols_ctrl[i]], ...)). Group 1 consists of the rows whose values in cols_ctrl[i] are below the calculated cutoff value. If the group 1 defined this way is empty, the rows whose values are equal to the cutoff value are added to this group (otherwise, these rows belong to group 2). Group 2 consists of the remaining rows, which are not part of group 1. One of these two groups is then chosen randomly. In the chosen group, values are deleted in cols_mis[i]. In the other group, no missing values are created in cols_mis[i].
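The grouping step described above can be sketched in base R as follows. The helper split_groups is hypothetical and not exported by missMethods; it covers only numeric control columns and is meant to make the roles of cutoff_fun and ... concrete, not to reproduce the package internals.

```r
# Base-R sketch of the grouping step; `split_groups` is a hypothetical helper,
# not a function of missMethods.
split_groups <- function(x, cutoff_fun = median, ...) {
  cutoff <- cutoff_fun(x, ...)      # further arguments are forwarded via ...
  g1 <- which(x < cutoff)           # group 1: rows strictly below the cutoff
  if (length(g1) == 0) {
    g1 <- which(x <= cutoff)        # fallback: add rows equal to the cutoff
  }
  g2 <- setdiff(seq_along(x), g1)   # group 2: all remaining rows
  list(cutoff = cutoff, g1 = g1, g2 = g2)
}

ds <- data.frame(X = 1:20, Y = 101:120)
groups <- split_groups(ds[, "Y"])   # cutoff is the median of Y

# A different cutoff: the first quartile, with probs passed on through ...
groups_q <- split_groups(ds[, "Y"], cutoff_fun = quantile, probs = 0.25)

# One of the two groups is chosen at random to receive the missing values
set.seed(1)
chosen <- if (sample.int(2, 1) == 1) groups$g1 else groups$g2
```

With the default median, the two groups have equal size for this data set; a lower quantile makes group 1 smaller.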

If stochastic = FALSE (the default), then floor(nrow(ds) * p[i]) or ceiling(nrow(ds) * p[i]) values will be set to NA in column cols_mis[i] (depending on the grouping). If stochastic = TRUE, each value in the group with missing values has a probability of being missing that is chosen so that the proportion of missing values in cols_mis[i] equals p[i] in expectation. The effect of stochastic is further discussed in delete_MCAR.
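The difference between the two settings of stochastic can be illustrated with a minimal base-R sketch. The helper delete_in_group is hypothetical, and its rounding rule and per-value probability are assumptions made for illustration; the exact computation inside missMethods may differ.

```r
# Sketch of deleting values inside the chosen group; `delete_in_group` is a
# hypothetical helper, and the rounding used here is an assumption.
delete_in_group <- function(x, group, p, stochastic = FALSE) {
  n <- length(x)
  if (stochastic) {
    # Per-value probability so that the expected overall proportion is p
    p_group <- min(1, p * n / length(group))
    na_rows <- group[runif(length(group)) < p_group]
  } else {
    # Fixed number of missing values, limited by the size of the group
    n_mis <- min(round(n * p), length(group))
    na_rows <- group[sample.int(length(group), n_mis)]
  }
  x[na_rows] <- NA
  x
}

set.seed(42)
x <- 1:20
group <- 1:10   # rows of the chosen group

fixed <- delete_in_group(x, group, p = 0.2)
random <- delete_in_group(x, group, p = 0.2, stochastic = TRUE)

sum(is.na(fixed))    # fixed count of missing values
sum(is.na(random))   # varies from run to run
```

In both cases, missing values occur only inside the chosen group; the rows outside it stay complete.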

References

Santos, M. S., Pereira, R. C., Costa, A. F., Soares, J. P., Santos, J., & Abreu, P. H. (2019). Generating Synthetic Missing Data: A Review by Missing Mechanism. IEEE Access, 7, 11651-11667

See Also

delete_MNAR_one_group

Other functions to create MAR: delete_MAR_1_to_x(), delete_MAR_censoring(), delete_MAR_rank()

Examples

library(missMethods)

ds <- data.frame(X = 1:20, Y = 101:120)
# Delete 20% of the values in X; Y determines the two groups
delete_MAR_one_group(ds, 0.2, "X", "Y")