Create missing at random (MAR) values using a censoring mechanism in a data frame or a matrix
delete_MAR_censoring(
  ds,
  p,
  cols_mis,
  cols_ctrl,
  where = "lower",
  sorting = TRUE,
  miss_cols,
  ctrl_cols
)
ds: A data frame or matrix in which missing values will be created.
p: A numeric vector of length one or of the same length as cols_mis; the probability that a value is missing.
cols_mis: A vector of column names or indices of the columns in which missing values will be created.
cols_ctrl: A vector of column names or indices of the columns that control the creation of missing values in cols_mis. Must be of the same length as cols_mis.
where: Controls where missing values are created; one of "lower", "upper" or "both" (see details).
sorting: Logical; should sorting be used, or a quantile as a threshold?
miss_cols: Deprecated, use cols_mis instead.
ctrl_cols: Deprecated, use cols_ctrl instead.
An object of the same class as ds with missing values.
This function creates missing at random (MAR) values in the columns specified by the argument cols_mis. The probability for missing values is controlled by p. If p is a single number, then the overall probability for a value to be missing will be p in all columns of cols_mis. (Internally, p will be replicated to a vector of the same length as cols_mis, so all p[i] in the following sections will be equal to the given single number p.) Otherwise, p must be of the same length as cols_mis. In this case, the overall probability for a value to be missing will be p[i] in the column cols_mis[i].
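The replication of p described above can be sketched in base R. This is only an illustration of the documented behavior, not the package's internal code; expand_p is a hypothetical helper name introduced here.

```r
# Hypothetical helper (not part of missMethods): expand a scalar p
# to one probability per column in cols_mis.
expand_p <- function(p, cols_mis) {
  if (length(p) == 1L) {
    p <- rep(p, length(cols_mis))
  } else if (length(p) != length(cols_mis)) {
    stop("p must be of length one or of the same length as cols_mis")
  }
  p
}

expand_p(0.2, c("X", "Z"))          # 0.2 0.2
expand_p(c(0.1, 0.3), c("X", "Z"))  # 0.1 0.3
```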
The position of the missing values in cols_mis[i] is controlled by cols_ctrl[i]. The following procedure is applied for each pair of cols_ctrl[i] and cols_mis[i] to determine the positions of missing values:
If sorting = TRUE (the default), the column cols_ctrl[i] will be sorted. Then, if where = "lower" (the default), the rows with the round(nrow(ds) * p[i]) smallest values will be selected. Missing values will then be created in the column cols_mis[i] in these rows. This effectively censors the proportion p[i] of rows with the smallest values in cols_ctrl[i] in cols_mis[i].
If where = "upper", the rows with the highest values will be selected instead of the rows with the smallest values. For where = "both", one half of the round(nrow(ds) * p[i]) rows with missing values will be the rows with the smallest values and the other half will be the rows with the highest values. So the censored rows are divided between the highest and smallest values of cols_ctrl[i].
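For a single pair of columns, the sorting-based selection can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the missMethods implementation: censor_sorted is a hypothetical name, and the way an odd number of rows is split for where = "both" (any remainder goes to the upper end) is an assumption made here.

```r
# Illustrative sketch only; censor_sorted is a hypothetical helper.
censor_sorted <- function(ds, p, col_mis, col_ctrl,
                          where = c("lower", "upper", "both")) {
  where <- match.arg(where)
  n_mis <- round(nrow(ds) * p)
  # Row indices ordered by increasing value of the control column
  ord <- order(ds[, col_ctrl])
  rows <- switch(where,
    lower = head(ord, n_mis),
    upper = tail(ord, n_mis),
    both = {
      # Assumption: for odd n_mis the extra row is taken from the upper end
      n_low <- floor(n_mis / 2)
      c(head(ord, n_low), tail(ord, n_mis - n_low))
    }
  )
  ds[rows, col_mis] <- NA
  ds
}

ds <- data.frame(X = 1:20, Y = 101:120)
res <- censor_sorted(ds, 0.2, "X", "Y", where = "both")
which(is.na(res$X))  # 1 2 19 20
```

Because the rows are chosen by position in the sorted order, the number of missing values is always round(nrow(ds) * p), regardless of ties in the control column.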
If sorting = FALSE, the rows of ds will not be sorted. Instead, a quantile will be calculated (using quantile). If where = "lower", the quantile(ds[, cols_ctrl[i]], p[i]) will be calculated and all rows with values in ds[, cols_ctrl[i]] below this quantile will have missing values in cols_mis[i]. For where = "upper", the quantile(ds[, cols_ctrl[i]], 1 - p[i]) will be calculated and all rows with values above this quantile will have missing values. For where = "both", the quantile(ds[, cols_ctrl[i]], p[i] / 2) and quantile(ds[, cols_ctrl[i]], 1 - p[i] / 2) will be calculated. All rows with values in cols_ctrl[i] below the first quantile or above the second quantile will have missing values in cols_mis[i].
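The quantile-based variant can be sketched in the same way. Again, this is a base R illustration of the description above rather than the package's code; censor_quantile is a hypothetical name, and strict comparisons are assumed because the text speaks of values "below" and "above" the quantile.

```r
# Illustrative sketch only; censor_quantile is a hypothetical helper.
censor_quantile <- function(ds, p, col_mis, col_ctrl,
                            where = c("lower", "upper", "both")) {
  where <- match.arg(where)
  ctrl <- ds[, col_ctrl]
  # Logical row selector based on quantile thresholds (strict comparisons)
  rows <- switch(where,
    lower = ctrl < quantile(ctrl, p),
    upper = ctrl > quantile(ctrl, 1 - p),
    both  = ctrl < quantile(ctrl, p / 2) | ctrl > quantile(ctrl, 1 - p / 2)
  )
  ds[rows, col_mis] <- NA
  ds
}

ds <- data.frame(X = 1:20, Y = 101:120)
sum(is.na(censor_quantile(ds, 0.2, "X", "Y")$X))  # 4
```

Here the number of missing values depends on how many control values fall strictly beyond the threshold, so it is not fixed in advance.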
The option sorting = TRUE will always create exactly round(nrow(ds) * p[i]) missing values in cols_mis[i]. For sorting = FALSE, the number of missing values will normally be close to nrow(ds) * p[i]. However, for cols_ctrl with many duplicates, sorting = FALSE can be problematic, because the threshold quantile(ds[, cols_ctrl[i]], p[i]) is calculated and NA is set only in rows whose control value lies below it (see examples). So, in most cases sorting = TRUE is recommended.
Santos, M. S., Pereira, R. C., Costa, A. F., Soares, J. P., Santos, J., & Abreu, P. H. (2019). Generating Synthetic Missing Data: A Review by Missing Mechanism. IEEE Access, 7, 11651-11667
Other functions to create MAR: delete_MAR_1_to_x(), delete_MAR_one_group(), delete_MAR_rank()
# NOT RUN {
ds <- data.frame(X = 1:20, Y = 101:120)
delete_MAR_censoring(ds, 0.2, "X", "Y")
# many duplicated values can be problematic for sorting = FALSE:
ds_many_dup <- data.frame(X = 1:20, Y = c(rep(0, 10), rep(1, 10)))
delete_MAR_censoring(ds_many_dup, 0.2, "X", "Y") # 4 NAs as expected
quantile(ds_many_dup$Y, 0.2) # 0
# No value is BELOW 0 in ds_many_dup$Y, so no missing values will be created:
delete_MAR_censoring(ds_many_dup, 0.2, "X", "Y", sorting = FALSE) # No NA!
# }