Create missing at random (MAR) values using a censoring mechanism in a data frame or a matrix
delete_MAR_censoring(
ds,
p,
cols_mis,
cols_ctrl,
n_mis_stochastic = FALSE,
where = "lower",
sorting = TRUE,
miss_cols,
ctrl_cols
)
An object of the same class as ds with missing values.
ds: A data frame or matrix in which missing values will be created.
p: A numeric vector of length one or of the same length as cols_mis; the probability that a value is missing.
cols_mis: A vector of column names or indices of columns in which missing values will be created.
cols_ctrl: A vector of column names or indices of columns that control the creation of missing values in cols_mis. Must be of the same length as cols_mis.
n_mis_stochastic: Logical; should the number of missing values be stochastic? If n_mis_stochastic = TRUE, the number of missing values in column cols_mis[i] is a random variable with expected value nrow(ds) * p[i]. If n_mis_stochastic = FALSE, the number of missing values is deterministic; normally it is round(nrow(ds) * p[i]) for column cols_mis[i]. Possible deviations from this value are documented in Details.
where: Controls where missing values are created; one of "lower", "upper" or "both" (see Details).
sorting: Logical; should sorting be used (TRUE), or a quantile as a threshold (FALSE)?
miss_cols: Deprecated, use cols_mis instead.
ctrl_cols: Deprecated, use cols_ctrl instead.
This function creates missing at random (MAR) values in the columns specified by the argument cols_mis. The probability for missing values is controlled by p. If p is a single number, then the overall probability for a value to be missing will be p in all columns of cols_mis. (Internally, p will be replicated to a vector of the same length as cols_mis, so all p[i] in the following sections will be equal to the given single number p.) Otherwise, p must be of the same length as cols_mis. In this case, the overall probability for a value to be missing will be p[i] in the column cols_mis[i].
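The recycling of p can be illustrated with a short base R sketch. The column names are hypothetical, and the snippet only mirrors the rule described above; it is not taken from the package's internal code.

```r
# Illustrative sketch (not the package's internals): a scalar p is
# recycled so that each column in cols_mis gets its own p[i].
cols_mis <- c("X", "Z")   # hypothetical column names
p <- 0.2                  # a single probability

if (length(p) == 1L) {
  p <- rep(p, length(cols_mis))
}

# After recycling, p must match cols_mis in length.
stopifnot(length(p) == length(cols_mis))
p   # 0.2 0.2
```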
The position of the missing values in cols_mis[i] is controlled by cols_ctrl[i]. The following procedure is applied for each pair of cols_ctrl[i] and cols_mis[i] to determine the positions of missing values:
The default behavior (sorting = TRUE) of this function is to first sort the column cols_ctrl[i]. Then missing values in cols_mis[i] are created in the rows with the round(nrow(ds) * p[i]) smallest values of cols_ctrl[i]. In other words, cols_mis[i] is censored in approximately the proportion p[i] of rows that have the smallest values in cols_ctrl[i]; hence the name of the function.
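The sorting-based rule for where = "lower" might look roughly like the following base R sketch. It uses a small made-up data frame and is meant only to illustrate the description above, not to reproduce the package's implementation.

```r
# Sketch of the sorting-based rule for where = "lower"
# (an illustration under stated assumptions, not the package's code).
ds <- data.frame(X = 1:10, Y = c(5, 3, 9, 1, 7, 2, 8, 6, 4, 10))
p <- 0.3

n_mis <- round(nrow(ds) * p)      # 3 rows to censor
ord <- order(ds$Y)                # row indices, smallest Y first
na_rows <- ord[seq_len(n_mis)]    # rows holding the 3 smallest Y values

ds_mis <- ds
ds_mis$X[na_rows] <- NA           # censor X where Y is smallest
which(is.na(ds_mis$X))            # 2 4 6
```

The rows 2, 4 and 6 are the ones where Y takes the values 3, 1 and 2, so X is missing exactly where the control column is smallest.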
If where = "upper", the rows with the highest values are selected instead of the rows with the smallest values. For where = "both", one half of the round(nrow(ds) * p[i]) rows with missing values will be the rows with the smallest values and the other half will be the rows with the highest values, so the censored rows are divided between the highest and the smallest values of cols_ctrl[i]. If round(nrow(ds) * p[i]) is odd, one more value is set to NA among the smallest values.
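A minimal base R sketch of the row selection for where = "upper" and where = "both" follows. The control vector y is hypothetical, and the split for an odd number of rows follows the rule stated above (the extra row goes to the smallest values); the package's own code may be organized differently.

```r
# Sketch: choosing rows for where = "upper" and where = "both".
y <- c(5, 3, 9, 1, 7, 2, 8, 6, 4, 10)   # hypothetical control column
n_mis <- round(length(y) * 0.3)         # 3 rows in total

ord <- order(y)                         # row indices, smallest value first

# where = "upper": rows with the n_mis largest values
upper_rows <- rev(ord)[seq_len(n_mis)]

# where = "both": split between both ends; for odd n_mis the
# extra row is taken from the smallest values
n_low <- ceiling(n_mis / 2)
n_up <- n_mis - n_low
both_rows <- c(ord[seq_len(n_low)], rev(ord)[seq_len(n_up)])

sort(upper_rows)   # 3 7 10
sort(both_rows)    # 4 6 10
```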
If n_mis_stochastic = TRUE and sorting = TRUE, the procedure is slightly altered. In this case, the floor(nrow(ds) * p[i]) rows with the smallest values (where = "lower") are set to NA first. If nrow(ds) * p[i] > floor(nrow(ds) * p[i]), the row with the next greater value will be set to NA with a probability chosen such that the expected number of missing values is nrow(ds) * p[i]. For where = "upper", this "random" missing value will be the next smallest. For where = "both", this "random" missing value will be the next greatest of the smallest values.
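One way to obtain the stated expected value is to censor the additional row with probability equal to the fractional part of nrow(ds) * p[i]. The base R sketch below assumes exactly that for where = "lower"; the source does not spell out the probability, so treat this as an inference rather than the package's documented method.

```r
# Sketch: stochastic number of NAs with sorting, where = "lower".
# Assumption: the extra row is censored with probability equal to the
# fractional part of n * p, which yields the expected value n * p.
set.seed(1)
y <- c(5, 3, 9, 1, 7, 2, 8, 6, 4, 10)   # hypothetical control column
p <- 0.25

n_exact <- length(y) * p        # 2.5 expected missing values
n_floor <- floor(n_exact)       # 2 rows are always censored
frac <- n_exact - n_floor       # 0.5: chance of censoring one more row

extra <- as.integer(runif(1) < frac)    # 1 with probability frac, else 0
ord <- order(y)
na_rows <- ord[seq_len(n_floor + extra)]
length(na_rows)                 # either 2 or 3
```

Under this assumption the expected count is n_floor + frac = 2.5, and the two rows with the smallest values (rows 4 and 6 here) are always censored.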
If sorting = FALSE, the rows of ds will not be sorted. Instead, a quantile will be calculated (using quantile). If where = "lower", quantile(ds[, cols_ctrl[i]], p[i]) will be calculated and all rows with values in ds[, cols_ctrl[i]] below this quantile will have missing values in cols_mis[i]. For where = "upper", quantile(ds[, cols_ctrl[i]], 1 - p[i]) will be calculated and all rows with values above this quantile will have missing values. For where = "both", quantile(ds[, cols_ctrl[i]], p[i] / 2) and quantile(ds[, cols_ctrl[i]], 1 - p[i] / 2) will be calculated. All rows with values in cols_ctrl[i] below the first quantile or above the second quantile will have missing values in cols_mis[i].
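The quantile thresholds can be sketched in base R as follows. The snippet uses quantile() with its default settings and strict comparisons, as the text above describes; it illustrates the rule and is not the package's code.

```r
# Sketch of the quantile-threshold rule (sorting = FALSE).
ds <- data.frame(X = 1:20, Y = 101:120)
p <- 0.2
y <- ds$Y

# where = "lower": rows strictly below the p-quantile
lower_rows <- which(y < quantile(y, p))

# where = "upper": rows strictly above the (1 - p)-quantile
upper_rows <- which(y > quantile(y, 1 - p))

# where = "both": rows below the (p / 2)-quantile or above the
# (1 - p / 2)-quantile
both_rows <- which(y < quantile(y, p / 2) | y > quantile(y, 1 - p / 2))

lower_rows   # 1 2 3 4
upper_rows   # 17 18 19 20
both_rows    # 1 2 19 20
```

With 20 distinct values, each variant selects four rows, which matches nrow(ds) * p. With many ties in the control column the strict comparison can select far fewer rows.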
For sorting = FALSE, only n_mis_stochastic = FALSE is implemented at the moment.
The option sorting = TRUE with n_mis_stochastic = FALSE will always create exactly round(nrow(ds) * p[i]) missing values in cols_mis[i]. With n_mis_stochastic = TRUE, sorting will result in floor(nrow(ds) * p[i]) or ceiling(nrow(ds) * p[i]) missing values in cols_mis[i]. For sorting = FALSE, the number of missing values will normally be close to nrow(ds) * p[i]. But for cols_ctrl with many duplicates, the choice sorting = FALSE can be problematic, because quantile(ds[, cols_ctrl[i]], p[i]) is calculated and only values below this threshold are set to NA (see examples). So, in most cases sorting = TRUE is recommended.
Santos, M. S., Pereira, R. C., Costa, A. F., Soares, J. P., Santos, J., & Abreu, P. H. (2019). Generating Synthetic Missing Data: A Review by Missing Mechanism. IEEE Access, 7, 11651-11667
delete_MNAR_censoring
Other functions to create MAR: delete_MAR_1_to_x(), delete_MAR_one_group(), delete_MAR_rank()
ds <- data.frame(X = 1:20, Y = 101:120)
delete_MAR_censoring(ds, 0.2, "X", "Y")
# many duplicated values can be problematic for sorting = FALSE:
ds_many_dup <- data.frame(X = 1:20, Y = c(rep(0, 10), rep(1, 10)))
delete_MAR_censoring(ds_many_dup, 0.2, "X", "Y") # 4 NAs as expected
quantile(ds_many_dup$Y, 0.2) # 0
# No value is BELOW 0 in ds_many_dup$Y, so no missing values will be created:
delete_MAR_censoring(ds_many_dup, 0.2, "X", "Y", sorting = FALSE) # No NA!