Create missing at random (MAR) values using a ranking mechanism in a data frame or a matrix
delete_MAR_rank(
  ds,
  p,
  cols_mis,
  cols_ctrl,
  n_mis_stochastic = FALSE,
  ties.method = "average",
  miss_cols,
  ctrl_cols
)
An object of the same class as ds with missing values.
ds: A data frame or matrix in which missing values will be created.
p: A numeric vector of length one or of the same length as cols_mis; the probability that a value is missing.
cols_mis: A vector of column names or indices of columns in which missing values will be created.
cols_ctrl: A vector of column names or indices of columns that control the creation of missing values in cols_mis. Must be of the same length as cols_mis.
n_mis_stochastic: Logical, should the number of missing values be stochastic? If n_mis_stochastic = TRUE, the number of missing values for a column cols_mis[i] is a random variable with expected value nrow(ds) * p[i]. If n_mis_stochastic = FALSE, the number of missing values will be deterministic. Normally, the number of missing values for a column cols_mis[i] is round(nrow(ds) * p[i]). Possible deviations from this value, if any exist, are documented in Details.
ties.method: How ties are handled. Passed to rank.
miss_cols: Deprecated, use cols_mis instead.
ctrl_cols: Deprecated, use cols_ctrl instead.
This function creates missing at random (MAR) values in the columns specified by the argument cols_mis.
The probability for missing values is controlled by p. If p is a single number, then the overall probability for a value to be missing will be p in all columns of cols_mis. (Internally, p will be replicated to a vector of the same length as cols_mis. So, all p[i] in the following sections will be equal to the given single number p.) Otherwise, p must be of the same length as cols_mis. In this case, the overall probability for a value to be missing will be p[i] in the column cols_mis[i].
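The handling of p described above can be pictured with a few lines of base R. This is only a sketch: the column names, the row count and the use of rep() are assumptions made for illustration, not the package's internal code.

```r
# Hypothetical inputs, chosen only for illustration
cols_mis <- c("X", "Z")
n_rows <- 20

# A single p is expanded to one probability per column in cols_mis
p_single <- 0.2
p_vec <- rep(p_single, length.out = length(cols_mis))
p_vec

# A vector p gives each column its own probability; with a deterministic
# count, the number of NAs per column is normally round(nrow(ds) * p[i])
p_per_col <- c(0.2, 0.5)
n_mis <- round(n_rows * p_per_col)
n_mis
```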
The position of the missing values in cols_mis[i] is controlled by cols_ctrl[i]. The following procedure is applied for each pair of cols_ctrl[i] and cols_mis[i] to determine the positions of missing values:
First, the probability for a value to be missing is calculated. This probability for a missing value in a row of cols_mis[i] is proportional to the rank of the value in cols_ctrl[i] in the same row. If n_mis_stochastic = FALSE, these probabilities are given to the prob argument of sample. If n_mis_stochastic = TRUE, they are scaled to sum up to nrow(ds) * p[i]. Then, for each probability, a uniformly distributed random number is generated. If this random number is less than the probability, the value in cols_mis[i] is set to NA.
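The procedure above can be sketched in base R for a single pair of columns. This is a minimal illustration under stated assumptions, not the package's actual implementation: the variable names (ranks, probs, ds_fixed, ds_random) and the seed are made up for the example, and edge cases such as very high p are ignored.

```r
set.seed(123)  # only to make this illustration reproducible

ds <- data.frame(X = 1:20, Y = 101:120)
p <- 0.2
n <- nrow(ds)

# Ranks of the control column Y; the missingness weights are proportional to them
ranks <- rank(ds$Y, ties.method = "average")

# Deterministic count: draw round(n * p) rows without replacement,
# passing the ranks as sampling weights via the prob argument
n_mis <- round(n * p)
na_rows_fixed <- sample.int(n, size = n_mis, prob = ranks)
ds_fixed <- ds
ds_fixed$X[na_rows_fixed] <- NA

# Stochastic count: scale the weights so that they sum to n * p,
# then compare each one with a uniformly distributed random number
probs <- ranks / sum(ranks) * n * p
na_rows_random <- runif(n) < probs
ds_random <- ds
ds_random$X[na_rows_random] <- NA

sum(is.na(ds_fixed$X))   # equals round(n * p) in this sketch
sum(is.na(ds_random$X))  # varies from run to run
```

In both variants, rows with larger values of Y are more likely to lose their X value, which is what makes the mechanism MAR: missingness in X depends only on the observed column Y.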
The ranks are calculated via rank. The argument ties.method is passed directly to this function. Possible choices for ties.method are documented in rank.
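To see how the choice of ties.method changes the ranks, and therefore the relative missingness weights, base R's rank() can be called directly. The control vector below is a hypothetical example.

```r
# Hypothetical control column with one pair of tied values
ctrl <- c(10, 20, 20, 30)

r_avg <- rank(ctrl, ties.method = "average")  # tied values share the mean of their ranks
r_min <- rank(ctrl, ties.method = "min")      # tied values both get the lowest rank
r_first <- rank(ctrl, ties.method = "first")  # ties are broken by order of appearance

r_avg
r_min
r_first
```

With "average" or "min", tied rows in the control column receive equal weights; with "first", the later of two tied rows gets a higher rank.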
For high values of p, it is mathematically not possible to get probabilities proportional to the ranks. In this case, a warning is given. This warning can be silenced by setting the option missMethods.warn.too.high.p to FALSE.
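One way to see why high values of p break proportionality is to apply the scaling described for n_mis_stochastic = TRUE and check whether any scaled value exceeds 1. The helper scaled_probs() is hypothetical and exists only for this sketch. The option name is taken from the text above.

```r
n <- 20
ranks <- seq_len(n)  # ranks of a control column without ties

# Hypothetical helper: scale the ranks so that they sum to n * p
scaled_probs <- function(p) ranks / sum(ranks) * n * p

max_low <- max(scaled_probs(0.2))   # below 1, so all values are valid probabilities
max_high <- max(scaled_probs(0.8))  # above 1, so proportionality cannot hold
max_low
max_high

# Silence the warning for too high p by setting the option named above
options(missMethods.warn.too.high.p = FALSE)
```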
Santos, M. S., Pereira, R. C., Costa, A. F., Soares, J. P., Santos, J., & Abreu, P. H. (2019). Generating Synthetic Missing Data: A Review by Missing Mechanism. IEEE Access, 7, 11651-11667.
rank, delete_MNAR_rank
Other functions to create MAR: delete_MAR_1_to_x(), delete_MAR_censoring(), delete_MAR_one_group()
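library(missMethods)
ds <- data.frame(X = 1:20, Y = 101:120)
# Create NAs in 20 % of column X; rows with a higher rank in Y are more likely to be chosen
delete_MAR_rank(ds, 0.2, "X", "Y")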