Create missing completely at random (MCAR) values in a data frame or a matrix
delete_MCAR(
  ds,
  p,
  cols_mis = seq_len(ncol(ds)),
  n_mis_stochastic = FALSE,
  p_overall = FALSE,
  miss_cols,
  stochastic
)
An object of the same class as ds with missing values.
ds: A data frame or matrix in which missing values will be created.
p: A numeric vector of length one or of the same length as cols_mis; the probability that a value is missing.
cols_mis: A vector of column names or indices of the columns in which missing values will be created.
n_mis_stochastic: Logical; should the number of missing values be stochastic? If n_mis_stochastic = TRUE, the number of missing values in column cols_mis[i] is a random variable with expected value nrow(ds) * p[i]. If n_mis_stochastic = FALSE, the number of missing values is deterministic; normally it is round(nrow(ds) * p[i]) for column cols_mis[i]. Possible deviations from this value, if any exist, are documented in Details.
p_overall: Logical; see Details.
miss_cols: Deprecated; use cols_mis instead.
stochastic: Deprecated; use n_mis_stochastic instead.
This function creates missing completely at random (MCAR) values in the dataset ds. The proportion of missing values is specified with p. The columns in which missing values are created can be set via cols_mis. If cols_mis is not specified, then missing values are created in all columns of ds.
The probability for missing values is controlled by p. If p is a single number, then the overall probability for a value to be missing will be p in all columns of cols_mis. (Internally, p will be replicated to a vector of the same length as cols_mis, so all p[i] in the following paragraphs will be equal to the given single number p.) Otherwise, p must be of the same length as cols_mis. In this case, the overall probability for a value to be missing will be p[i] in the column cols_mis[i].
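The replication of p described above can be sketched in a few lines of base R. This is only an illustration of the documented behavior, not the package's internal code; the helper name expand_p is hypothetical.

```r
# Hypothetical helper (not part of the package): return one probability
# per column listed in cols_mis.
expand_p <- function(p, cols_mis) {
  if (length(p) == 1L) {
    rep(p, length(cols_mis))        # single number: same p for every column
  } else if (length(p) == length(cols_mis)) {
    p                               # already one p[i] per cols_mis[i]
  } else {
    stop("p must have length one or the same length as cols_mis")
  }
}

p_single <- expand_p(0.2, c("X", "Y"))          # 0.2 for X and for Y
p_vector <- expand_p(c(0.1, 0.3), c("X", "Y"))  # 0.1 for X, 0.3 for Y
```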
If n_mis_stochastic = FALSE and p_overall = FALSE (the default), then exactly round(nrow(ds) * p[i]) values will be set NA in column cols_mis[i]. If n_mis_stochastic = FALSE and p_overall = TRUE, then p must be of length one and exactly round(nrow(ds) * p * length(cols_mis)) values will be set NA (over all columns in cols_mis). This can result in a proportion of missing values in an individual column of cols_mis that is unequal to p, but the proportion of missing values in all columns together will be close to p.
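The two deterministic variants can be illustrated with a minimal base R sketch. It mimics the counts described above under the assumption that rows (or cells) are drawn uniformly without replacement; it is not the package's actual implementation, and all variable names are chosen for illustration only.

```r
set.seed(1)
ds <- data.frame(X = 1:20, Y = 101:120)
cols_mis <- c("X", "Y")
p <- 0.2
n <- nrow(ds)

# p_overall = FALSE: exactly round(n * p) values are deleted in each column.
ds_per_col <- ds
for (col in cols_mis) {
  na_rows <- sample.int(n, round(n * p))
  ds_per_col[na_rows, col] <- NA
}
n_na_per_col <- colSums(is.na(ds_per_col))

# p_overall = TRUE: exactly round(n * p * length(cols_mis)) cells are deleted
# in total, so the per-column counts may differ from round(n * p).
ds_overall <- ds
n_cells <- n * length(cols_mis)
na_cells <- sample.int(n_cells, round(n * p * length(cols_mis)))
row_idx <- (na_cells - 1L) %% n + 1L    # row of each selected cell
col_idx <- (na_cells - 1L) %/% n + 1L   # position of its column in cols_mis
for (k in seq_along(na_cells)) {
  ds_overall[row_idx[k], cols_mis[col_idx[k]]] <- NA
}
n_na_total <- sum(is.na(ds_overall))
```

In this sketch n_na_per_col is fixed by construction, whereas with the overall variant only n_na_total is fixed and the split between X and Y depends on the draw.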
If n_mis_stochastic = TRUE, then each value in column cols_mis[i] has probability p[i] to be missing (independently of all other values). Therefore, the number of missing values in cols_mis[i] is a random variable with a binomial distribution B(nrow(ds), p[i]). This can (and will most of the time) lead to more or fewer missing values than round(nrow(ds) * p[i]) in column cols_mis[i]. If n_mis_stochastic = TRUE, then the argument p_overall is ignored because it is superfluous.
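The stochastic case can be sketched in base R as independent Bernoulli draws per value. This is an illustrative sketch under that assumption, not the package's code; the variable names are hypothetical.

```r
set.seed(1)
ds <- data.frame(X = 1:20, Y = 101:120)
cols_mis <- c("X", "Y")
p <- c(0.2, 0.2)

ds_stoch <- ds
for (i in seq_along(cols_mis)) {
  # Each value is deleted independently with probability p[i].
  is_deleted <- runif(nrow(ds)) < p[i]
  ds_stoch[is_deleted, cols_mis[i]] <- NA
}
n_na_stoch <- colSums(is.na(ds_stoch))  # varies from run to run

# Probability that a column ends up with exactly round(nrow(ds) * p[i])
# missing values under B(20, 0.2); roughly 0.22, so deviations are common.
prob_exact <- dbinom(round(20 * 0.2), size = 20, prob = 0.2)
```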
Santos, M. S., Pereira, R. C., Costa, A. F., Soares, J. P., Santos, J., & Abreu, P. H. (2019). Generating Synthetic Missing Data: A Review by Missing Mechanism. IEEE Access, 7, 11651-11667
ds <- data.frame(X = 1:20, Y = 101:120)
delete_MCAR(ds, 0.2)
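A few further calls may help to show how the arguments interact. The sketch below assumes that delete_MCAR is provided by the missMethods package and that the package is installed; the counts mentioned in the comments follow from the formulas in Details rather than from a recorded run.

```r
if (requireNamespace("missMethods", quietly = TRUE)) {
  library(missMethods)
  ds <- data.frame(X = 1:20, Y = 101:120)

  # One probability per column: round(20 * 0.1) NAs in X, round(20 * 0.3) in Y
  ds_a <- delete_MCAR(ds, p = c(0.1, 0.3), cols_mis = c("X", "Y"))

  # Missing values in column Y only; X is left untouched
  ds_b <- delete_MCAR(ds, p = 0.25, cols_mis = "Y")

  # Stochastic number of missing values: counts follow B(20, 0.2) per column
  ds_c <- delete_MCAR(ds, p = 0.2, n_mis_stochastic = TRUE)

  # Overall proportion: round(20 * 0.2 * 2) NAs spread over both columns
  ds_d <- delete_MCAR(ds, p = 0.2, p_overall = TRUE)

  colSums(is.na(ds_a))
}
```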