Create missing completely at random (MCAR) values in a data frame or a matrix
delete_MCAR(
ds,
p,
cols_mis = seq_len(ncol(ds)),
stochastic = FALSE,
p_overall = FALSE,
miss_cols
)
A data frame or matrix in which missing values will be created.
A numeric vector of length one or of the same length as cols_mis; the probability that a value is missing.
A vector of column names or indices of columns in which missing values will be created.
Logical; see details.
Logical; see details.
Deprecated, use cols_mis instead.
An object of the same class as ds with missing values.
This function creates missing completely at random (MCAR) values in the dataset ds. The proportion of missing values is specified with p. The columns in which missing values are created can be set via cols_mis. If cols_mis is not specified, then missing values are created in every column.
The probability for missing values is controlled by p. If p is a single number, then the overall probability for a value to be missing will be p in all columns of cols_mis. (Internally, p will be replicated to a vector of the same length as cols_mis. So, all p[i] in the following sections will be equal to the given single number p.) Otherwise, p must be of the same length as cols_mis. In this case, the overall probability for a value to be missing will be p[i] in the column cols_mis[i].
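The two ways of supplying p can be sketched in base R. This is only an illustration of the recycling rule described above, not the package's internal code; the data frame and the names p_single, p_each and p_by_col are made up for the example.

```r
# Illustrative data; column Y is left out of cols_mis on purpose.
ds <- data.frame(X = 1:20, Y = 101:120, Z = 201:220)
cols_mis <- c("X", "Z")

# Case 1: a single p is recycled to one entry per column in cols_mis.
p_single <- rep_len(0.2, length(cols_mis))

# Case 2: one probability per column; the lengths must match.
p_each <- c(0.1, 0.3)
stopifnot(length(p_each) == length(cols_mis))

# Pair each probability with its column: p_each[i] belongs to cols_mis[i].
p_by_col <- setNames(p_each, cols_mis)
print(p_by_col)
```

With the second form, column X would receive missing values with probability 0.1 and column Z with probability 0.3, while Y stays complete.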
If stochastic = FALSE and p_overall = FALSE (the default), then exactly round(nrow(ds) * p[i]) values will be set to NA in column cols_mis[i]. If stochastic = FALSE and p_overall = TRUE, then p must be of length one and exactly round(nrow(ds) * p * length(cols_mis)) values will be set to NA (over all columns in cols_mis). This can result in a proportion of missing values in an individual column of cols_mis that differs from p, but the proportion of missing values in all columns together will be close to p.
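The difference between the two non-stochastic modes can be made concrete with a small base R sketch. It mimics the counts described above and is not the implementation used by delete_MCAR; the names per_col, overall, n_mis and n_total are hypothetical.

```r
set.seed(1)
ds <- data.frame(X = 1:20, Y = 101:120)
cols_mis <- c("X", "Y")
p <- 0.2

# Like p_overall = FALSE: exactly round(nrow(ds) * p) NAs in each column.
per_col <- ds
n_mis <- round(nrow(ds) * p)
for (col in cols_mis) {
  per_col[sample.int(nrow(ds), n_mis), col] <- NA
}
colSums(is.na(per_col))

# Like p_overall = TRUE: a fixed total number of NAs, drawn over all cells
# of the selected columns, so per-column counts may differ.
overall <- ds
n_total <- round(nrow(ds) * p * length(cols_mis))
cells <- sample.int(nrow(ds) * length(cols_mis), n_total)
rows <- (cells - 1) %% nrow(ds) + 1              # row index of each cell
cols <- cols_mis[(cells - 1) %/% nrow(ds) + 1]   # column name of each cell
for (k in seq_along(cells)) {
  overall[rows[k], cols[k]] <- NA
}
colSums(is.na(overall))
sum(is.na(overall))
```

In the first variant every column has the same number of missing values; in the second only the total is fixed, so one column can end up with more missing values than the other.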
If stochastic = TRUE, then each value in column cols_mis[i] has the probability p[i] to be missing. In this case, the number of missing values in cols_mis[i] is a random variable with a binomial distribution B(nrow(ds), p[i]). This can (and will most of the time) lead to more or fewer missing values than round(nrow(ds) * p[i]) in each column. If stochastic = TRUE, then the argument p_overall is ignored because it is superfluous.
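A base R sketch of the stochastic case, assuming independent draws per value as described above (it is not the package's code, and the names stoch and p_exact are illustrative):

```r
set.seed(42)
ds <- data.frame(X = 1:20, Y = 101:120)
cols_mis <- c("X", "Y")
p <- c(0.2, 0.2)

stoch <- ds
for (i in seq_along(cols_mis)) {
  # Each value is dropped independently with probability p[i].
  drop <- runif(nrow(ds)) < p[i]
  stoch[drop, cols_mis[i]] <- NA
}
colSums(is.na(stoch))  # counts vary from run to run and seed to seed

# Probability that a column contains exactly round(n * p[1]) missing values
# under the binomial distribution B(n, p[1]).
n <- nrow(ds)
p_exact <- dbinom(round(n * p[1]), size = n, prob = p[1])
p_exact
```

For n = 20 and p = 0.2 this probability is about 0.22, so in most runs the count of missing values in a column differs from round(nrow(ds) * p[i]).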
Santos, M. S., Pereira, R. C., Costa, A. F., Soares, J. P., Santos, J., & Abreu, P. H. (2019). Generating Synthetic Missing Data: A Review by Missing Mechanism. IEEE Access, 7, 11651-11667
# delete_MCAR() is provided by the missMethods package
library(missMethods)
ds <- data.frame(X = 1:20, Y = 101:120)
delete_MCAR(ds, 0.2)