na.test: Missing Completely at Random (MCAR) Test

Description

This function performs Little's Missing Completely at Random (MCAR) test and Jamshidian and Jalal's approach for testing the MCAR assumption. By default, the function performs the Jamshidian and Jalal's approach.

Usage

na.test(..., data = NULL, print = c("all", "little", "jamjal"),
        impdat = NULL, delete = 6, method = c("npar", "normal"),
        m = 20, seed = 123, nrep = 10000, n.min = 30,
        pool = c("m", "med", "min", "max", "random"),
        alpha = 0.05, digits = 2, p.digits = 3, as.na = NULL,
        write = NULL, append = TRUE, check = TRUE, output = TRUE)

Value

Returns an object of class misty.object, which is a list with following entries:

call: function call
type: type of analysis
data: matrix or data frame specified in x
args: specification of function arguments
result: list with result tables, i.e., little for the result table of the Little's MCAR test, jamjal for the list with results of the Jamshidian and Jalal's approach, hawkins for the result table of the Hawkins test, and anderson for the result table of the Anderson-Darling non-parametric test

Arguments

...: a matrix or data frame with incomplete data, where missing values are coded as NA. Alternatively, an expression indicating the variable names in data e.g., na.test(x1, x2, x3, data = dat). Note that the operators ., +, -, ~, :, ::, and ! can also be used to select variables, see 'Details' in the df.subset function.
data: a data frame when specifying one or more variables in the argument .... Note that the argument is NULL when specifying a matrix or data frame for the argument ....
print: a character vector indicating which results to be printed on the console, i.e. "all" for Little's MCAR test and Jamshidian and Jalal's approach, "little" for Little's MCAR test, and "jamjal" (default) for Jamshidian and Jalal's approach.
impdat: an object of class mids from the mice package to provide a data set multiply imputed in the mice package. The function will not impute the data data set specified in the argument data when specifying this argument and will use the imputed data sets provided in the argument impdat for performing the Jamshidian and Jalal's approach. Note that the argument data still needs to be specified because the variables used for the analysis are extracted from the data frame specified in data.
delete: an integer value indicating missing data patterns consisting of delete number of cases or less removed from the Jamshidian and Jalal's approach. The default setting is delete = 6.
method: a character string indicating the imputation method, i.e., "npar" for using a non-parametric imputation method by Sirvastava and Dolatabadi (2009) or "normal" for imputing missing data assuming that the data come from a multivariate normal distribution (see Jamshidian & Jalal, 2010).
m: an integer value indicating the number of multiple imputations. The default setting is m = 20.
seed: an integer value that is used as argument by the set.seed function for offsetting the random number generator before performing Jamshidian and Jalal's approach. The default setting is seed = 123. Set the value to NULL to specify a system selected seed.
nrep: an integer value indicating the replications used to simulate the Neyman distribution to determine the cut off value for the Neyman test. Larger values increase the accuracy of the Neyman test. The default setting is nrep = 10000.
n.min: an integer value indicating the minimum number of cases in a group that triggers the use of asymptotic Chi-square distribution in place of the empirical distribution in the Neyman test of uniformity.
pool: a character string indicating the pooling method, i.e., "m" for computing the average test statistic and p-values, "med" for computing the median test statistic and p-values, "min" for computing the maximum test statistic and minimum p-values, "max" for computing the minimum test statistic and maximum p-values, and "random" for randomly choosing a test statistic and corresponding p-value from repeated complete data analyses. The default setting is pool = "med", i.e., median test statistic and p-values are computed as suggested by Eekhout, Wiel and Heymans (2017).
alpha: a numeric value between 0 and 1 indicating the significance level of the Hawkins test. The default setting is alpha = 0.05, i.e., the Anderson-Darling non-parametric test is provided when the p-value of the Hawkins test is less than or equal 0.05.
digits: an integer value indicating the number of decimal places to be used for displaying results.
p.digits: an integer value indicating the number of decimal places to be used for displaying the p-value.
as.na: a numeric vector indicating user-defined missing values, i.e. these values are converted to NA before conducting the analysis.
write: a character string naming a text file with file extension ".txt" (e.g., "Output.txt") for writing the output into a text file.
append: logical: if TRUE (default), output will be appended to an existing text file with extension .txt specified in write, if FALSE existing text file will be overwritten.
check: logical: if TRUE (default), argument specification is checked.
output: logical: if TRUE (default), output is shown.

Author

Takuya Yanagida takuya.yanagida@univie.ac.at

Details

Little's MCAR Test

Little (1988) proposed a multivariate test of Missing Completely at Random (MCAR) that tests for mean differences on every variable in the data set across subgroups that share the same missing data pattern by comparing the observed variable means for each pattern of missing data with the expected population means estimated using the expectation-maximization (EM) algorithm (i.e., EM maximum likelihood estimates). The test statistic is the sum of the squared standardized differences between the subsample means and the expected population means weighted by the estimated variance-covariance matrix and the number of observations within each subgroup (Enders, 2010). Under the null hypothesis that data are MCAR, the test statistic follows asymptotically a chi-square distribution with \(\sum k_j - k\) degrees of freedom, where \(k_j\) is the number of complete variables for missing data pattern \(j\), and \(k\) is the total number of variables. A statistically significant result provides evidence against MCAR.

Note that Little's MCAR test has a number of problems (see Enders, 2010).

First, the test does not identify the specific variables that violates MCAR, i.e., the test does not identify potential correlates of missingness (i.e., auxiliary variables).
Second, the test is based on multivariate normality, i.e., under departure from the normality assumption the test might be unreliable unless the sample size is large and is not suitable for categorical variables.
Third, the test investigates mean differences assuming that the missing data pattern share a common covariance matrix, i.e., the test cannot detect covariance-based deviations from MCAR stemming from a Missing at Random (MAR) or Missing Not at Random (MNAR) mechanism because MAR and MNAR mechanisms can also produce missing data subgroups with equal means.
Fourth, simulation studies suggest that Little's MCAR test suffers from low statistical power, particularly when the number of variables that violate MCAR is small, the relationship between the data and missingness is weak, or the data are MNAR (Thoemmes & Enders, 2007).
Fifth, the test can only reject, but cannot prove the MCAR assumption, i.e., a statistically not significant result and failing to reject the null hypothesis of the MCAR test does not prove the null hypothesis that the data is MCAR.
Sixth, under the null hypothesis the data are actually MCAR or MNAR, while a statistically significant result indicates that missing data are MAR or MNAR, i.e., MNAR cannot be ruled out regardless of the result of the test.

The function for performing Little's MCAR test is based on the mlest function from the mvnmle package which can handle up to 50 variables. Note that the mcar_test function in the naniar package is based on the prelim.norm function from the norm package. This function can handle about 30 variables, but with more than 30 variables specified in the argument data, the prelim.norm function might run into numerical problems leading to results that are not trustworthy (i.e., p.value = 1). In that case, the warning message In norm::prelim.norm(data) : NAs introduced by coercion to integer range is printed on the console.

Jamshidian and Jalal's Approach for Testing MCAR

Jamshidian and Jalal (2010) proposed an approach for testing the Missing Completely at Random (MCAR) assumption based on two tests of multivariate normality and homogeneity of covariances among groups of cases with identical missing data patterns:

In the first step, missing data are multiply imputed (m = 20 times by default) using a non-parametric imputation method (method = "npar" by default) by Sirvastava and Dolatabadi (2009) or using a parametric imputation method assuming multivariate normality of data (method = "normal") for each group of cases sharing a common missing data pattern.
In the second step, a modified Hawkins test for multivariate normality and homogeneity of covariances applicable to complete data consisting of groups with a small number of cases is performed. A statistically not significant result indicates no evidence against multivariate normality of data or homogeneity of covariances, while a statistically significant result provides evidence against multivariate normality of data or homogeneity of covariances (i.e., violation of the MCAR assumption). Note that the Hawkins test is a test of multivariate normality as well as homogeneity of covariance. Hence, a statistically significant test is ambiguous unless the researcher assumes multivariate normality of data.
In the third step, if the Hawkins test is statistically significant, the Anderson-Darling non-parametric test is performed. A statistically not significant result indicates evidence against multivariate normality of data but no evidence against homogeneity of covariances, while a statistically significant result provides evidence against homogeneity of covariances (i.e., violation of the MCAR assumption). However, no conclusions can be made about the multivariate normality of data when the Anderson-Darling non-parametric test is statistically significant.

In summary, a statistically significant result of both the Hawkins and the Anderson-Darling non-parametric test provides evidence against the MCAR assumption. The test statistic and the significance values of the Hawkins test and the Anderson-Darling non-parametric based on multiply imputed data sets are pooled by computing the median test statistic and significance value (pool = "med" by default) as suggested by Eekhout, Wiel, and Heymans (2017).

Note that out of the problems listed for the Little's MCAR test the first, second (i.e., approach is not suitable for categorical variables), fifth, and sixth problems also apply to the Jamshidian and Jalal's approach for testing the MCAR assumption.

In practice, rejecting or not rejecting the MCAR assumption may not be relevant as modern missing data handling methods like full information maximum likelihood (FIML) estimation, Bayesian estimation, or multiple imputation are asymptotically valid under the missing at random (MAR) assumption (Jamshidian & Yuan, 2014). It is more important to distinguish MAR from missing not at random (MNAR), but MAR and MNAR mechanisms cannot be distinguished without auxiliary information.

References

Beaujean, A. A. (2012). BaylorEdPsych: R Package for Baylor University Educational Psychology Quantitative Courses. R package version 0.5. http://cran.nexr.com/web/packages/BaylorEdPsych/index.html

Eekhout, I., M. A. Wiel, & M. W. Heymans (2017). Methods for significance testing of categorical covariates in logistic regression models after multiple imputation: Power and applicability analysis. BMC Medical Research Methodology, 17:129. https://doi.org/10.1186/s12874-017-0404-7

Enders, C. K. (2010). Applied missing data analysis. Guilford Press.

Little, R. J. A. (1988). A test of Missing Completely at Random for multivariate data with missing values. Journal of the American Statistical Association, 83, 1198-1202. https://doi.org/10.2307/2290157

Jamshidian, M., & Jalal, S. (2010). Tests of homoscedasticity, normality, and missing completely at random for incomplete multivariate data. Psychometrika, 75(4), 649-674. https://doi.org/10.1007/s11336-010-9175-3

Jamshidian, M., & Yuan, K.H. (2014). Examining missing data mechanisms via homogeneity of parameters, homogeneity of distributions, and multivariate normality. WIREs Computational Statistics, 6(1), 56-73. https://doi.org/10.1002/wics.1287

Mortaza, J., Siavash, J., Camden, J., & Kobayashi, M. (2024). MissMech: Testing Homoscedasticity, Multivariate Normality, and Missing Completely at Random. R package version 1.0.4. https://doi.org/10.32614/CRAN.package.MissMech

Srivastava, M.S., & Dolatabadi, M. (2009). Multiple imputation and other resampling scheme for imputing missing observations. Journal of Multivariate Analysis, 100, 1919-1937. https://doi.org/10.1016/j.jmva.2009.06.003

Thoemmes, F., & Enders, C. K. (2007, April). A structural equation model for testing whether data are missing completely at random. Paper presented at the annual meeting of the American Educational Research Association, Chicago, IL.

Examples

Run this code

# Example 1a: Perform Little's MCAR test and Jamshidian and Jalal's approach
na.test(airquality)

# Example b: Alternative specification using the 'data' argument,
na.test(., data = airquality)

if (FALSE) {
# Example 2: Write results into a text file
na.test(airquality, write = "NA_Test.txt")
}

Run the code above in your browser using DataLab