missing.compositions: The policy of treatment of missing values in the "compositions" package

Description

This help section discusses some general strategies for working with missing values in a compositional, relative or vectorial context. It shows how the various types of missings are represented and treated in the "compositions" package, according to each strategy/class of analysis of compositions or amounts.

Usage

is.BDL(x,mc=attr(x,"missingClassifier"))
is.SZ(x,mc=attr(x,"missingClassifier"))
is.MAR(x,mc=attr(x,"missingClassifier"))
is.MNAR(x,mc=attr(x,"missingClassifier"))
is.NMV(x,mc=attr(x,"missingClassifier"))
is.WMNAR(x,mc=attr(x,"missingClassifier"))
is.WZERO(x,mc=attr(x,"missingClassifier"))
has.missings(x,...)
# S3 method for default
has.missings(x,mc=attr(x,"missingClassifier"),...)
# S3 method for rmult
has.missings(x,mc=attr(x,"missingClassifier"),...)
SZvalue
MARvalue
MNARvalue
BDLvalue

Value

A logical vector or matrix with the same shape as x, stating whether or not each value is of the given type of missing.

Arguments

x: A vector, matrix, acomp, rcomp, aplus, rplus object for which we would like to know the missing status of the entries
mc: A missing classifier function, giving for each value one of the values BDL (Below Detection Limit), SZ (Structural Zero), MAR (Missing at random), MNAR (Missing not at random), NMV (Not missing value). These functions are introduced to allow a different coding of the missings.
...: further generic arguments

Author

K.Gerald v.d. Boogaart http://www.stat.boogaart.de, Raimon Tolosana Delgado, Matevz Bren

Details

In the context of compositional data we have to consider at least four types of missing and zero values:

MAR: (Missing at random) coded by NaN, the amount was not observed or is otherwise missing, in a way unrelated to its actual value. This is the "nice" type of missing.
MNAR: (Missing not at random) coded by NA, the amount was not observed or is otherwise missing, but it was missed in a way stochastically dependent on its actual value.
BDL: (Below detection limit) coded by 0.0 or a negative number giving the detection limit; the amount was observed but turned out to be below the detection limit and was thus rounded to zero. This is an informative version of MNAR.
SZ: (Structural zero) coded by -Inf, the amount is absolutely zero due to structural reasons. E.g. a soil sample was dried before the analysis, or the sample was preprocessed so that the fraction is removed. Structural zeroes are mainly treated as MAR even though they are a kind of MNAR.
NMV: (Not Missing Value) coded by a real number, it is just an actually-observed value.
WMNAR: (Wider MNAR) includes BDL and MNAR.
WZERO: (Wider Zero) includes BDL and SZ.

Each function of type is.XXX checks the status of its argument according to the XXX type of value from those above.
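
The default coding listed above can be sketched in base R. This is only an illustration under stated assumptions: classify_missing is a hypothetical helper, not a function of the "compositions" package, and it ignores any custom missingClassifier attribute. In the package itself, the is.XXX functions from the Usage section are the intended interface.

```r
# Hypothetical helper (NOT part of "compositions"): label each entry
# according to the default coding described in the Details section.
classify_missing <- function(x) {
  out <- rep(NA_character_, length(x))       # entries matching no rule (e.g. +Inf) stay NA
  out[is.nan(x)]               <- "MAR"      # NaN
  out[is.na(x) & !is.nan(x)]   <- "MNAR"     # NA (note: is.na() is also TRUE for NaN)
  out[is.infinite(x) & x < 0]  <- "SZ"       # -Inf
  out[is.finite(x) & x <= 0]   <- "BDL"      # 0, or a negative number giving the detection limit
  out[is.finite(x) & x > 0]    <- "NMV"      # an actually observed value
  dim(out) <- dim(x)                         # keep the matrix shape, if any
  out
}

x <- c(0.3, NaN, NA, 0, -0.01, -Inf, 1.2)
status <- classify_missing(x)
status
# "NMV" "MAR" "MNAR" "BDL" "BDL" "SZ" "NMV"

# The two wider classes are unions of the basic ones.
wmnar <- status %in% c("BDL", "MNAR")
wzero <- status %in% c("BDL", "SZ")
```

The order of the tests matters only for NaN versus NA, because is.na() returns TRUE for both; all other conditions are mutually exclusive.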

Different steps of a statistical analysis and different understanding of the data will lead to different approaches with respect to missings and zeros.
In the first exploratory step, the problem is to keep the methods working and to make the missing structure visible in the analysis. The user should need as little extra thought about missings as possible and nevertheless get a true picture of the data. To achieve this we tried to make the basic layer of computational functions work consistently with missings and propagate the missingness character seamlessly. However, some of this only works with acomp, where closed-form theories for missings are available (e.g. proportional imputation [e.g. Martín-Fernández, J.A. et al. (2003)] or estimation with missings [Boogaart & Tolosana 2006]). The main graphics should hint at missings and try to add them to the plot by marking the remaining information on the axes. However, one should again be clear that this is only reasonably justified in the relative geometries. Unfortunately, the missing subsystem is currently not fully compatible with the robustness subsystem.
As a second step, the analyst might want to analyse the missing structure in its own right. This is provided in a preliminary way by these functions, since their result can be treated as a boolean data set in any other R function. Additionally, missingSummary is a convenience function that gives a fast overview of the different types of missings in the dataset.
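
As a rough sketch of treating the missing indicators as a boolean data set, the following uses base R tests in place of the package functions, so that it runs without any package loaded. The matrix X and its values are made up for illustration. With the package, is.MAR(X) and is.BDL(X) would take the place of the two base R tests.

```r
# Hypothetical 4 x 3 data set in the default coding (NaN = MAR, 0 = BDL).
# matrix() fills column by column, so each line below is one component.
X <- matrix(c(0.2, NaN, 0.5, 0.1,    # component A
              0.3, 0.4, 0,   0.6,    # component B
              0.5, 0.6, 0.5, NaN),   # component C
            nrow = 4,
            dimnames = list(NULL, c("A", "B", "C")))

mar <- is.nan(X)                 # logical matrix with the same shape as X
bdl <- is.finite(X) & X <= 0     # zeros or negative detection limits

# The logical matrices can be summarised like any other data set.
colSums(mar)                     # number of MAR entries per component
colSums(bdl)                     # number of BDL entries per component
incomplete <- rowSums(mar | bdl) > 0
sum(incomplete)                  # number of cases with at least one missing
```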
In the later inferential steps, the problem is to get results that are valid with respect to a model. One needs to be able to look through the data to the true processes behind them, without being distracted by artifacts stemming from missing values. For the moment, how analyses react to the presence of missings depends on the value of the na.action option. If this is set to na.omit (the default), then cases with missing values in any variable are completely ignored by the analysis. If this is set to na.pass, then some of the following applies.
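
The difference between the two settings can be illustrated with the base R functions na.omit and na.pass applied to a plain data frame. This is only a sketch of the base R behaviour, with made-up values. It makes no claim about how the package's own methods treat the BDL and SZ codes under either setting.

```r
# Hypothetical data: NaN (MAR) in row 2, NA (MNAR) in row 3, 0 (BDL) in row 4.
df <- data.frame(A = c(0.2, NaN, 0.5, 0.1),
                 B = c(0.3, 0.4, NA,  0.6),
                 C = c(0.5, 0.6, 0.5, 0))

getOption("na.action")     # the current setting of the na.action option

# na.omit drops every case with an NA or NaN in any variable.
complete <- na.omit(df)
rownames(complete)         # rows 1 and 4 remain; base R does not treat 0 as missing

# na.pass hands the data on unchanged, missings included.
passed <- na.pass(df)
```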
The policy on how a missing value is to be introduced into the analysis depends on the purpose of the analysis, the type of analysis and the underlying model. In this respect this package, and probably the whole science of compositional data analysis, is still very preliminary.
The four philosophies work with different approaches to these problems:

rplus: For positive real vectors, one can either identify BDL with a true 0 or impute a value relative to the detection limit, with a function like zeroreplace. A structural zero can either be seen as a true zero or as a MAR value.
rcomp and acomp: For these relative geometries, a true zero is alien. A BDL is thus nothing but a small unknown value. We can either replace the value by an imputation or carry this lack of information through the whole analysis. The main problem of imputation is that closing to 1 loses the absolute value of the detection limit, so the same detection limit can correspond to very different portions. Raw differences between components, whether observed or missed (the basis of the rcomp geometry), are completely distorted by the replacement. In contrast, log-ratios between observed components do not change, but ratios between missed components depend dramatically on the replacement. For example, the content of gold is typically some orders of magnitude smaller than the content of silver, even around a gold deposit; far away from the deposit both may be far below the detection limit, leading to a ratio of 1 just because nothing was observed. SZ in compositions can be seen as defining two sub-populations, one fully defined and one where only a subcomposition is defined. But an SZ can also behave very much like a MAR, if only a subcomposition is measured. In general we can thus simply understand that only a subcomposition is available, i.e. a projection of the true value onto a sub-space; this sub-space may differ from observation to observation. For MAR values this approach is strictly valid and yields unbiased estimates (because these projections are stochastically independent of the observed phenomenon). For MNAR values the projections depend on the actual value, which strictly speaking yields biased estimates.
aplus: Imputation takes place by simple replacement of the value. However, this can lead to a dramatic change of ratios and should thus be used only with extra care, for the same reasons explained above.
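
The effect of simple replacement on ratios, as described for rcomp, acomp and aplus above, can be made concrete with a small base R sketch. All numbers are invented, and replacing a BDL by 2/3 of its detection limit is an assumption chosen for illustration, not necessarily what any package function does.

```r
# Hypothetical amounts in ppm; 0 codes a value below the detection limit.
x  <- c(Au = 0,   Ag = 0,   Cu = 120, Zn = 380)
dl <- c(Au = 0.5, Ag = 0.5, Cu = 1,   Zn = 1)    # assumed detection limits

bdl <- is.finite(x) & x <= 0
replaced <- x
replaced[bdl] <- (2 / 3) * dl[bdl]               # assumed replacement rule
closed <- replaced / sum(replaced)               # closure to 1

# Log-ratios between observed components are unaffected.
lr_before <- log(x[["Cu"]] / x[["Zn"]])
lr_after  <- log(closed[["Cu"]] / closed[["Zn"]])
all.equal(lr_before, lr_after)

# The ratio between the two replaced components reflects only the
# replacement rule, not the (unknown) true gold/silver ratio.
closed[["Au"]] / closed[["Ag"]]

# Raw differences between observed components change with the closure.
obs <- x[!bdl] / sum(x[!bdl])                    # closing the observed parts only
diff_before <- obs[["Zn"]] - obs[["Cu"]]
diff_after  <- closed[["Zn"]] - closed[["Cu"]]
```

With equal detection limits for both replaced parts, their ratio is 1 whatever the true contents were, which is the situation described in the gold and silver example.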

References

Boogaart, K.G. v.d., R. Tolosana-Delgado, M. Bren (2006) Concepts for handling of zeros and missing values in compositional data, in E. Pirard (ed.) (2006) Proceedings of the IAMG'2006 Annual Conference on "Quantitative Geology from multiple sources", September 2006, Liege, Belgium, S07-01, 4 pages, http://stat.boogaart.de/Publications/iamg06_s07_01.pdf, ISBN: 978-2-9600644-0-7

Aitchison, J. (1986) The Statistical Analysis of Compositional Data. Monographs on Statistics and Applied Probability. Chapman & Hall Ltd., London (UK). 416p.

Aitchison, J., C. Barceló-Vidal, J.J. Egozcue, V. Pawlowsky-Glahn (2002) A concise guide to the algebraic geometric structure of the simplex, the sample space for compositional data analysis, Terra Nostra, Schriften der Alfred Wegener-Stiftung, 03/2003

Billheimer, D., P. Guttorp, and W.F. Fagan (2001) Statistical interpretation of species composition, Journal of the American Statistical Association, 96 (456), 1205-1214

Martín-Fernández, J.A., C. Barceló-Vidal, and V. Pawlowsky-Glahn (2003) Dealing With Zeros and Missing Values in Compositional Data Sets Using Nonparametric Imputation. Mathematical Geology, 35(3) 253-278

Examples

require(compositions)      # load library
data(SimulatedAmounts)     # load the simulated data sets (sa.missings is used below)
dat <- acomp(sa.missings)
dat
var(dat)
mean(dat)
plot(dat)
boxplot(dat)
barplot(dat)
