The Ordered Quantile (ORQ) normalization transformation, orderNorm(), is a rank-based procedure by which the values of a vector are mapped to their percentile, which is then mapped to the same percentile of the normal distribution. In the absence of ties, the percentiles are uniformly distributed, so this essentially guarantees that the transformed values follow a normal distribution.
The transformation is: $$g(x) = \Phi^{-1}\left(\frac{\text{rank}(x) - 0.5}{\text{length}(x)}\right)$$
where \(\Phi\) refers to the standard normal CDF, rank(x) refers to each observation's rank, and length(x) refers to the number of observations.
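The formula can be written directly in base R. The following is a minimal sketch rather than the package's implementation; orq_sketch is a hypothetical helper name used only for illustration.

```r
# Rank-based inverse normal transform, written directly from the formula.
# orq_sketch is a hypothetical name, not part of any package.
orq_sketch <- function(x) {
  # rank() assigns average ranks to tied values by default
  qnorm((rank(x) - 0.5) / length(x))
}

set.seed(1)
x <- rgamma(100, 1, 1)   # right-skewed input
z <- orq_sketch(x)

# The transform is monotone, so the ordering of the data is preserved
all(order(z) == order(x))
```

Because the percentiles (rank - 0.5) / n are symmetric about 0.5 when there are no ties, the transformed values are symmetric about zero.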
By itself, this method is certainly not new; the earliest mention of it that I could find is in a 1947 paper by Bartlett (see references). The formula was outlined explicitly in Van der Waerden (1952) and expounded upon in Beasley et al. (2009). However, there is a key difference in this version of it, as explained below.
Using linear interpolation between these percentiles, the ORQ normalization becomes a 1-1 transformation that can be applied to new data. However, outside of the observed domain of x, it is unclear how to extrapolate the transformation. In the ORQ normalization procedure, a binomial glm with a logit link is fit to the ranks in order to extrapolate beyond the bounds of the original domain of x. The inverse normal CDF is then applied to these extrapolated predictions in order to extrapolate the transformation. This mitigates the influence of heavy-tailed distributions while preserving the 1-1 nature of the transformation. The extrapolation will produce a warning unless warn = FALSE. However, we found that the extrapolation was able to perform very well even on data as heavy-tailed as a Cauchy distribution (paper to be published).
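The interpolation and extrapolation steps can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's code: orq_predict_sketch and tail_z are hypothetical names, and the shift that joins the extrapolated tails to the observed endpoints is an assumption made here to keep the sketch monotone.

```r
set.seed(1)
x <- rgamma(200, 1, 1)
q <- (rank(x) - 0.5) / length(x)   # observed percentiles, strictly inside (0, 1)
z <- qnorm(q)                      # transformed training values

# Logit model of percentile on x; it is only used outside range(x).
# Fractional responses make glm() warn, so the warning is suppressed here.
fit <- suppressWarnings(glm(q ~ x, family = binomial(link = "logit")))
tail_z <- function(v) {
  qnorm(predict(fit, newdata = data.frame(x = v), type = "response"))
}

orq_predict_sketch <- function(newx) {
  # Inside range(x): linear interpolation; rule = 1 returns NA outside it
  out <- approx(x, z, xout = newx, rule = 1)$y
  lo <- newx < min(x)
  hi <- newx > max(x)
  # Outside: glm prediction, shifted (assumption) to join the observed endpoints
  if (any(lo)) out[lo] <- tail_z(newx[lo]) - (tail_z(min(x)) - min(z))
  if (any(hi)) out[hi] <- tail_z(newx[hi]) - (tail_z(max(x)) - max(z))
  out
}

orq_predict_sketch(c(min(x) - 0.01, median(x), max(x) + 1))
```

Since the fitted logit curve is increasing in x, values below the observed minimum map below the smallest transformed value and values above the observed maximum map above the largest, so the ordering of the data is preserved across the whole real line.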
The fit used to perform the extrapolation uses a default of 10000 observations (or length(x) if that is less). This added approximation improves scalability, both computationally and in terms of memory used. Do not set this value too low (e.g. <100), as there is no benefit to doing so. Increase it if your test data set is large relative to 10000 and/or if you are worried about losing signal in the extremes of the range.
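One way to picture the reduced fit is shown below. How the points are chosen is an assumption of this sketch (evenly spaced order statistics, so that both extremes are kept); keep_idx, x_red and q_red are hypothetical names.

```r
set.seed(1)
x <- rgamma(50000, 1, 1)
n_logit_fit <- min(length(x), 10000)

# Assumption: keep evenly spaced order statistics, including both extremes
keep_idx <- round(seq(1, length(x), length.out = n_logit_fit))
x_red <- sort(x)[keep_idx]
q_red <- (rank(x_red) - 0.5) / length(x_red)

# The logit fit now sees n_logit_fit points instead of all 50000
fit <- suppressWarnings(glm(q_red ~ x_red, family = binomial(link = "logit")))
length(x_red)
```

With this choice the smallest and largest observations always enter the fit, which matters because the fit is only used for extrapolation beyond them.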
This transformation can be performed on new data and inverted via the predict function.
orderNorm(x, n_logit_fit = min(length(x), 10000), ..., warn = TRUE)

# S3 method for orderNorm
predict(object, newdata = NULL, inverse = FALSE, warn = TRUE, ...)

# S3 method for orderNorm
print(x, ...)
A list of class orderNorm with elements
transformed original data
original data
number of nonmissing observations
indicator if ties are present
fit to be used for extrapolation, if needed
Pearson's P / degrees of freedom
The predict function returns the numeric value of the transformation performed on new data, and allows for the inverse transformation as well.
x: A vector to normalize
n_logit_fit: Number of points used to fit logit approximation
...: additional arguments
warn: transforms outside observed range or ties will yield warning
object: an object of class 'orderNorm'
newdata: a vector of data to be (reverse) transformed
inverse: if TRUE, performs reverse transformation
Bartlett, M. S. "The Use of Transformations." Biometrics, vol. 3, no. 1, 1947, pp. 39-52. JSTOR www.jstor.org/stable/3001536.
Van der Waerden BL. Order tests for the two-sample problem and their power. 1952;55:453-458. Ser A.
Beasley TM, Erickson S, Allison DB. Rank-based inverse normal transformations are increasingly used, but are they merited? Behav. Genet. 2009;39(5): 580-595. pmid:19526352
boxcox, lambert, bestNormalize, yeojohnson
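library(bestNormalize)
# Simulate right-skewed data
x <- rgamma(100, 1, 1)
# Fit the ORQ transformation and print a summary
orderNorm_obj <- orderNorm(x)
orderNorm_obj
# Transformed values for the original data
p <- predict(orderNorm_obj)
# Invert the transformation and check that the original data are recovered
x2 <- predict(orderNorm_obj, newdata = p, inverse = TRUE)
all.equal(x2, x)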