This function implements the Box and Cox (1964) method of selecting a power transformation of a variable toward normality, and its generalization by Velilla (1993) to a multivariate response. Cook and Weisberg (1999) and Weisberg (2014) suggest the usefulness of transforming a set of predictors z1, z2, z3 for multivariate normality. It also includes two additional families that allow for negative values.
If the ‘object’ argument is of class ‘lm’ or ‘lmerMod’, the Box-Cox procedure is applied to the conditional distribution of the response given the predictors. For ‘lm’ objects, the response may be multivariate, and each column will have its own transformation. With ‘lmerMod’ objects, the response must be univariate.
The ‘object’ argument may also be a formula. For example, z ~ x1 + x2 + x3 will estimate a transformation for the response z from the chosen family after fitting a linear model with the given formula, and cbind(y1, y2, y3) ~ 1 specifies transformations to multivariate normality with no predictors. A vector value for ‘object’, for example powerTransform(ais$LBM), is equivalent to powerTransform(LBM ~ 1, ais). Similarly, powerTransform(cbind(ais$LBM, ais$SSF)), where the first argument is a matrix rather than a formula, is equivalent to specifying the multivariate linear model powerTransform(cbind(LBM, SSF) ~ 1, ais).
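The equivalence between the vector and formula forms can be checked with a short sketch. It assumes that the car package is installed and that the Wool data frame (with a strictly positive column cycles) is available through carData; both are assumptions made for illustration, and any data frame with a positive column would do. The block is guarded so that it does nothing when car is missing.

```r
# Sketch: a vector 'object' and an intercept-only formula should agree.
# Assumes car is installed; Wool is taken from carData (an assumption).
same_estimate <- NULL
if (requireNamespace("car", quietly = TRUE)) {
  suppressPackageStartupMessages(library(car))   # also attaches carData
  p_vec <- powerTransform(Wool$cycles)           # vector input
  p_frm <- powerTransform(cycles ~ 1, Wool)      # formula with no predictors
  # Compare the estimated powers, ignoring the differing names.
  same_estimate <- isTRUE(all.equal(unname(p_vec$lambda),
                                    unname(p_frm$lambda),
                                    tolerance = 1e-6))
  print(same_estimate)
}
```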
Three families of power transformations are available. The default Box-Cox power family (family="bcPower") effectively replaces a vector by that vector raised to a power, generally in the range from -3 to 3. For powers close to zero, the log transformation is suggested. In practice, after estimating a power using the powerTransform function, a variable would be replaced by a simple power transformation of it: for example, if \(\lambda\approx 0.5\), the corresponding variable would be replaced by its square root; if \(\lambda\) is close enough to zero, the variable would be replaced by its natural logarithm. The Box-Cox family requires the responses to be strictly positive.
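For reference, the scaled power form usually associated with Box and Cox is \((y^\lambda - 1)/\lambda\), with \(\log y\) used at \(\lambda = 0\). The base R sketch below writes this out; box_cox is a hypothetical helper introduced here for illustration only, and the car package supplies its own bcPower function for real use.

```r
# Hypothetical helper: scaled Box-Cox power transformation of a positive vector.
box_cox <- function(y, lambda, tol = 1e-8) {
  stopifnot(all(y > 0))   # the family requires strictly positive data
  if (abs(lambda) < tol) log(y) else (y^lambda - 1) / lambda
}

y <- c(1, 4, 9)
box_cox(y, 0.5)   # equals 2 * (sqrt(y) - 1), a linear function of the square root
box_cox(y, 0)     # natural logarithm
```

Because the scaled form at \(\lambda = 0.5\) is a linear function of the square root, replacing the variable by its plain square root gives an equivalent transformation for modeling purposes.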
The family="bcnPower", or Box-Cox with negatives, family proposed by Hawkins and Weisberg (2017) allows for (a few) non-positive values, while letting the transformed data be interpreted in much the same way as Box-Cox transformed values. This family is the Box-Cox transformation of \(z = 0.5 (y + (y^2 + \gamma^2)^{1/2})\), which depends on a location parameter \(\gamma\). For \(\gamma > 0\), the quantity \(z\) is positive for all values of \(y\). If \(\gamma = 0\) and \(y\) is strictly positive, then the Box-Cox and the bcnPower transformations are identical. When fitting the Box-Cox with negatives family, lambda is restricted to the range [-3, 3], and gamma is restricted to the range from .01 to the largest positive value of the variable, since values outside these ranges are unreasonable in practice.
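The definition of \(z\) can be traced numerically with a small base R sketch. The function bcn below is a hypothetical helper written directly from the formula in the text, not the bcnPower function that car provides, and the values of \(y\) and \(\gamma\) are arbitrary.

```r
# Hypothetical helper, written from the formula for z given in the text.
bcn <- function(y, lambda, gamma, tol = 1e-8) {
  z <- 0.5 * (y + sqrt(y^2 + gamma^2))   # positive whenever gamma > 0
  if (abs(lambda) < tol) log(z) else (z^lambda - 1) / lambda
}

y <- c(-3, 0, 3)
z <- 0.5 * (y + sqrt(y^2 + 4^2))   # with gamma = 4 this gives z = 1, 2, 4
bcn(y, lambda = 0.5, gamma = 4)    # Box-Cox with power 0.5 applied to z

# With gamma = 0 and positive data, z equals y, so this is plain Box-Cox.
pos <- c(1, 4, 9)
bcn(pos, lambda = 0.5, gamma = 0)
```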
The final family, family="yjPower", uses the Yeo-Johnson transformation, which is the Box-Cox transformation of \(U+1\) for nonnegative values of \(U\), and the negative of the Box-Cox transformation of \(|U|+1\) with parameter \(2-\lambda\) for negative \(U\); it thus provides a family for fitting when (a few) observations are negative. Because of the unusual constraints on the powers for positive and negative data, this transformation is not used very often, as results are difficult to interpret. In practical problems, a variable would be replaced by its Yeo-Johnson transformation computed using the yjPower function.
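The two branches of the Yeo-Johnson transformation can be sketched in base R as follows. The function yeo_johnson is a hypothetical helper for illustration, not the yjPower function from car; note the sign change on the negative branch, which is part of the standard Yeo-Johnson definition.

```r
# Hypothetical helper: Yeo-Johnson transformation built from the Box-Cox form.
yeo_johnson <- function(u, lambda, tol = 1e-8) {
  bc <- function(v, lam) if (abs(lam) < tol) log(v) else (v^lam - 1) / lam
  out <- numeric(length(u))
  neg <- u < 0
  out[!neg] <- bc(u[!neg] + 1, lambda)             # nonnegative values
  out[neg]  <- -bc(abs(u[neg]) + 1, 2 - lambda)    # negative values, power 2 - lambda
  out
}

u <- c(-3, 0, 3)
yeo_johnson(u, 1)   # lambda = 1 leaves the data unchanged
yeo_johnson(u, 0)   # log(u + 1) for u >= 0, a power of 2 on the negative side
```

The second call shows why results can be hard to interpret: a single value of lambda applies a logarithm to the nonnegative observations and a quadratic to the negative ones.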
The function testTransform is used to obtain likelihood ratio tests for any specified value for the transformation parameter(s).
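The idea behind such a likelihood ratio test can be sketched in base R for the simplest case, a univariate Box-Cox power with no predictors. This is an illustration under stated assumptions, not the code that testTransform runs: the data are simulated, and the names box_cox, profile_loglik and lr_test are hypothetical.

```r
# Sketch: likelihood ratio test for a Box-Cox power, intercept-only model.
set.seed(1)
y <- rlnorm(200)   # simulated positive data that are normal on the log scale

box_cox <- function(y, lambda) {
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}

# Profile log-likelihood in lambda, up to an additive constant.
profile_loglik <- function(lambda, y) {
  w <- box_cox(y, lambda)
  n <- length(y)
  -n / 2 * log(sum((w - mean(w))^2) / n) + (lambda - 1) * sum(log(y))
}

# Maximize over the usual range of powers.
fit <- optimize(profile_loglik, interval = c(-3, 3), y = y, maximum = TRUE)
lambda_hat <- fit$maximum

# Twice the drop in log-likelihood at lambda0, referred to chi-square with 1 df.
lr_test <- function(lambda0) {
  # max() guards against tiny negative values caused by optimizer tolerance
  lr <- max(0, 2 * (fit$objective - profile_loglik(lambda0, y)))
  c(LRT = lr, df = 1, pval = pchisq(lr, df = 1, lower.tail = FALSE))
}

lr_test(0)   # is the log transformation consistent with the data?
lr_test(1)   # is no transformation consistent with the data?
```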
Computations maximize the likelihood-like functions described by Box and Cox (1964) and by Velilla (2000). For univariate responses, the computations are very stable and problems are unlikely, although for ‘lmer’ models computations may be very slow because the model is refit many times. For multivariate responses with the bcnPower family, the computing algorithm may fail. In this case we recommend adding the argument itmax = 1 to the call to powerTransform. This will return the starting value estimates of the transformation parameters, fitting a d-dimensional response as if all the d responses were independent.