Learn R Programming

Ecfun (version 0.2-2)

BoxCox: Box-Cox power transformation and its inverse

Description

Box and Cox (1964) considered the following family of transformations indexed by lambda:

w = (y^lambda-1)/lambda

= expm1(lambda*log(y))/lambda,

with the lambda=0 case defined as log(y) to make w continuous in lambda for constant y.

They estimate lambda assuming w follows a normal distribution. This raises a theoretical problem in that y must be positive, which means that w must follow a truncated normal distribution conditioned on lambda*w > (-1).

Bickel and Doksum (1981) removed the restriction to positive y, i.e., to w > (-1/lambda) by modifying the transformaton as follows:

w =

(sgn(y)*abs(y)^lambda-1)/lambda if lambda != 0 and

sgn(y)*log(abs(y)) if lambda = 0,

where sgn(y) = 1 if y >= 0 and -1 otherwise.

NOTE: sgn(y) is different from sign(y), which is 0 for y = 0. A two-argument update to the sign function in the base package has been added to this Ecfun package, so sign(y, 1) = sgn(y).

If (y<0), this transformation is discontinuous at lambda = 0. To see this, we rewrite this as

w = (sgn(y)*expm1(lambda*log(abs(y))) + (sgn(y)-1)) / lambda

= sgn(y)*(log(abs(y)) + O(lambda) + (sgn(y)-1)/lambda,

where O(lambda) indicates a term that is dominated by a constant times lambda.

If y<0, this latter term (sgn(y)-1)/lambda = (-2)/lambda and becomes Inf as lambda -> 0.

In practice, we assume that y > 0, so this distinction has little practical value. However, the BoxCox function computes the Bickel-Doksum version.

Box and Cox further noted that proper estimation of lambda should include the Jacobian of the transformation in the log(likelihood). Doing this can be achieved by rescaling the transformation with the nth root of the Jacobian, which can be written as follows:

j(y, lambda) = J(y, lambda)^(1/n) = GeometricMean(y)^(lambda-1).

With this the rescaled power transformation is as follows:

z = (y^lambda-1)/(lambda*j(y, lambda) if lambda!=0 or GeometricMean(y)*log(y) if lambda==0.

In addition to facilitating estimation of lambda, rescaling has the advantage that the units of z are the same as the units of y.

The ouput has class 'BoxCox', which has attributes that allow the input to be recovered using invBoxCox. The default values of the arguments of invBoxCox are provided by the corresponding attributes of z.

Usage

BoxCox(y, lambda, rescale=TRUE, na.rm=rescale) 
  invBoxCox(z, lambda, sign.y, GeometricMean, rescale)

Arguments

y

a numeric vector for which the power transform is desired

lambda

A numeric vector of length 1 or 2. The first component is the power. If the second component is provided, y is replaced by y+lambda[2].

rescale

logical or numeric. If logical:

For BoxCox, this is TRUE to return the power transform with rescale, z, above, and FALSE to return the power transform without the nth root of the Jacobian, w, above. This defaults to TRUE, because this will give z the same units as y.

For invBoxCox, this is TRUE if the input argument z is assumed to have been rescaled by the nth root of the Jacobian of the transformation. This defaults to a rescale attribute of z if present or to TRUE if absent.

If numeric, it is assumed to be the geometric mean of another set of y values to use with new y's.

na.rm

logical: TRUE to remove NAs from y before computing the geometric mean.

FALSE to compute NA for the geometric mean if any(is.na(y)).

NOTE: If na.rm = FALSE, the output will be all NA if rescale = TRUE. This could produce non usable answers in most cases. To avoid that, the default for na.rm is TRUE whenever rescale = TRUE. Conversely, applications using na.rm = FALSE will likely also want rescale = FALSE to avoid returning a non-answer in these cases. This explains the default na.rm = rescale.

z

a numeric vector or an object of class BoxCox for which the inverse Box-Cox transform is desired.

sign.y

an optional logical vector giving sign(y-lambda[2]) of the data values that presumably generated z. Defaults to an sign.y attribute of z or to rep(1, length(z)) if no such attribute is present.

GeometricMean

an optional numeric scalar giving the geometric mean of the data values that presumably generated z. Defaults to a GeometricMean attribute of z or to 1 if no such attribute is present.

Value

BoxCox returns an object of class BoxCox, being a numeric vector of the same length as y with the following optional attributes:

  • lambdathe value of lambda used in the transformation

  • sign.y sign(y) (or sign(y-lambda[2]) lambda[2] is provided and if any of these quantities are negative. Otherwise, this is omitted and all are assumed to be positive.

  • rescale logical: TRUE if z(y, lambda) is returned rescaled by g^(lambda-1) with g = the geometric mean of y and FALSE if z(y, lambda) is not so rescaled.

  • GeometricMean If rescale is numeric, attr(., 'GeometricMean') <- rescale. Otherwise, attr(., 'GeometricMean') is the Geometric mean of abs(y) = exp(mean(log(abs(y)))) or of abs(y+lambda[2]) if(length(lambda)>1).

invBoxCox returns a numeric vector, reconstructing y from BoxCox(y, ...).

Details

Box and Cox (1964) discussed

w(y, lambda) = (y^lambda - 1)/lambda.

They noted that w is continuous in lambda with w(y, lambda) = log(y) if lambda = 0 (by l'Hopital's rule).

They also discussed

z(y, lambda) = (y^lambda - 1)/(lambda*g^(lambda-1)),

where g = the geometric mean of y.

They noted that proper estimation of lambda should include the Jacobian of w(y, lambda) with the likelihood. They further showed that a naive normal likelihood using z(y, lambda) as the response without a Jacobian is equivalent to the normal likelihood using w(y, lambda) adjusted appropriately using the Jacobian. See Box and Cox (1964) or the Wikipedia article on "Power transform".

Bickel and Doksum (1981) suggested adding sign(y) to the transformation, as discussed above.

NUMERICAL ANALYSIS:

Consider the Bickel and Doksum version described above:

w <- (sign(y)*abs(y)^lambda-1)/lambda

if(any(y==0)), GeometricMean(y) = 0. This creates a problem with the above math.

Let ly = log(abs(y)). Then with la = lambda,

w = code(sign(y)*exp(la*ly)-1)/la

= sign(y)*ly(1+(la*ly/2)*(1+(la*ly/3)*(1+(la*ly/4)*(1+O(la*ly))))) + (sign(y)-1)/la

For y>0, the last term is zero. boxcox ignores cases with y<=0 and uses this formula (ignoring the final O(la*ly)) whenever abs(la) <= eps = 1/50. That form is used here also.

For invBoxCox a complementary analysis is as follows:

abs(y+lambda[2]) = abs(1+la*w)^(1/la)

= exp(log1p(la*w)/la) for abs(la*w)<1

= w*(1-la*w*((1/2)-la*w*((1/3)-la*w*(1/4-...))))

References

Wikipedia, "Power transform"

See Also

boxcox in the MASS package

quine in the MASS package for data used in an example below.

boxcox and boxcoxCensored in the EnvStats package.

boxcox.drc in the drc package.

boxCox in the car package.

These other uses all wrap the Box-Cox transformation in something larger and do not give the transformation itself directly.

Examples

Run this code
# NOT RUN {
##
## 1.  A simple example to check the two algorithms 
##
Days <- 0:9
bc1 <- BoxCox(Days, c(0.01, 1))
# Taylor expansion used for obs 1:7; expm1 for 8:10 

# check 
GM <- exp(mean(log(abs(Days+1))))

bc0 <- (((Days+1)^0.01)-1)/0.01
bc1. <- (bc0 / (GM^(0.01-1)))  
# log(Days+1) ranges from 0 to 4.4 
# lambda = 0.01 will invoke both the obvious
# algorithm and the alternative assumed to be 
# more accurate for (lambda(log(y)) < 0.02).  
attr(bc1., 'lambda') <- c(0.01, 1)  
attr(bc1., 'rescale') <- TRUE 
attr(bc1., 'GeometricMean') <- GM 
class(bc1.) <- 'BoxCox'

# }
# NOT RUN {
all.equal(bc1, bc1.)
# }
# NOT RUN {
##
## 2.  another simple example with lambda=0
##
bc0.4 <- BoxCox(1:5, 0)
GM5 <- prod(1:5)^.2
bc0.4. <- log(1:5)*GM5
attr(bc0.4., 'lambda') <- 0  
attr(bc0.4., 'rescale') <- TRUE 
attr(bc0.4., 'GeometricMean') <- GM5 
class(bc0.4.) <- 'BoxCox'

# }
# NOT RUN {
all.equal(bc0.4, bc0.4.)
# }
# NOT RUN {
bc0.4e9 <- BoxCox(1:5, .Machine$double.eps)
bc0.4ex <- log(1:5)*exp(mean(log(1:5)))
# }
# NOT RUN {
all.equal(bc0.4ex, as.numeric(bc0.4e9))
# }
# NOT RUN {
# now invert:  

bc0.4i <- invBoxCox(bc0.4.)
# }
# NOT RUN {
all.equal(1:5, bc0.4i)
# }
# NOT RUN {
# }
# NOT RUN {
all.equal(1:5, invBoxCox(bc0.4e9))
# }
# NOT RUN {
##
## 3.  The "boxcox" function in the MASS package 
##     computes a maximum likelihood estimate with  
##     BoxCox(Days+1, lambda=0.21) 
##     with a 95 percent confidence interval of 
##     approximately (0.08, 0.35)
##
bcDays1 <- BoxCox(MASS::quine$Days, c(0.21, 1))

# check 
GeoMean <- exp(mean(log(abs(MASS::quine$Days+1))))

bcDays1. <- ((((MASS::quine$Days+1)^0.21)-1) / 
               (0.21*GeoMean^(0.21-1)))  
# log(Days+1) ranges from 0 to 4.4 
attr(bcDays1., 'lambda') <- c(0.21, 1)  
attr(bcDays1., 'rescale') <- TRUE 
attr(bcDays1., 'GeometricMean') <- GeoMean 
class(bcDays1.) <- 'BoxCox'

# }
# NOT RUN {
all.equal(bcDays1, bcDays1.)
# }
# NOT RUN {
iDays <- invBoxCox(bcDays1)
# }
# NOT RUN {
all.equal(iDays, MASS::quine$Days)
# }
# NOT RUN {
##
## 4.  Easily computed example 
##
bc2 <- BoxCox(c(1, 4), 2)

# check 
bc2. <- (c(1, 4)^2-1)/4
attr(bc2., 'lambda') <- 2
attr(bc2., 'rescale') <- TRUE 
attr(bc2., 'GeometricMean') <- 2 
class(bc2.) <- 'BoxCox'

# }
# NOT RUN {
all.equal(bc2, bc2.)
# }
# NOT RUN {
# }
# NOT RUN {
all.equal(invBoxCox(bc2), c(1, 4))
# }
# NOT RUN {
##
## 5.  plot(BoxCox())
##
y0 <- seq(-2, 2, .1)
z2 <- BoxCox(y0, 2, rescale=FALSE)
plot(y0, z2)

# check 
z2. <- (sign(y0)*y0^2-1)/2

attr(z2., 'lambda') <- 2
attr(z2., 'sign.y') <- sign(y0, 1)
attr(z2., 'rescale') <- FALSE 
attr(z2., 'GeometricMean') <- 0
class(z2.) <- 'BoxCox'

# }
# NOT RUN {
all.equal(z2, z2.)
# }
# NOT RUN {
# }
# NOT RUN {
all.equal(invBoxCox(z2), y0)
# }
# NOT RUN {
# }

Run the code above in your browser using DataLab