maxCorGrid: (Robust) maximum correlation via alternating series of grid searches

Description

Compute the maximum correlation between two data sets via projection pursuit based on alternating series of grid searches in two-dimensional subspaces of each data set, with a focus on robust and nonparametric methods.

Usage

maxCorGrid(
  x,
  y,
  method = c("spearman", "kendall", "quadrant", "M", "pearson"),
  control = list(...),
  nIterations = 10,
  nAlternate = 10,
  nGrid = 25,
  select = NULL,
  tol = 1e-06,
  standardize = TRUE,
  fallback = FALSE,
  seed = NULL,
  ...
)

Value

An object of class "maxCor" with the following components:

cor: a numeric giving the maximum correlation estimate.
a: numeric; the weighting vector for x.
b: numeric; the weighting vector for y.
centerX: a numeric vector giving the center estimates used in standardization of x.
centerY: a numeric vector giving the center estimates used in standardization of y.
scaleX: a numeric vector giving the scale estimates used in standardization of x.
scaleY: a numeric vector giving the scale estimates used in standardization of y.
call: the matched function call.

Arguments

x, y: each can be a numeric vector, matrix or data frame.
method: a character string specifying the correlation functional to maximize. Possible values are "spearman" for the Spearman correlation, "kendall" for the Kendall correlation, "quadrant" for the quadrant correlation, "M" for the correlation based on a bivariate M-estimator of location and scatter with a Huber loss function, or "pearson" for the classical Pearson correlation (see corFunctions).
control: a list of additional arguments to be passed to the specified correlation functional. If supplied, this takes precedence over additional arguments supplied via the ... argument.
nIterations: an integer giving the maximum number of iterations.
nAlternate: an integer giving the maximum number of alternating series of grid searches in each iteration.
nGrid: an integer giving the number of equally spaced grid points on the unit circle to use in each grid search.
select: optional; either an integer vector of length two or a list containing two index vectors. In the first case, the first integer gives the number of variables of x to be randomly selected for determining the order of the variables of y in the corresponding series of grid searches, and the second integer gives the number of variables of y to be randomly selected for determining the order of the variables of x. In the second case, the first list element gives the indices of the variables of x to be used for determining the order of the variables of y, and the second list element gives the indices of the variables of y to be used for determining the order of the variables of x (see “Details”).
tol: a small positive numeric value to be used for determining convergence.
standardize: a logical indicating whether the data should be (robustly) standardized.
fallback: logical indicating whether a fallback mode for robust standardization should be used. If a correlation functional other than the Pearson correlation is maximized, the first attempt for standardizing the data is via median and MAD. In the fallback mode, variables whose MADs are zero (e.g., dummy variables) are standardized via mean and standard deviation. Note that if the Pearson correlation is maximized, standardization is always done via mean and standard deviation.
seed: optional initial seed for the random number generator (see .Random.seed). This is only used if select specifies the numbers of variables of each data set to be randomly selected for determining the order of the variables of the respective other data set.
...: additional arguments to be passed to the specified correlation functional.
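
The standardization described for the standardize and fallback arguments can be sketched in base R as follows. This is only an illustration under stated assumptions, not the package's implementation: robStandardize is a hypothetical helper, and stats::mad with its default consistency constant is assumed for the MAD.

```r
# Hypothetical helper (not part of ccaPP): robust standardization with an
# optional fallback to mean and standard deviation for zero-MAD columns.
robStandardize <- function(x, fallback = FALSE) {
  x <- as.matrix(x)
  center <- apply(x, 2, median)  # first attempt: median ...
  spread <- apply(x, 2, mad)     # ... and MAD (assumed: stats::mad defaults)
  if (fallback) {
    zeroMAD <- spread == 0       # e.g., dummy variables
    if (any(zeroMAD)) {
      center[zeroMAD] <- colMeans(x[, zeroMAD, drop = FALSE])
      spread[zeroMAD] <- apply(x[, zeroMAD, drop = FALSE], 2, sd)
    }
  }
  # Without the fallback this sketch simply divides by the MAD, which gives
  # non-finite values for zero-MAD columns.
  z <- sweep(sweep(x, 2, center, "-"), 2, spread, "/")
  list(x = z, center = center, scale = spread)
}

# Made-up data: one continuous variable and one dummy variable with MAD zero
set.seed(1)
x <- cbind(continuous = rnorm(20), dummy = rep(c(0, 1), times = c(15, 5)))
fit <- robStandardize(x, fallback = TRUE)
fit$center
fit$scale
```

In this sketch the dummy column is centered by its mean and scaled by its standard deviation, while the continuous column keeps the median and MAD.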

Author

Andreas Alfons

Details

The algorithm is based on alternating series of grid searches in two-dimensional subspaces of each data set. In each grid search, nGrid grid points on the unit circle in the corresponding plane are obtained, and the directions from the center to each of the grid points are examined. In the first iteration, equispaced grid points in the interval \([-\pi/2, \pi/2)\) are used. In each subsequent iteration, the angles are halved such that the interval \([-\pi/4, \pi/4)\) is used in the second iteration and so on. If only one data set is multivariate, the algorithm simplifies to iterative grid searches in two-dimensional subspaces of the corresponding data set.
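
To make the grid construction concrete, the following base R sketch generates the angles for a given iteration and performs a single grid search in a plane. The helpers gridAngles and gridSearch2D are hypothetical and not part of ccaPP. The sketch assumes that the best direction is the one with the largest absolute correlation and uses base cor() with the Spearman method in place of the package's correlation functionals.

```r
# Hypothetical helper: nGrid equispaced angles in [-pi/2^k, pi/2^k)
# for iteration k (pi/2 in the first iteration, pi/4 in the second, ...).
gridAngles <- function(iteration, nGrid = 25) {
  halfWidth <- pi / 2^iteration
  # nGrid + 1 equispaced points including both endpoints; drop the right one
  seq(-halfWidth, halfWidth, length.out = nGrid + 1)[-(nGrid + 1)]
}

# Hypothetical helper: one grid search in the plane spanned by u and v.
# Each angle theta defines the direction (cos(theta), sin(theta)) on the unit
# circle; the projection onto that direction is correlated with z.
gridSearch2D <- function(u, v, z, angles,
                         corFun = function(a, b) cor(a, b, method = "spearman")) {
  values <- sapply(angles, function(theta) {
    corFun(cos(theta) * u + sin(theta) * v, z)
  })
  best <- which.max(abs(values))  # assumption: maximize absolute correlation
  list(angle = angles[best], cor = values[best])
}

# Made-up data in which z depends equally on u and v
set.seed(1)
u <- rnorm(100)
v <- rnorm(100)
z <- u + v + rnorm(100, sd = 0.1)
res <- gridSearch2D(u, v, z, gridAngles(1))
res$angle  # expected to lie near pi/4 for these data
```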

In the basic algorithm, the order of the variables in a series of grid searches for each of the data sets is determined by the average absolute correlations with the variables of the respective other data set. Since this requires computing the full \((p \times q)\) matrix of absolute correlations, where \(p\) denotes the number of variables of x and \(q\) the number of variables of y, a faster modification is available as well. In this modification, the average absolute correlations are computed over only a subset of the variables of the respective other data set. The subsets can either be selected randomly or specified directly (see argument select).
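
The ordering step can be illustrated with a short base R sketch. The helper orderByAvgAbsCor is hypothetical and not part of ccaPP. It uses base cor() with the Spearman method, and it assumes that variables are visited in decreasing order of average absolute correlation, which is not stated above.

```r
# Hypothetical helper: order the variables of x by their average absolute
# correlation with (a subset of) the variables of y.
orderByAvgAbsCor <- function(x, y, subset = seq_len(ncol(y)),
                             method = "spearman") {
  x <- as.matrix(x)
  y <- as.matrix(y)
  # p x length(subset) matrix of absolute correlations
  absCor <- abs(cor(x, y[, subset, drop = FALSE], method = method))
  order(rowMeans(absCor), decreasing = TRUE)  # assumption: decreasing order
}

# Made-up data: the second variable of x is strongly related to y[, 1]
set.seed(1)
y <- matrix(rnorm(200), nrow = 100, ncol = 2)
x <- cbind(rnorm(100), y[, 1] + 0.1 * rnorm(100), rnorm(100))

ordFull <- orderByAvgAbsCor(x, y)                 # full p x q matrix
ordSub <- orderByAvgAbsCor(x, y, subset = 2)      # directly specified subset
ordRand <- orderByAvgAbsCor(x, y, subset = sample(ncol(y), 1))  # random subset
```

The full version needs all \(p \times q\) correlations, whereas the subset versions need only \(p\) times the subset size, which is the source of the speed-up.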

Note that the data sets themselves are also ordered according to the maximum average absolute correlation with the respective other data set to ensure symmetry of the algorithm.

References

A. Alfons, C. Croux and P. Filzmoser (2016) Robust maximum association between data sets: The R Package ccaPP. Austrian Journal of Statistics, 45(1), 71–79.

A. Alfons, C. Croux and P. Filzmoser (2017) Robust maximum association estimators. Journal of the American Statistical Association, 112(517), 436–445.

Examples

library("ccaPP")
data("diabetes")
x <- diabetes$x
y <- diabetes$y

## Spearman correlation
maxCorGrid(x, y, method = "spearman")
maxCorGrid(x, y, method = "spearman", consistent = TRUE)

## Pearson correlation
maxCorGrid(x, y, method = "pearson")