
fairml (version 0.8)

fairml.cv: Cross-Validation for Fair Models

Description

Cross-validation for the models in the fairml package.

Usage

fairml.cv(response, predictors, sensitive, method = "k-fold", ..., unfairness,
  model, model.args = list(), cluster)

cv.loss(x)

cv.unfairness(x)

cv.folds(x)

Value

fairml.cv() returns an object of class fair.kcv.list if runs is at least 2, and an object of class fair.kcv if runs is equal to 1.

cv.loss() returns a numeric vector or a numeric matrix containing the values of the loss function computed for each run of cross-validation.

cv.unfairness() returns a numeric vector containing the values of the unfairness criterion computed on the validation folds for each run of cross-validation.

cv.folds() returns a list containing the indexes of the observations in each of the cross-validation folds. In the case of k-fold cross-validation, if runs is larger than 1, each element of the list is itself a list with the indexes for the observations in each fold in each run.

Arguments

response

a numeric vector, the response variable.

predictors

a numeric matrix or a data frame containing numeric and factor columns; the predictors.

sensitive

a numeric matrix or a data frame containing numeric and factor columns; the sensitive attributes.

method

a character string, either "k-fold", "custom-folds" or "hold-out". See below for details.

...

additional arguments for the cross-validation method.

unfairness

a positive number in [0, 1], the proportion of the explained variance that can be attributed to the sensitive attributes.

model

a character string, the label of the model. Currently "nclm", "frrm", "fgrrm", "zlm" and "zlrm" are available.

model.args

additional arguments passed to model estimation.

cluster

an optional cluster object from package parallel, to process folds or subsamples in parallel.

x

an object of class fair.kcv or fair.kcv.list.

Author

Marco Scutari

Details

The following cross-validation methods are implemented:

  • k-fold: the data are split into k subsets of equal size. For each subset in turn, the model is fitted on the other k - 1 subsets and the loss function is then computed using that subset. Loss estimates for each of the k subsets are then combined to give an overall loss for the data.

  • custom-folds: the data are manually partitioned by the user into subsets, which are then used as in k-fold cross-validation. Subsets are not constrained to have the same size, and every observation must be assigned to one subset.

  • hold-out: k subsamples of size m are sampled independently without replacement from the data. For each subsample, the model is fitted on the remaining length(response) - m observations and the loss function is computed on the m observations in the subsample. The overall loss estimate is the average of the k loss estimates from the subsamples.
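The k-fold splitting described above can be sketched in base R. This is only an illustration of the scheme, not fairml's internal code; n and k are arbitrary values chosen for the example:

```r
# Sketch of k-fold splitting: assign each of n observations to one of k folds.
n <- 100
k <- 10
# shuffle the observation indexes and cut them into k equally-sized groups
fold.id <- sample(rep(seq_len(k), length.out = n))
folds <- split(seq_len(n), fold.id)
# each fold in turn acts as the validation set; the other k - 1 folds are
# pooled together to form the training set
validation <- folds[[1]]
train <- unlist(folds[-1], use.names = FALSE)
```

Every observation lands in exactly one fold, so the training and validation sets are always disjoint.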

Cross-validation methods accept the following optional arguments:

  • k: a positive integer number, the number of groups into which the data will be split (in k-fold cross-validation) or the number of times the data will be split in training and test samples (in hold-out cross-validation).

  • m: a positive integer number, the size of the test set in hold-out cross-validation.

  • runs: a positive integer number, the number of times k-fold or hold-out cross-validation will be run.

  • folds: a list in which each element corresponds to one fold and contains the indices of the observations assigned to that fold; or a list with one element per run, in which each element is itself a list of the folds to be used for that run.
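A folds argument in either form can be built by hand in base R; the sketch below uses a made-up sample size of n = 50 split into 5 folds:

```r
# Sketch of building the folds argument manually (n is a made-up sample size).
n <- 50
# single run: a flat list of folds that covers every observation exactly once
single.run <- split(sample(seq_len(n)), rep(1:5, length.out = n))
# multiple runs: a list with one element per run, each itself a list of folds
multiple.runs <- list(run1 = split(sample(seq_len(n)), rep(1:5, length.out = n)),
                      run2 = split(sample(seq_len(n)), rep(1:5, length.out = n)))
```

The single.run list could then be passed as folds = single.run together with method = "custom-folds".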

If cross-validation is used with multiple runs, the overall loss is the average of the loss estimates from the different runs.

The predictive performance of the models is measured using the mean square error as the loss function.
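Written out directly, the mean square error is the average squared difference between observed and fitted values; with made-up numbers:

```r
# mean square error between observed and fitted values (made-up data)
observed <- c(1.0, 2.0, 3.0)
fitted.values <- c(1.1, 1.9, 3.2)
mse <- mean((observed - fitted.values)^2)
mse  # 0.02
```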

Examples

# vu.test is assumed to be a data set with a numeric response (gaussian),
# predictors (X) and sensitive attributes (S).
kcv = fairml.cv(response = vu.test$gaussian, predictors = vu.test$X,
        sensitive = vu.test$S, unfairness = 0.10, model = "nclm",
        method = "k-fold", k = 10, runs = 10)
kcv
cv.loss(kcv)
cv.unfairness(kcv)

# run a second cross-validation with the same folds.
fairml.cv(response = vu.test$gaussian, predictors = vu.test$X,
        sensitive = vu.test$S, unfairness = 0.10, model = "nclm",
        method = "custom-folds", folds = cv.folds(kcv))

# run cross-validation in parallel.
if (FALSE) {
library(parallel)
cl = makeCluster(2)
fairml.cv(response = vu.test$gaussian, predictors = vu.test$X,
  sensitive = vu.test$S, unfairness = 0.10, model = "nclm",
  method = "k-fold", k = 10, runs = 10, cluster = cl)
stopCluster(cl)
}
