roc: ROC curve analysis

Description

Fits Receiver Operating Characteristic (ROC) curves to training set data. Used to determine the critical value of a dissimilarity coefficient that best discriminates between assemblage-types in palaeoecological data sets, whilst minimising the false positive error rate (FPF).

Usage

roc(object, groups, k = 1, ...)
# S3 method for default
roc(object, groups, k = 1, thin = FALSE,
    max.len = 10000, ...)
# S3 method for mat
roc(object, groups, k = 1, ...)
# S3 method for analog
roc(object, groups, k = 1, ...)

Value

A list with two components: (i) statistics, a summary of ROC statistics for each level of groups and a combined ROC analysis, and (ii) roc, a list of ROC objects, one per level of groups. For the latter, each ROC object is a list with the following components:

TPF: The true positive fraction.
FPE: The false positive error.
optimal: The optimal dissimilarity value, assessed where the slope of the ROC curve is maximal.
AUC: The area under the ROC curve.
se.fit: Standard error of the AUC estimate.
n.in: numeric; the number of samples within the current group.
n.out: numeric; the number of samples not in the current group.
p.value: The p-value of a Wilcoxon rank sum test on the two sets of dissimilarities. This is also known as a Mann-Whitney test.
roc.points: The unique dissimilarities at which the ROC curve was evaluated.
max.roc: numeric; the position along the ROC curve at which the slope of the ROC curve is maximal. This is the index of this point on the curve.
prior: numeric, length 2. Vector of observed prior probabilities of true analogue and true non-analogues in the group.
analogue: a list with components yes and no containing the dissimilarities for true analogue and true non-analogues in the group.

Arguments

object: an R object.
groups: a vector of group memberships, one entry per sample in the training set data. Can be a factor, and will be coerced to one if the supplied vector is not a factor.
k: numeric; the k closest analogues to use to calculate ROC curves.
thin: logical; should the points on the ROC curve be thinned? See Details, below.
max.len: numeric; length of analogue and non-analogue vectors. Used as the limit to which points on the ROC curve are thinned.
...: arguments passed to/from other methods.

Author

Gavin L. Simpson, based on code from Thomas Lumley to optimise the calculation of the ROC curve.

Details

A ROC curve is generated from the within-group and between-group dissimilarities.

For each level of the grouping vector (groups), the dissimilarities between each group member and its k closest analogues within that group are compared with the k closest dissimilarities between non-group members and group members.

If one is able to discriminate between members of different groups on the basis of assemblage dissimilarity, then the dissimilarities between samples within a group will be small compared to the dissimilarities between group members and non-group members.
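
The comparison described above can be illustrated with a minimal base R sketch. This is not the package's internal code: the two dissimilarity vectors are simulated, and the names within_d, between_d, cuts, tpf, fpf and auc are hypothetical. For each candidate critical value, the true positive fraction is the share of within-group dissimilarities at or below it, and the false positive fraction is the corresponding share of between-group dissimilarities. The AUC is obtained here from the Wilcoxon rank sum (Mann-Whitney) statistic.

```r
## Illustrative sketch only; simulated data, hypothetical object names
set.seed(42)
within_d  <- runif(100, 0, 0.6)  # assumed within-group dissimilarities
between_d <- runif(300, 0.3, 1)  # assumed between-group dissimilarities

## candidate critical values: every observed dissimilarity
cuts <- sort(unique(c(within_d, between_d)))

## fraction of each set falling at or below each candidate value
tpf <- sapply(cuts, function(ct) mean(within_d <= ct))
fpf <- sapply(cuts, function(ct) mean(between_d <= ct))

## W counts the pairs in which a between-group dissimilarity exceeds a
## within-group one; dividing by the number of pairs gives the AUC
wt  <- wilcox.test(between_d, within_d)
auc <- unname(wt$statistic) / (length(within_d) * length(between_d))

## plot the empirical ROC curve
plot(fpf, tpf, type = "s", xlab = "False positive fraction",
     ylab = "True positive fraction")
abline(0, 1, lty = "dashed")
```

An AUC close to 1 indicates that within-group dissimilarities are almost always smaller than between-group ones, whereas a value near 0.5 indicates no discrimination.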

thin is useful for large problems, where the number of analogue and non-analogue distances can conceivably be large and thus overflow the largest number R can work with. This option is also useful to speed up computations for large problems. If thin == TRUE, then the larger of the analogue or non-analogue distances is thinned to a maximum length of max.len. The smaller set of distances is scaled proportionally. In thinning, we approximate the distribution of distances by taking max.len (or a fraction of max.len for the smaller set of distances) equally-spaced probability quantiles of the distribution as a new set of distances.
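
As a rough illustration of the thinning idea, the following base R sketch replaces each set of distances with equally spaced probability quantiles. It is a sketch under stated assumptions rather than the package's implementation: the data are simulated, the helper thin_dists() and all object names are hypothetical, and the exact quantile definition and rounding used by roc() may differ.

```r
## Illustrative sketch only; not the analogue package's internal code
set.seed(1)
analogue_d     <- runif(50000, 0, 0.6)   # assumed analogue distances
non_analogue_d <- runif(200000, 0.3, 1)  # assumed non-analogue distances
max.len <- 10000

## hypothetical helper: approximate a distribution by `len` equally
## spaced probability quantiles
thin_dists <- function(d, len) {
  probs <- seq(0, 1, length.out = len)
  quantile(d, probs = probs, names = FALSE)
}

## the larger set is thinned to max.len; the smaller set is scaled in
## proportion to its original size
n_big        <- max(length(analogue_d), length(non_analogue_d))
len_analogue <- max(2, round(max.len * length(analogue_d) / n_big))
len_non      <- max(2, round(max.len * length(non_analogue_d) / n_big))

thin_analogue <- thin_dists(analogue_d, len_analogue)
thin_non      <- thin_dists(non_analogue_d, len_non)

c(length(thin_analogue), length(thin_non))  # 2500 10000
```

Because quantiles preserve the shape of each distribution, the thinned vectors should give a ROC curve close to the one computed from the full sets, at a fraction of the computational cost.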

References

Brown, C.D., and Davis, H.T. (2006) Receiver operating characteristics curves and related decision measures: A tutorial. Chemometrics and Intelligent Laboratory Systems 80, 24–38.

Gavin, D.G., Oswald, W.W., Wahl, E.R. and Williams, J.W. (2003) A statistical approach to evaluating distance metrics and analog assignments for pollen records. Quaternary Research 60, 356–367.

Henderson, A.R. (1993) Assessing test accuracy and its clinical consequences: a primer for receiver operating characteristic curve analysis. Annals of Clinical Biochemistry 30, 834–846.

Examples

## load the package and the example data
library(analogue)
data(swapdiat, swappH, rlgh)

## merge training and test set on columns
dat <- join(swapdiat, rlgh, verbose = TRUE)

## extract the merged data sets and convert to proportions
swapdiat <- dat[[1]] / 100
rlgh <- dat[[2]] / 100

## fit an analogue matching (AM) model using the squared chord distance
## measure - need to keep the training set dissimilarities
swap.ana <- analog(swapdiat, rlgh, method = "SQchord",
                   keep.train = TRUE)

## fit the ROC curve to the SWAP diatom data using the AM results
## Generate a grouping for the SWAP lakes
METHOD <- if (getRversion() < "3.1.0") {"ward"} else {"ward.D"}
clust <- hclust(as.dist(swap.ana$train), method = METHOD)
grps <- cutree(clust, 12)

## fit the ROC curve
swap.roc <- roc(swap.ana, groups = grps)
swap.roc

## draw the ROC curve
plot(swap.roc, 1)

## draw the four default diagnostic plots
layout(matrix(1:4, ncol = 2))
plot(swap.roc)
layout(1)
