clusterer: Cluster Analysis Verification

Description

Perform Cluster Analysis (CA) verification per Marzban and Sandgathe (2006).

Usage

clusterer(X, Y = NULL, ...)
# S3 method for default
clusterer(X, Y = NULL, ..., xloc = NULL, xyp = TRUE, threshold = 1e-08, 
    linkage.method = "complete", stand = TRUE, trans = "identity", 
    a = NULL, verbose = FALSE)
# S3 method for SpatialVx
clusterer(X, Y = NULL, ..., time.point = 1, obs = 1, model = 1, xyp = TRUE, 
    threshold = 1e-08, linkage.method = "complete", stand = TRUE, 
    trans = "identity", verbose = FALSE)
# S3 method for clusterer
plot(x, ..., mfrow = c(1, 2), col = c("gray", tim.colors(64)), 
    horizontal = FALSE)
# S3 method for summary.clusterer
plot(x, ...)
# S3 method for clusterer
print(x, ...)
# S3 method for clusterer
summary(object, ...)

Value

A list object of class “clusterer” is returned with components:

linkage.method: character vector of length one or two giving the linkage method as passed into the function. The length is two only if the McQuitty method is chosen, in which case this method is used for the CA, but not for the inter-cluster differences across fields (average is used for that instead).
trans: character naming the transformation function applied to the intensities.
N: numeric giving the size of the fields.
threshold: numeric of length two giving the threshold applied to each field.
NCo,NCf: numeric vectors giving the number of clusters at each iteration of the CA for the verification and forecast fields, resp.
cluster.identifiers: a list with components X and Y giving lists of lists identifying specific CA components at each level of the CA for both fields.
idX,idY: logical vectors describing which grid points were included in the CA for each field (i.e., which grid points were >= threshold and had non-missing values).
cluster.objects: a list with components X and Y giving the objects returned by hclust for each field.
inter.cluster.dist: a list of list objects with NCf by NCo matrix components giving the inter-cluster distances (between verification and forecast fields) for each iteration of CA for each field.
min.intercluster.dists: numeric vector giving the minimum values of inter.cluster.dist at each iteration. Used to determine the cut-off for matched objects.

The summary method function returns a list with the same components as above, but also the components:

cutoff: The cut-off value used for determining matches.
csi,AvgErr: NCo by NCf numeric matrices giving the critical success index (CSI) and average inter-cluster error (distance), resp., based on matched/un-matched objects.
HMF: NCo by NCf by 3 array giving the hits, misses and false alarms based on matched/un-matched objects.

If the argument a is not NULL, then its elements are returned as attributes of the returned object. In the case of “SpatialVx” objects, the attributes are preserved.

plot and print methods do not return anything.
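As a rough illustration of how the csi component relates to HMF, the sketch below computes CSI as hits / (hits + misses + false alarms) for every combination of cluster numbers. The array values are hypothetical, and the ordering of the third dimension (hits, misses, false alarms) is assumed from the description above; this is not output from the package.

```r
# Hypothetical HMF-style array: NCo = 2 by NCf = 2 by 3.
# Third dimension is assumed to be ordered: hits, misses, false alarms.
HMF <- array(c(1, 2, 1, 2,    # hits
               0, 0, 1, 1,    # misses
               0, 1, 0, 1),   # false alarms
             dim = c(2, 2, 3))

# CSI = hits / (hits + misses + false alarms), element by element.
csi <- HMF[, , 1] / apply(HMF, c(1, 2), sum)
print(csi)
```

A CSI of 1 means every object was matched, with no misses or false alarms; values closer to 0 indicate poorer agreement between the fields.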

Arguments

X,Y

For the clusterer default method, these are m by n matrices giving the verification and forecast fields, resp.

For the “SpatialVx” method function, X is an object of class “SpatialVx” and Y is not used (a warning is given if it is not missing and not NULL).

object,x

list object of class “clusterer” as returned by clusterer (or summary.clusterer in the case of plot.summary.clusterer).

xloc

(optional) numeric mn by 2 matrix giving the gridpoint locations. If NULL, this will be created using 1:m and 1:n.

xyp

logical, should the cluster analysis be performed on the locations and intensities (TRUE) or only the locations (FALSE)?

threshold

numeric of length one or two giving the threshold to apply to each field (>=). If length is two, the first value corresponds to the threshold for the verification field, and the second to the forecast field.

linkage.method

character naming a valid linkage method accepted by hclust.

stand

logical, should the data matrices consisting of xloc and each field first be standardized before performing cluster analysis?

trans

character naming a function to be applied to the field intensities before performing the CA. Only used if xyp is TRUE. Default applies no transformation.

time.point

numeric or character indicating which time point from the “SpatialVx” verification set to select for analysis.

obs, model

numeric indicating which observation/forecast model to select for the analysis.

a

(optional) list giving object attributes associated with a “SpatialVx” class object. The clusterer method for “SpatialVx” objects calls the default method function, and uses this argument to pass the attributes through to the final returned object, as well as to grab location information.

mfrow

mfrow parameter (see help file for par). If NULL, then the parameter is not re-set.

col

color vector for image plots of fields after applying the threshold(s).

horizontal

logical, should the image plot color legend be placed horizontally or vertically? Only for image plots of the fields.

verbose

logical, should progress information be printed to the screen?

...

optional arguments to the hclust function. In the case of the summary method function, these may be z and/or sigma, numeric values used to find the cut-off, given by median + z*sigma, for determining matched objects (see Marzban and Sandgathe 2006); the defaults are 1 for z and the standard deviation of the minimum inter-cluster distances for sigma. Also for the summary method function, silent is a logical indicating whether information should be printed to the screen (FALSE) or not (TRUE); the default is to print to the screen. In the case of the plot method functions, these are optional arguments to the summary method function.
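The cut-off rule median + z*sigma used by the summary method can be sketched in a few lines of base R. The helper match_cutoff and the distance values are hypothetical and only illustrate the formula with the documented defaults (z = 1, sigma equal to the standard deviation of the minimum inter-cluster distances); they are not the package's internal code.

```r
# Hypothetical minimum inter-cluster distances, one per CA iteration.
min.dists <- c(0.8, 1.1, 1.3, 2.0, 2.4, 3.7)

# Illustrative helper (not part of the package): median + z * sigma,
# with sigma defaulting to the standard deviation of the distances.
match_cutoff <- function(d, z = 1, sigma = sd(d)) {
    median(d) + z * sigma
}

cutoff <- match_cutoff(min.dists)                  # documented defaults
cutoff.strict <- match_cutoff(min.dists, z = 0.5)  # smaller z, lower cut-off
print(c(default = cutoff, strict = cutoff.strict))
```

A smaller z lowers the cut-off, so fewer cluster pairs would qualify as matches.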

Author

Eric Gilleland

Warning

Although some effort has been put into making the functions in this package as computationally efficient as possible, there is a lot of bookkeeping involved with this approach, and the current functions are probably not as efficient as they could be. In any case, they will likely be slow for large data sets. The function can work quickly on large fields if an adequately high threshold is used (e.g., if threshold=10 is used in place of 16 in the example below, the function is VERY slow). Performing the actual cluster analysis on each field is fast because the hclust function from the fastcluster package is used, which works very well. However, the bookkeeping after the CA is done employs a lot of loops within loops, which possibly can be made more efficient (and maybe someday will be), but for now...

If it is desired to simply look at the CA for the two fields, the function hclust from fastcluster can be used, which essentially replaces the hclust function from the stats package with a faster version, but otherwise operates the same as far as what is returned, etc., and the same method functions can be employed.
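A minimal sketch of looking at the CA for a single field directly is given below. The field is hypothetical, and the preprocessing (a >= threshold with missing values dropped, grid locations built from 1:m and 1:n, a log transformation, and standardization with scale) only mimics what the threshold, xloc, trans and stand arguments describe; it is an assumption about the general approach, not the package's exact code. The hclust called here is the one from stats; loading fastcluster first would substitute its faster version.

```r
# Hypothetical 8 by 8 field with two separate rain areas and one missing value.
m <- 8; n <- 8
X <- matrix(0, m, n)
X[1:3, 1:3] <- 10 + 1:9
X[6:8, 5:8] <- 20 + 1:12
X[2, 2] <- NA

threshold <- 5

# Grid locations built from 1:m and 1:n, in the column-major order of c(X).
xloc <- cbind(rep(1:m, n), rep(1:n, each = m))

# Grid points kept for the CA: non-missing and >= threshold.
idX <- !is.na(c(X)) & c(X) >= threshold

# Locations plus log-transformed intensities, then standardized.
dat <- scale(cbind(xloc[idX, , drop = FALSE], log(c(X)[idX])))

# Hierarchical clustering with complete linkage, then cut into two clusters.
fit <- hclust(dist(dat), method = "complete")
groups <- cutree(fit, k = 2)
table(groups)
```

Calling plot(fit) draws the dendrogram for the field.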

Details

This function performs cluster analysis (CA) on positive values from each of two fields in a verification set using the hclust function from package fastcluster. Inter-cluster distances are computed between each cluster of each field at every level of the CA. The function clusterer performs CA on both fields, and finds the inter-cluster distances across fields for every possible combination of objects at each iteration of each CA. The summary method function finishes the analysis by determining hits, misses and false alarms as well as the numbers of clusters. It also computes CSI for each number of cluster combinations. This is the verification approach described in Marzban and Sandgathe (2006).
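The cross-field step described above can be sketched for one level of each CA as follows. The point sets are hypothetical, and the average pairwise Euclidean distance is used purely as an illustrative choice of inter-cluster distance; the helper avg_dist is not a package function. The result is an NCf by NCo matrix like the components of inter.cluster.dist.

```r
# Hypothetical (x, y) grid points above threshold in each field.
obs.pts  <- rbind(c(1, 1), c(1, 2), c(2, 1), c(8, 8), c(8, 9), c(9, 8))
fcst.pts <- rbind(c(2, 2), c(2, 3), c(3, 2), c(7, 9), c(8, 10))

# One level of each CA: two clusters per field.
obs.id  <- cutree(hclust(dist(obs.pts)),  k = 2)
fcst.id <- cutree(hclust(dist(fcst.pts)), k = 2)

# Illustrative helper: average pairwise distance between two point sets.
avg_dist <- function(A, B) {
    d <- as.matrix(dist(rbind(A, B)))
    mean(d[seq_len(nrow(A)), nrow(A) + seq_len(nrow(B)), drop = FALSE])
}

NCo <- max(obs.id)
NCf <- max(fcst.id)

# NCf by NCo matrix of inter-cluster distances across the two fields.
icd <- matrix(NA_real_, NCf, NCo)
for (i in seq_len(NCf)) {
    for (j in seq_len(NCo)) {
        icd[i, j] <- avg_dist(fcst.pts[fcst.id == i, , drop = FALSE],
                              obs.pts[obs.id == j, , drop = FALSE])
    }
}

print(icd)
min(icd)   # analogous to one entry of min.intercluster.dists
```

Repeating this for every combination of levels in the two CAs is what makes the bookkeeping expensive for large fields.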

The plot method function creates a 4 by 2 panel of plots. The top two plots give image plots of the verification and forecast fields with grid points below the threshold(s) showing zero. The next two plots are dendrograms as performed by the plot method function for hclust (dendrogram) objects. The next row gives a histogram of the minimum inter-cluster distances, then box plots showing the hits, misses and false alarms for every possible combination of levels of each CA. Finally, the bottom two plots show, for each combination of CA level (i.e., numbers of clusters), the CSI and average error (inter-cluster distance) for all matched objects. These last three plots are the ones made by the plot method for values returned from the summary method function.

print is currently not very useful here, but it prevents printing a big mess to the screen.

References

Marzban, C. and Sandgathe, S. (2006) Cluster analysis for verification of precipitation fields. Wea. Forecasting, 21, 824-838.

Examples

# clusterer and the UK example data sets come from the SpatialVx package.
library( "SpatialVx" )

data( "UKobs6" )
data( "UKfcst6" )
look <- clusterer(X=UKobs6, Y=UKfcst6, threshold=16, trans="log", verbose=TRUE)
plot( look )

if (FALSE) {
data( "UKloc" )

# Now, do the same thing, but using a "SpatialVx" object.
hold <- make.SpatialVx( UKobs6, UKfcst6, loc = UKloc, map = TRUE,
    field.type = "Rainfall", units = "mm/h",
    data.name = "Nimrod", obs.name = "obs 6", model.name = "fcst 6" )

look2 <- clusterer(hold, threshold=16, trans="log", verbose=TRUE)
plot( look2 )
# Note that values differ because now we're using the
# actual locations instead of integer indicators of
# positions.
}
