createCV: Define Cross-Validation Groups

Description

Creates a vector or logical matrix that specifies the cross-validation groups.

Usage

createCV(STmodel, groups = 10, min.dist = 0.1,
    random = FALSE, subset = NA,
    option = c("all", "fixed", "comco", "snapshot", "home"),
    Icv.vector = TRUE)

Arguments

STmodel

Model object for which to determine cross-validation.

groups

Number of cross-validation groups; zero gives leave-one-out cross-validation.

min.dist

Minimum distance between locations for them to end up in separate groups. Points closer than min.dist will be forced into the same group. A high value for min.dist can result in fewer cross-validation groups than specified in groups.

random

If FALSE, repeated calls to the function return the same grouping; if TRUE, repeated calls give different CV-groupings. Using FALSE ensures that simulation studies are reproducible.

subset

A subset of locations for which to define the cross-validation setup. Only sites listed in subset are dropped from one of the cross-validation groups; in other words, sites not in subset are used for estimation and prediction in all cross-validation groups. This option is ignored if option!="all".

option

For internal MESA Air usage, see Details below.

Icv.vector

Attempt to return a vector instead of a matrix. If the same observation is in several groups a matrix will still be returned.
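
The effect of min.dist on the grouping can be illustrated with a small base R sketch. This is not the implementation used by createCV; it assumes hypothetical coordinates and uses single-linkage clustering as one possible way to force nearby sites into the same group.

```r
## Hypothetical coordinates (not from mesa.model); sites "a" and "b" are close.
coords <- cbind(x = c(0, 0.05, 1, 2, 3),
                y = c(0, 0,    1, 2, 3))
rownames(coords) <- c("a", "b", "c", "d", "e")
min.dist <- 0.1

## Single-linkage clusters cut at min.dist: sites closer than min.dist
## (directly or through a chain of close sites) end up in the same cluster.
cluster <- cutree(hclust(dist(coords), method = "single"), h = min.dist)

## Assign whole clusters, not individual sites, to CV-groups (round-robin).
groups <- 3
cv.group <- (cluster - 1) %% groups + 1
print(cv.group)
```

In this sketch a large min.dist merges more sites into each cluster, so the number of clusters can drop below groups, which is why fewer CV-groups than requested may result.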

Value

Returns a vector, with each element giving the CV-group (as an integer) of each observation; or a (number of observations)-by-(groups) logical matrix, where each column defines a cross-validation set with the TRUE values marking the observations to be left out.
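
The relation between the two return formats can be sketched in base R. The vector below is hypothetical toy data, and coding observations that belong to no group as 0 is an assumption of this sketch, not documented behaviour of createCV.

```r
## Hypothetical CV-group vector for six observations and three groups
I.vec <- c(1, 2, 3, 1, 2, 3)

## Vector -> logical matrix: column j is TRUE where the observation is in group j
I.mat <- outer(I.vec, seq_len(max(I.vec)), "==")
colnames(I.mat) <- paste0("group", seq_len(ncol(I.mat)))
print(I.mat)

## Matrix -> vector only works if no observation is in more than one group
stopifnot(all(rowSums(I.mat) <= 1))
I.back <- apply(I.mat, 1, function(x) if (any(x)) which(x) else 0L)
print(I.back)
```

An observation that appears in several columns has no single group number, which is the case where a matrix has to be kept.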

Details

The number of observations left out in each group can be rather uneven; the main goal of createCV is to create CV-groups that contain roughly the same number of locations, ignoring the number of observations at each location. If there are large differences in the number of observations at different locations, one could use the subset option to create different CV-groupings for different types of locations. If Icv.vector=FALSE, the groups can then be combined as I.final = I.1 | I.2 | I.3.
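
As a minimal sketch of the combination step, the matrices below are hypothetical stand-ins for the results of three createCV calls with different subset values and Icv.vector=FALSE; all must have the same dimensions.

```r
## Hypothetical logical CV matrices (6 observations, 2 groups), one per
## type of location; TRUE marks the observations left out in that group.
I.1 <- cbind(c(TRUE,  FALSE, FALSE, FALSE, FALSE, FALSE),
             c(FALSE, TRUE,  FALSE, FALSE, FALSE, FALSE))
I.2 <- cbind(c(FALSE, FALSE, TRUE,  FALSE, FALSE, FALSE),
             c(FALSE, FALSE, FALSE, TRUE,  FALSE, FALSE))
I.3 <- cbind(c(FALSE, FALSE, FALSE, FALSE, TRUE,  FALSE),
             c(FALSE, FALSE, FALSE, FALSE, FALSE, TRUE))

## Element-wise OR: an observation is left out of group j in the combined
## scheme if it is left out of group j in any of the separate schemes.
I.final <- I.1 | I.2 | I.3

## Number of observations left out in each combined CV-group
print(colSums(I.final))
```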

The option input determines which sites to include in the cross-validation. Possible options are "all", "fixed", "comco", "snapshot" and "home".

all: Uses all available sites, possibly subset according to subset. The sites will be grouped with sites separated by less than min.dist being put in the same CV-group.
fixed: Uses only sites that have STmodel$locations$type %in% c("AQS","FIXED"). Given the subsetting, the sites will be grouped as for all.
home: Uses only sites that have STmodel$locations$type %in% c("HOME"). Given the subsetting, the sites will be grouped as for all.
comco: Uses only sites that have STmodel$locations$type %in% c("COMCO"). The sites will be grouped together if they are from the same road gradient. The road gradients are identified from the names of the sites. With "?" denoting one or more letters and "#" denoting one or more digits, the names are expected to follow "?-?#?#" for random sites and "?-?#?#?" for the gradients (with all but the last letter being the same for the entire gradient).
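
The naming convention for the comco option can be sketched with regular expressions. The site names below are invented for illustration, and the pattern is one reading of the description above, not the code used inside createCV.

```r
## Hypothetical site names: one random site, one three-site gradient,
## and a second random site.
sites <- c("L-LA01N02", "L-LA03S04A", "L-LA03S04B", "L-LA03S04C", "R-RI05E06")

## Letters, hyphen, letters, digits, letters, digits, optional trailing letters
pattern <- "^[A-Za-z]+-[A-Za-z]+[0-9]+[A-Za-z]+[0-9]+[A-Za-z]*$"
stopifnot(all(grepl(pattern, sites)))

## Dropping the last letter of a gradient name gives a key shared by the
## whole gradient; names ending in a digit (random sites) are unchanged.
key <- sub("[A-Za-z]$", "", sites)
by.gradient <- split(sites, key)
print(by.gradient)
```

Sites sharing a key would then be kept in the same CV-group.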

Examples

# NOT RUN {
##load the data
data(mesa.model)

##create a matrix with the CV-schemes
I.cv <- createCV(mesa.model, groups=10)

##number of observations in each CV-group
table(I.cv)

##Which sites belong to which groups?
ID.cv <- sapply(split(mesa.model$obs$ID, I.cv),unique)
print(ID.cv)

##Note that the sites with distance 0.084<min.dist 
##are grouped together (in group 10).
mesa.model$D.beta[ID.cv[[10]], ID.cv[[10]]]

##Find out which location belongs to which cv group
I.col <- apply(sapply(ID.cv,function(x) mesa.model$locations$ID
               %in% x), 1, function(x) if(sum(x)==1) which(x) else 0)
names(I.col) <- mesa.model$locations$ID
print(I.col)

##Plot the locations, colour coded by CV-grouping
plot(mesa.model$locations$long, mesa.model$locations$lat,
     pch=23+floor(I.col/max(I.col)+.5), bg=I.col, 
     xlab="Longitude", ylab="Latitude")

###############################################################
## Using matrix representation of cross-validation structure ##
###############################################################

##create a matrix with the CV-schemes
I.cv <- createCV(mesa.model, groups=10, Icv.vector=FALSE)

##number of observations in each CV-group
colSums(I.cv)

##Which sites belong to which groups?
ID.cv <- apply(I.cv, 2, function(x){ unique(mesa.model$obs$ID[x]) })

##and then as above...
# }