Nearest centroid predictor for binary (i.e., two-outcome) data. Implements a number of options and improvements, such as accounting for within-class heterogeneity using sample networks, and various ways of feature selection and weighting.
nearestCentroidPredictor(
  # Input training and test data
  x, y,
  xtest = NULL,
  # Feature weights and selection criteria
  featureSignificance = NULL,
  assocFnc = "cor", assocOptions = "use = 'p'",
  assocCut.hi = NULL, assocCut.lo = NULL,
  nFeatures.hi = 10, nFeatures.lo = 10,
  weighFeaturesByAssociation = 0,
  scaleFeatureMean = TRUE, scaleFeatureVar = TRUE,
  # Predictor options
  centroidMethod = c("mean", "eigensample"),
  simFnc = "cor", simOptions = "use = 'p'",
  useQuantile = NULL,
  sampleWeights = NULL,
  weighSimByPrediction = 0,
  # What should be returned
  CVfold = 0, returnFactor = FALSE,
  # General options
  randomSeed = 12345,
  verbose = 2, indent = 0)
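For orientation, a minimal call might look like the following sketch (simulated data; the components of the returned list can be inspected with str):

  set.seed(1)
  nSamples = 60; nFeatures = 200
  x = matrix(rnorm(nSamples * nFeatures), nSamples, nFeatures)
  y = sample(c(1, 2), nSamples, replace = TRUE)
  # Make the first 20 features weakly class-dependent
  x[, 1:20] = x[, 1:20] + 0.5 * (y - 1.5)
  xtest = matrix(rnorm(20 * nFeatures), 20, nFeatures)
  pred = nearestCentroidPredictor(x, y, xtest = xtest,
                                  nFeatures.hi = 10, nFeatures.lo = 10)
  str(pred)   # inspect the returned list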
Training features (predictive variables). Each column corresponds to a feature and each row to an observation.
The response variable. Can be a single vector or a matrix with arbitrarily many columns. The number of rows (observations) must equal the number of rows (observations) in x.
Optional test set data. A matrix with the same number of columns (i.e., features) as x. If test set data are not given, only the prediction on training data will be returned.
Optional vector of feature significance for the response variable. If given, it is used for feature selection (see details). Should preferably be signed, that is, features can have high negative significance.
Character string specifying the association function. The association function should behave roughly as cor in that it takes two arguments (a matrix and a vector) plus options, and returns the vector of associations between the columns of the matrix and the vector. The associations may be signed (i.e., negative or positive).
Character string specifying options to the association function.
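For example, a custom association function based on biweight midcorrelation could be supplied through these two arguments (a sketch; bicorAssoc is a hypothetical name, not part of the package):

  bicorAssoc = function(x, y, use = "p") {
    # One signed association per column of x with the vector y
    as.vector(WGCNA::bicor(x, y, use = use))
  }
  # Supplied as assocFnc = "bicorAssoc", assocOptions = "use = 'p'"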
Association (or featureSignificance) threshold for including features in the predictor. Features with association higher than assocCut.hi will be included. If not given, the threshold method will not be used; instead, a fixed number of features will be included, as specified by nFeatures.hi and nFeatures.lo.
Association (or featureSignificance) threshold for including features in the predictor. Features with association lower than assocCut.lo will be included. If not given, defaults to -assocCut.hi. If assocCut.hi is NULL, the threshold method will not be used; instead, a fixed number of features will be included, as specified by nFeatures.hi and nFeatures.lo.
Number of highest-associated features (or features with highest featureSignificance) to include in the predictor. Only used if assocCut.hi is NULL.
Number of lowest-associated features (or features with lowest featureSignificance) to include in the predictor. Only used if assocCut.hi is NULL.
(Optional) power to downweigh features that are less associated with the response. See details.
Logical: should the training features be scaled to mean zero? Unless there are good reasons not to scale, the features should be scaled.
Logical: should the training features be scaled to unit variance? Again, unless there are good reasons not to scale, the features should be scaled.
One of "mean"
and "eigensample"
, specifies how the centroid should be calculated.
"mean"
takes the mean across all samples (or all samples within a sample module, if sample networks
are used), whereas "eigensample"
calculates the first principal component of the feature matrix and
uses that as the centroid.
Character string giving the similarity function for measuring the similarity between test samples and centroids. This function should behave roughly like the function cor in that it takes two arguments (x, y) and calculates the pair-wise similarities between the columns of x and y. For convenience, the value "dist" is treated specially: the Euclidean distance between the columns of x and y is calculated and its negative is returned (so that the smallest distance corresponds to the highest similarity). Since the values of this function are only used for ranking centroids, they are not restricted to be positive or to lie within certain bounds.
Character string specifying the options to the similarity function.
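For example, a custom similarity based on the negative squared Euclidean distance could be supplied here (a sketch; negSqDist is a hypothetical name, and simOptions would be left empty since the function takes no options):

  negSqDist = function(x, y) {
    # Pairwise negative squared distances between columns of x and y;
    # larger values mean more similar.
    -outer(colSums(x^2), colSums(y^2), "+") + 2 * crossprod(x, y)
  }
  # Supplied as simFnc = "negSqDist", simOptions = ""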
If non-NULL, the "nearest quantiloid" will be used instead of the nearest centroid. See details.
Optional specification of sample weights. Useful for example if one wants to explore boosting.
(Optional) power to downweigh features that are not well predicted between training and test sets. See details.
Non-negative integer specifying cross-validation. Zero means no cross-validation will be performed. Values above zero specify the number of samples to be considered test data for each step of cross-validation.
Logical: should a factor be returned?
Integer specifying the seed for the random number generator. If NULL, the seed will not be set. See set.seed.
Integer controlling how verbose the diagnostic messages should be. Zero means silent.
Indentation for the diagnostic messages. Zero means no indentation, each unit adds two spaces.
A list with the following components:
The back-substitution prediction in the training set.
Prediction in the test set.
A vector of feature significance calculated by assocFnc, or a copy of the input featureSignificance if the latter is non-NULL.
A vector giving the indices of the features that were selected for the predictor.
The representative profiles of each class (or cluster). Only returned if useQuantile is NULL.
A matrix of calculated similarities between the test samples and class/cluster centroids.
A vector of validation weights (see Details) for the selected features. If weighSimByPrediction is 0, a unit vector is used and returned.
Cross-validation prediction on the training data. Present only if CVfold is non-zero.
A list with two components (one per class). Each component is a vector of sample cluster labels for samples in the class.
The nearest centroid predictor works by forming a representative profile (centroid) across features for each class from the training data, then assigning each test sample to the class of the nearest representative profile. The representative profile can be formed either as the mean or as the first principal component ("eigensample"; this choice is governed by the option centroidMethod).
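As a conceptual sketch (not the package code), the two centroid types for a single class could be computed as follows:

  classSamples = x[y == 1, , drop = FALSE]   # training samples in class 1
  meanCentroid = colMeans(classSamples)      # centroidMethod = "mean"
  # centroidMethod = "eigensample": the first right singular vector of the
  # standardized class samples serves as the representative profile
  sv = svd(scale(classSamples), nu = 0, nv = 1)
  eigenCentroid = sv$v[, 1]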
When the number of features is large and only a small fraction is likely to be associated with the outcome, feature selection can be used to restrict the features that actually enter the centroid. Feature selection can be based either on the features' association with the outcome, calculated from the training data using assocFnc, or on user-supplied feature significance (e.g., derived from the literature; argument featureSignificance). In either case, features can be selected by high and low association thresholds, or by taking a fixed number of highest- and lowest-associated features.
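As an illustrative sketch of both selection modes (assuming cor as the association function and example cutoffs of 0.3 and -0.3):

  assoc = as.vector(cor(x, y, use = "p"))
  # Threshold-based selection (assocCut.hi = 0.3, assocCut.lo = -0.3):
  selected = which(assoc > 0.3 | assoc < -0.3)
  # Fixed-number alternative (nFeatures.hi = 10, nFeatures.lo = 10):
  ord = order(assoc)
  selected2 = c(head(ord, 10), tail(ord, 10))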
As an alternative to centroids, the predictor can also assign test samples based on a given quantile of the
distances from the training samples in each class (argument useQuantile
). This may be advantageous if
the samples in each class form irregular clusters. Note that setting useQuantile=0
(i.e., using
minimum distance in each class) essentially gives a nearest neighbor predictor: each test sample will be
assigned to the class of its nearest training neighbor.
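A conceptual sketch of this scoring (not the package code):

  # Quantile of distances from one test sample to the training samples of
  # one class; the sample is assigned to the class with the smaller score.
  quantileScore = function(testSample, classSamples, q) {
    d = sqrt(colSums((t(classSamples) - testSample)^2))
    quantile(d, probs = q)
  }
  # useQuantile = 0 gives the minimum distance, i.e., nearest neighbor.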
If features exhibit non-trivial correlations among themselves (such as, for example, in gene expression data), one can attempt to down-weigh features that do not exhibit the same correlation in the test set. This is done by using essentially the same predictor to predict _features_ from all other features in the test data (using the training data to train the feature predictor). Because the test features are known, the prediction accuracy can be evaluated. If a feature is predicted badly (meaning the error in the test set is much larger than the error in the cross-validation prediction in the training data), it may mean that its quality in the training or test data is low (for example, due to excessive noise or outliers). Such features can be downweighed using the argument weighSimByPrediction. The extra factor is min(1, ((root mean square cross-validation prediction error in the training data)/(root mean square prediction error in the test set))^weighSimByPrediction), that is, it is never bigger than 1.
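In code form, the extra factor could be written as follows (a sketch consistent with the description above; rmse.CV and rmse.test stand for the per-feature root mean square errors):

  validationWeight = function(rmse.CV, rmse.test, power)
    pmin(1, (rmse.CV / rmse.test)^power)   # never bigger than 1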
Unless the features' mean and variance can be ascribed a clear meaning, the (training) features should be scaled to mean 0 and variance 1 before the centroids are formed.
The function implements a basic option for removal of spurious effects in the training and test data, by removing a fixed number of leading principal components from the features. This sometimes leads to better prediction accuracy, but should be used with caution.
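A conceptual sketch of such removal (not the package interface; nRemove is an illustrative parameter):

  removeLeadingPCs = function(x, nRemove) {
    xs = scale(x)
    sv = svd(xs, nu = nRemove, nv = nRemove)
    # Subtract the rank-nRemove approximation spanned by the leading PCs
    xs - sv$u %*% (sv$d[1:nRemove] * t(sv$v))
  }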
If the samples within each class are heterogeneous, a single centroid may not represent each class well. This function can deal with within-class heterogeneity by clustering samples (separately in each class), then using one representative (mean, eigensample) or quantile for each cluster in each class to assign test samples. Various similarity measures, specified by adjFnc, can be used to construct the sample network adjacency. Similarly, the user can specify a clustering function using clusteringFnc. The requirements on the clustering function are described in a separate section below.
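As a conceptual sketch, the within-class clustering might proceed along these lines (illustrative choices: correlation-based dissimilarity and average-linkage hierarchical clustering):

  clusterWithinClass = function(classSamples, nClusters = 2) {
    # Sample-sample dissimilarity based on correlation across features
    diss = as.dist(1 - cor(t(classSamples)))
    cutree(hclust(diss, method = "average"), k = nClusters)
  }
  # Each cluster then contributes its own centroid (mean or eigensample),
  # and a test sample is assigned to the class of the nearest centroid.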