nearestCentroidPredictor( # Input training and test data
x, y,
xtest = NULL,
# Feature weights and selection criteria
featureSignificance = NULL,
assocFnc = "cor", assocOptions = "use = 'p'",
assocCut.hi = NULL, assocCut.lo = NULL,
nFeatures.hi = 10, nFeatures.lo = 10,
weighFeaturesByAssociation = 0,
scaleFeatureMean = TRUE, scaleFeatureVar = TRUE,
# Predictor options
centroidMethod = c("mean", "eigensample"),
simFnc = "cor", simOptions = "use = 'p'",
useQuantile = NULL,
sampleWeights = NULL,
weighSimByPrediction = 0,
# What should be returned
CVfold = 0, returnFactor = FALSE,
# General options
randomSeed = 12345,
verbose = 2, indent = 0)
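For orientation, the core idea of the predictor can be sketched in a few lines of base R. This is our own simplified illustration, not the WGCNA implementation: the names below (simpleNearestCentroid, mu1, mu2) are hypothetical, class means serve as centroids (cf. centroidMethod = "mean"), correlation serves as the similarity (cf. simFnc = "cor"), and feature selection, scaling, and cross-validation are omitted.

```r
# Minimal sketch of a nearest centroid predictor (illustration only).
simpleNearestCentroid <- function(x, y, xtest) {
  classes <- sort(unique(y))
  # One centroid per class: the per-feature mean of that class's samples
  centroids <- sapply(classes, function(cl) colMeans(x[y == cl, , drop = FALSE]))
  # Correlate each test sample with each centroid; pick the most similar class
  sim <- cor(t(xtest), centroids)
  classes[max.col(sim)]
}

# Synthetic example: two classes with distinct feature patterns
set.seed(1)
mu1 <- rnorm(10, sd = 2)
mu2 <- rnorm(10, sd = 2)
x <- rbind(t(replicate(10, mu1 + rnorm(10, sd = 0.5))),
           t(replicate(10, mu2 + rnorm(10, sd = 0.5))))
y <- rep(c(1, 2), each = 10)
pred <- simpleNearestCentroid(x, y, x)  # back-substitution on training data
```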
Arguments:

x: Training features (predictive variables). Rows correspond to samples, columns to features.

y: The response variable; a binary vector of class labels, one per training sample.

xtest: Optional test set data, in a matrix with the same number of columns (features) as x. If test set data are not given, only the prediction on training data will be returned.

assocFnc: Character string specifying the function used to measure the association of each feature with the outcome. The association function should behave roughly as cor in that it takes two arguments (a matrix and a vector) plus options and returns the vector of associations between the columns of the matrix and the vector.

assocOptions: Character string specifying options to the association function.

assocCut.hi: Association (or featureSignificance) threshold for including features in the predictor. Features with association higher than assocCut.hi will be included. If not given, the threshold method will not be used; instead, a fixed number of features given by nFeatures.hi and nFeatures.lo will be included.

assocCut.lo: Association (or featureSignificance) threshold for including features in the predictor. Features with association lower than assocCut.lo will be included. If not given, defaults to -assocCut.hi. Only used if assocCut.hi is non-NULL.

nFeatures.hi: Number of features with the highest association (or featureSignificance) to include in the predictor. Only used if assocCut.hi is NULL.

nFeatures.lo: Number of features with the lowest association (or featureSignificance) to include in the predictor. Only used if assocCut.hi is NULL.

centroidMethod: One of "mean" and "eigensample"; specifies how the centroid should be calculated. "mean" takes the mean across all samples (or all samples within a sample module, if sample networks are used), whereas "eigensample" calculates the first principal component of the feature matrix and uses that as the centroid.
simFnc: Character string giving the function used to measure the similarity between test samples and centroids. This function should behave roughly as cor in that it takes two arguments (x, y) plus options and calculates the pair-wise similarities between the columns of x and y.

randomSeed: Integer specifying the seed for the random number generator. If NULL, the seed will not be set. See set.seed.
Value: A list with components including:

featureSignificance: A vector of feature significances calculated by assocFnc, or a copy of the input featureSignificance if the latter is non-NULL.

centroidProfile: The representative profile (centroid) of each class. Only returned if useQuantile is NULL.

featureValidationWeights: A vector of validation weights (see Details) for the selected features. If weighSimByPrediction is 0, a unit vector is used and returned.

CVpredicted: Cross-validation prediction on the training data. Only present if CVfold is non-zero.

Details:

The nearest centroid predictor works by forming a representative profile (centroid) of each class from the training samples and then assigning each test sample to the class of the nearest representative profile. The representative profile can be formed either as the mean or as the first principal component of the class samples ("eigensample"; this choice is governed by centroidMethod).

When the number of features is large and only a small fraction is likely to be associated with the outcome,
feature selection can be used to restrict the features that actually enter the centroid. Feature selection
can be based either on their association with the outcome
calculated from the training data using assocFnc
, or on user-supplied feature significance (e.g.,
derived from literature, argument
featureSignificance
). In either case, features can be selected by high and low association thresholds
or by taking a fixed number of highest- and lowest-associated features.
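The two selection modes just described can be sketched as follows (our own helper function, selectFeatures, with the association function fixed to cor for simplicity):

```r
# Select features either by association thresholds (assocCut.hi/.lo) or,
# when no threshold is given, by a fixed number of highest- and
# lowest-associated features (nFeatures.hi/.lo). Illustration only.
selectFeatures <- function(x, y, assocCut.hi = NULL, assocCut.lo = NULL,
                           nFeatures.hi = 10, nFeatures.lo = 10) {
  assoc <- as.vector(cor(x, y, use = "p"))  # association of each feature with y
  if (!is.null(assocCut.hi)) {
    if (is.null(assocCut.lo)) assocCut.lo <- -assocCut.hi
    which(assoc > assocCut.hi | assoc < assocCut.lo)    # threshold mode
  } else {
    ord <- order(assoc)                                 # fixed-number mode
    c(head(ord, nFeatures.lo), tail(ord, nFeatures.hi))
  }
}

# Feature 1 is strongly associated with the outcome; the rest are noise
set.seed(2)
y <- rnorm(40)
x <- cbind(y + rnorm(40, sd = 0.2), matrix(rnorm(40 * 49), 40, 49))
sel <- selectFeatures(x, y, assocCut.hi = 0.5)
```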
As an alternative to centroids, the predictor can also assign test samples based on a given quantile of the
distances from the training samples in each class (argument useQuantile
). This may be advantageous if
the samples in each class form irregular clusters. Note that setting useQuantile=0
(i.e., using
minimum distance in each class) essentially gives a nearest neighbor predictor: each test sample will be
assigned to the class of its nearest training neighbor.
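The quantile-based assignment can be sketched as follows (our own helper with a hypothetical name, quantileAssign, using Euclidean distances):

```r
# For each test sample, compute the chosen quantile of its distances to
# each class's training samples and assign the class with the smallest
# value; useQuantile = 0 takes the minimum, i.e., nearest neighbor.
quantileAssign <- function(x, y, xtest, useQuantile = 0) {
  classes <- sort(unique(y))
  apply(xtest, 1, function(s) {
    q <- sapply(classes, function(cl) {
      d <- sqrt(colSums((t(x[y == cl, , drop = FALSE]) - s)^2))
      quantile(d, useQuantile)
    })
    classes[which.min(q)]
  })
}

set.seed(3)
x <- rbind(matrix(rnorm(50, mean = 0), 10, 5),
           matrix(rnorm(50, mean = 3), 10, 5))
y <- rep(c("a", "b"), each = 10)
pred <- quantileAssign(x, y, x, useQuantile = 0)
```

With useQuantile = 0, each training sample is its own nearest neighbor (distance 0), so back-substitution reproduces the training labels exactly.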
If features exhibit non-trivial correlations among themselves (as, for example, in gene expression
data), one can attempt to down-weigh features that do not exhibit the same correlation in the test set.
This is done by using essentially the same predictor to predict _features_ from all other features in the
test data (using the training data to train the feature predictor). Because test features are known, the
prediction accuracy can be evaluated. If a feature is predicted badly (meaning the error in the test set is
much larger than the error in the cross-validation prediction in training data),
it may mean that its quality in the
training or test data is low (for example, due to excessive noise or outliers).
Such features can be down-weighed using the argument weighSimByPrediction. The extra factor is
min(1, (root mean square prediction error in test set)/(root mean square cross-validation prediction error in
the training data)^weighSimByPrediction); that is, the factor is never larger than 1.
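Read literally, the factor can be transcribed as below (predictionWeight is a hypothetical helper; the RMS errors would come from predicting each feature from the others, which is not reproduced here):

```r
# Literal transcription of the documented down-weighting factor:
# min(1, (RMS test error / RMS cross-validation error)^power)
predictionWeight <- function(rmseTest, rmseCV, weighSimByPrediction) {
  pmin(1, (rmseTest / rmseCV)^weighSimByPrediction)
}
```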
Unless the features' mean and variance can be ascribed clear meaning, the (training) features should be scaled to mean 0 and variance 1 before the centroids are formed.
The function implements a basic option for removal of spurious effects in the training and test data, by removing a fixed number of leading principal components from the features. This sometimes leads to better prediction accuracy but should be used with caution.
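These two preprocessing steps can be sketched in base R (our own illustration, not the package's internal code; the number of removed components, 2, is arbitrary here):

```r
# Scale features to mean 0 and variance 1, then remove 2 leading principal
# components as a crude adjustment for spurious global effects.
set.seed(4)
x <- matrix(rnorm(20 * 10), 20, 10)
xScaled <- scale(x)                    # per-feature standardization
s <- svd(xScaled, nu = 2, nv = 2)      # 2 leading components
xClean <- xScaled - s$u %*% diag(s$d[1:2]) %*% t(s$v)
```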
If samples within each class are heterogenous, a single centroid may not represent each class well. This
function can deal with within-class heterogeneity by clustering samples (separately in each class), then
using one representative (mean or eigensample) or quantile for each cluster in each class to assign test
samples. Various similarity measures, specified by adjFnc
, can be used to construct the sample network
adjacency. Similarly, the user can specify a clustering function using clusteringFnc
. The
requirements on the clustering function are described in a separate section below.
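A simplified sketch of the within-class clustering idea, with hclust on Euclidean distances standing in for the configurable adjFnc/clusteringFnc machinery (clusterCentroids and nClusters are hypothetical names):

```r
# Cluster samples separately within each class, then form one mean
# centroid per cluster; test samples would be compared to all of these.
clusterCentroids <- function(x, y, nClusters = 2) {
  lapply(split(seq_along(y), y), function(idx) {
    sub <- x[idx, , drop = FALSE]
    cl <- cutree(hclust(dist(sub)), k = nClusters)
    t(sapply(split(seq_len(nrow(sub)), cl),
             function(i) colMeans(sub[i, , drop = FALSE])))
  })
}

set.seed(5)
x <- matrix(rnorm(40 * 6), 40, 6)
y <- rep(c(1, 2), each = 20)
cents <- clusterCentroids(x, y, nClusters = 2)  # 2 centroids per class
```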
See also: votingLinearPredictor.