wilma: Supervised Clustering of Predictor Variables

Description

Performs supervised clustering of predictor variables for large datasets such as microarray gene expression data. The algorithm uses a greedy forward strategy and finds the clusters by optimizing a combination of the Wilcoxon and Margin statistics.

Usage

wilma(x, y, noc, genes = NULL, flip = TRUE, once.per.clust = FALSE, trace = 0)

Arguments

x

Numeric matrix of explanatory variables (\(p\) variables in columns, \(n\) cases in rows). For example, these can be the microarray gene expression data that are to be clustered.

y

Numeric vector of length \(n\) containing the class labels of the individuals. These labels must be coded as 0 and 1.

noc

Integer, the number of clusters that should be searched for on the data.

genes

Defaults to NULL. An optional list (of length noc) of vectors containing the indices (column numbers) of the previously known initial clusters.

flip

Logical, defaults to TRUE. Indicates whether the clustering should be done with or without sign-flipping.

once.per.clust

Logical, defaults to FALSE. Indicates whether each variable (gene) is allowed to enter each cluster only once; equivalently, the cluster mean profile has only weights \(\pm 1\) for each variable.

trace

Integer >= 0; when positive, the output of the internal loops is provided; trace >= 2 provides output even from the internal C routines.
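The input requirements above can be checked before calling wilma. The following is a minimal sketch in base R: the helper name check_wilma_input is hypothetical and not part of supclust, and its checks only mirror the constraints stated in this section (they are not the validation wilma itself performs).

```r
# Hypothetical helper (NOT part of supclust): checks inputs against the
# constraints described in the Arguments section.
check_wilma_input <- function(x, y, noc, genes = NULL) {
  # x: numeric matrix, n cases in rows, p variables in columns
  stopifnot(is.matrix(x), is.numeric(x))
  # y: one label per case, coded as 0 and 1
  stopifnot(length(y) == nrow(x), all(y %in% c(0, 1)))
  # noc: a single positive whole number
  stopifnot(length(noc) == 1, noc >= 1, noc == round(noc))
  # genes: optional list of length noc holding column indices of x
  if (!is.null(genes)) {
    stopifnot(is.list(genes), length(genes) == noc)
    idx <- unlist(genes)
    stopifnot(all(idx >= 1), all(idx <= ncol(x)))
  }
  invisible(TRUE)
}

# Toy data: 20 cases, 10 variables, two classes
set.seed(1)
x <- matrix(rnorm(20 * 10), nrow = 20, ncol = 10)
y <- rep(c(0, 1), each = 10)

# Two previously known initial clusters, given as column numbers
genes <- list(c(1, 4), c(2, 7, 9))
check_wilma_input(x, y, noc = 2, genes = genes)
```

If the checks pass, the same x, y, noc and genes could then be handed to wilma in the form given under Usage.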

Value

wilma returns an object of class "wilma". The functions print and summary give an overview of the clusters that have been found. The function plot yields a two-dimensional projection into the space of the first two clusters that wilma found. The generic function fitted returns the fitted values, which are the cluster representatives. Finally, predict classifies test data on the basis of Wilma's clusters with the nearest-neighbor rule, diagonal linear discriminant analysis, logistic regression, or aggregated trees.

An object of class "wilma" is a list containing:

clist

A list of length noc, containing integer vectors consisting of the indices (column numbers) of the variables (genes) that have been clustered.

steps

Numerical vector of length noc, showing the number of forward/backward cycles in the fitting process of each cluster.

y

Numeric vector of length \(n\) containing the class labels of the individuals, coded as 0 and 1.

x.means

A list of length noc, containing numerical matrices consisting of the cluster representatives after insertion of each variable.

noc

Integer, the number of clusters that were searched for on the data.

signs

Numerical vector of length \(p\), indicating whether the \(i\)th variable (gene) should be sign-flipped (-1) or not (+1).
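To illustrate how clist and signs relate to a cluster representative, here is a minimal sketch on toy data in base R. It assumes that a representative is the plain average of the sign-flipped columns in a cluster; the function cluster_mean and the toy objects are hypothetical stand-ins, and wilma may preprocess the data internally, so the result need not match fitted() exactly.

```r
# Toy data: 4 cases (rows), 5 variables (columns)
x <- matrix(c( 1,  2, -1,  0,  3,
               2,  0, -2,  1,  1,
              -1,  1,  1,  2,  0,
               0, -2,  2, -1,  2), nrow = 4, byrow = TRUE)

clist <- list(c(1, 3), c(2, 5))  # stand-in for the clist component
signs <- c(1, 1, -1, 1, 1)       # stand-in for the signs component

# Hypothetical helper: average of the sign-flipped columns of one cluster.
# ASSUMPTION: this is an illustration, not the package's internal code.
cluster_mean <- function(x, idx, signs) {
  flipped <- sweep(x[, idx, drop = FALSE], 2, signs[idx], `*`)
  rowMeans(flipped)
}

# One column of representatives per cluster (n rows, length(clist) columns)
reps <- sapply(clist, cluster_mean, x = x, signs = signs)
reps
```

A column index that appears more than once in a cluster simply receives a larger weight in the average, which is the situation once.per.clust = TRUE rules out.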

References

Marcel Dettling (2002) Supervised Clustering of Genes, see https://stat.ethz.ch/~dettling/supercluster.html

Marcel Dettling and Peter Bühlmann (2002). Supervised Clustering of Genes. Genome Biology, 3(12): research0069.1-0069.15, 10.1186/gb-2002-3-12-research0069.

Marcel Dettling and Peter Bühlmann (2004). Finding Predictive Gene Groups from Microarray Data. Journal of Multivariate Analysis 90, 106--131, 10.1016/j.jmva.2004.02.012.

Examples


# NOT RUN {
## Working with a "real" microarray dataset
data(leukemia, package="supclust")

## Generating random test data: 3 observations and 250 variables (genes)
set.seed(724)
xN <- matrix(rnorm(750), nrow = 3, ncol = 250)

## Fitting Wilma
fit  <- wilma(leukemia.x, leukemia.y, noc = 3, trace = 1)

## Working with the output
fit
summary(fit)
plot(fit)
fitted(fit)

## Fitted values and class predictions for the training data
predict(fit, type = "cla")
predict(fit, type = "fitt")

## Predicting fitted values and class labels for test data
predict(fit, newdata = xN)
predict(fit, newdata = xN, type = "cla", classifier = "nnr", noc = c(1,2,3))
predict(fit, newdata = xN, type = "cla", classifier = "dlda", noc = c(1,3))
predict(fit, newdata = xN, type = "cla", classifier = "logreg")
predict(fit, newdata = xN, type = "cla", classifier = "aggtrees")
# }
