
sparcl (version 1.0.3)

HierarchicalSparseCluster: Hierarchical sparse clustering

Description

Performs sparse hierarchical clustering. If $d_ii'j$ is the dissimilarity between observations i and i' for feature j, the function seeks a sparse weight vector w and then uses $(sum_j (d_ii'j w_j))_ii'$ as an nxn dissimilarity matrix for hierarchical clustering.

Usage

HierarchicalSparseCluster(x=NULL, dists=NULL,
  method=c("average","complete","single","centroid"),
  wbound=NULL, niter=15,
  dissimilarity=c("squared.distance","absolute.value"),
  uorth=NULL, silent=FALSE, cluster.features=FALSE,
  method.features=c("average","complete","single","centroid"),
  output.cluster.files=FALSE, outputfile.prefix="output",
  genenames=NULL, genedesc=NULL, standardize.arrays=FALSE)
"print"(x,...)
"plot"(x,...)

Arguments

x
An nxp data matrix; n is the number of observations and p the number of features. If NULL, then specify dists instead.
dists
For advanced users, can be entered instead of x. If HierarchicalSparseCluster has already been run on this data, then the dists value of the previous output can be entered here. Under normal circumstances, leave this argument NULL and pass in x instead.
method
The type of linkage to use in the hierarchical clustering - "single", "complete", "centroid", or "average".
wbound
The L1 bound on w to use; this is the tuning parameter for sparse hierarchical clustering. Should be greater than 1.
niter
The number of iterations to perform in the sparse hierarchical clustering algorithm.
dissimilarity
The type of dissimilarity measure to use. One of "squared.distance" or "absolute.value". Only use this if x was passed in (rather than dists).
uorth
If complementary sparse clustering is desired, then this is the nxn dissimilarity matrix obtained in the original sparse clustering.
standardize.arrays
Should the arrays be standardized? Default is FALSE.
silent
If TRUE, progress messages are suppressed. Default is FALSE, so progress is printed.
cluster.features
Not for use.
method.features
Not for use.
output.cluster.files
Not for use.
outputfile.prefix
Not for use.
genenames
Not for use.
genedesc
Not for use.
...
Not used.
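
To see why wbound is expected to exceed 1, consider the constraints listed under Details: w is non-negative with L2 norm at most 1. The base R sketch below is an illustration only, not sparcl code, and its variable and helper names are made up here. It computes the L1 norm at the two extremes of a unit-L2-norm weight vector, which suggests that informative values of wbound lie between 1 and sqrt(p).

```r
# Illustration only (not part of sparcl): L1 norm of a non-negative
# weight vector whose L2 norm equals 1. All names here are hypothetical.
p <- 50

l1 <- function(w) sum(abs(w))
l2 <- function(w) sqrt(sum(w^2))

# All weight on a single feature: the sparsest vector with L2 norm 1.
w_sparse <- c(1, rep(0, p - 1))

# Equal weight on every feature: the densest vector with L2 norm 1.
w_dense <- rep(1 / sqrt(p), p)

# The L1 norm ranges from 1 (one feature) up to sqrt(p) (all features).
c(sparse = l1(w_sparse), dense = l1(w_dense), sqrt_p = sqrt(p))
```

Under this reading, a wbound close to 1 pushes toward a single selected feature, while a wbound of sqrt(p) or more leaves the L1 constraint inactive.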

Value

hc
The output of a call to "hclust", giving the results of hierarchical sparse clustering.
ws
The p-vector of feature weights.
u
The nxn dissimilarity matrix passed into hclust, of the form $(sum_j w_j d_ii'j)_ii'$.
dists
The (n*n)xp dissimilarity matrix for the data matrix x. This is useful if additional calls to HierarchicalSparseCluster will be made.
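
As a rough picture of how dists, ws and u relate, the following base R sketch builds an (n*n)xp matrix of per-feature dissimilarities, applies a weight vector, and reshapes the result into an nxn matrix for hclust. It does not call sparcl. The row ordering of the pairs, the toy weight vector, and all variable names are assumptions made for illustration, and the two formulas are the natural reading of "squared.distance" and "absolute.value".

```r
# Sketch in base R; does not call sparcl. The row ordering of the
# (i, i') pairs is an assumption and may differ from the package's.
set.seed(1)
n <- 6
p <- 4
x <- matrix(rnorm(n * p), nrow = n)

# One row per ordered pair (i, i'), one column per feature j.
pairs <- expand.grid(i = seq_len(n), iprime = seq_len(n))
diffs <- x[pairs$i, , drop = FALSE] - x[pairs$iprime, , drop = FALSE]
d_sq  <- diffs^2      # reading of "squared.distance"
d_abs <- abs(diffs)   # reading of "absolute.value"

# A hypothetical sparse weight vector: only features 1 and 2 are used.
w <- c(0.8, 0.6, 0, 0)

# Weighted sum over features, reshaped to an n x n dissimilarity matrix.
# expand.grid varies i fastest, matching R's column-major matrix fill.
u <- matrix(d_sq %*% w, nrow = n, ncol = n)

# Hierarchical clustering on the weighted dissimilarities.
hc <- hclust(as.dist(u), method = "complete")
```

Because the per-feature dissimilarities do not depend on w, they can be computed once and reused, which is why passing a previously returned dists avoids repeated work.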

Details

We seek a p-vector of weights w (one per feature) and an nxn matrix U that optimize

$maximize_U,w sum_j w_j sum_ii' d_ii'j U_ii'$ subject to $||w||_2 <= 1, ||w||_1 <= wbound, w_j >= 0, sum_ii' U_ii'^2 <= 1$.

Here, $d_ii'j$ is the dissimilarity between observations i and i' along feature j. The resulting matrix U is used as a dissimilarity matrix for hierarchical clustering. "wbound" is the tuning parameter for this method: it sets the L1 bound on w and therefore controls the number of features with non-zero weights $w_j$. The non-zero elements of w indicate the features that are used in the sparse clustering.

We optimize the above criterion with an iterative approach: hold U fixed and optimize with respect to w, then hold w fixed and optimize with respect to U. These two steps are repeated for niter iterations.
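
The alternating scheme can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the sparcl source: the helper names (soft, l2n, update_w, sparse_weights), the bisection search for the threshold, and the equal-weight starting point are all choices made here. With w fixed, the criterion is linear in U, so the best unit-norm U is proportional to the weighted dissimilarities. With U fixed, the w step is solved here by soft-thresholding followed by rescaling to unit L2 norm.

```r
# Sketch of the alternating optimization; not the sparcl implementation.
# All helper names and the bisection details are assumptions.
l2n  <- function(v) sqrt(sum(v^2))
soft <- function(a, delta) pmax(a - delta, 0)  # soft-threshold non-negative a

# w step for fixed U: maximize sum_j w_j a_j subject to
# ||w||_2 <= 1, ||w||_1 <= wbound, w_j >= 0.
# Assumes wbound > 1 and no exact tie at the maximum of a.
update_w <- function(a, wbound, tol = 1e-8) {
  a <- pmax(a, 0)
  w <- a / l2n(a)
  if (sum(w) <= wbound) return(w)              # L1 constraint inactive
  lo <- 0
  hi <- max(a)
  while (hi - lo > tol) {                      # bisection on the threshold
    mid <- (lo + hi) / 2
    s <- soft(a, mid)
    if (sum(s) / l2n(s) > wbound) lo <- mid else hi <- mid
  }
  s <- soft(a, hi)
  s / l2n(s)
}

# d is an (n*n) x p matrix of per-feature dissimilarities.
sparse_weights <- function(d, wbound, niter = 15) {
  p <- ncol(d)
  w <- rep(1 / sqrt(p), p)                     # start from equal weights
  for (it in seq_len(niter)) {
    u <- d %*% w                               # U step: proportional to d w
    u <- u / l2n(u)
    w <- update_w(drop(crossprod(d, u)), wbound)  # w step on a = d' u
  }
  list(ws = w, u = matrix(d %*% w, nrow = sqrt(nrow(d))))
}

# Toy data: only features 1 to 3 separate the two groups.
set.seed(1)
n <- 20
p <- 10
x <- matrix(rnorm(n * p), nrow = n)
x[1:10, 1:3] <- x[1:10, 1:3] + 3
pairs <- expand.grid(i = seq_len(n), iprime = seq_len(n))
d <- (x[pairs$i, ] - x[pairs$iprime, ])^2      # squared differences per feature
fit <- sparse_weights(d, wbound = 2)
round(fit$ws, 3)
```

In this toy setup one would expect most of the weight to fall on the shifted features, with many of the remaining weights thresholded to exactly zero.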

Note that the arguments described as "Not for use" are included for the sparcl package to function with GenePattern but should be ignored by the R user.

References

Witten and Tibshirani (2009) A framework for feature selection in clustering.

See Also

HierarchicalSparseCluster.permute, KMeansSparseCluster, KMeansSparseCluster.permute

Examples

  # Generate 2-class data
  set.seed(1)
  x <- matrix(rnorm(100*50),ncol=50)
  y <- c(rep(1,50),rep(2,50))
  x[y==1,1:25] <- x[y==1,1:25]+2
  # Do tuning parameter selection for sparse hierarchical clustering
  perm.out <- HierarchicalSparseCluster.permute(x, wbounds=c(1.5,2:6),
nperms=5)
  print(perm.out)
  plot(perm.out)
  # Perform sparse hierarchical clustering
  sparsehc <- HierarchicalSparseCluster(dists=perm.out$dists,
wbound=perm.out$bestw, method="complete")
  # faster than   sparsehc <- HierarchicalSparseCluster(x=x,wbound=perm.out$bestw, method="complete")
  par(mfrow=c(1,2))
  plot(sparsehc)
  plot(sparsehc$hc, labels=rep("", length(y)))
  print(sparsehc)
  # Plot using knowledge of class labels in order to compare true class
  #   labels to clustering obtained
  par(mfrow=c(1,1))
  ColorDendrogram(sparsehc$hc,y=y,main="My Simulated Data",branchlength=.007)
  # Now, what if we want to see if our data contains a *secondary*
  #   clustering after accounting for the first one obtained. We
  #   look for a complementary sparse clustering:
  sparsehc.comp <- HierarchicalSparseCluster(x,wbound=perm.out$bestw,
     method="complete",uorth=sparsehc$u)
  # Redo the analysis, but this time use "absolute value" dissimilarity:
  perm.out <- HierarchicalSparseCluster.permute(x, wbounds=c(1.5,2:6),
    nperms=5, dissimilarity="absolute.value")
  print(perm.out)
  plot(perm.out)
  # Perform sparse hierarchical clustering
  sparsehc <- HierarchicalSparseCluster(dists=perm.out$dists,
    wbound=perm.out$bestw, method="complete",
    dissimilarity="absolute.value")
  par(mfrow=c(1,2))
  plot(sparsehc)
