HierarchicalSparseCluster: Hierarchical sparse clustering

Description

Performs sparse hierarchical clustering. If $d_{ii'j}$ is the dissimilarity between observations i and i' for feature j, the method seeks a sparse weight vector w and then uses $(\sum_j d_{ii'j} w_j)_{ii'}$ as an nxn dissimilarity matrix for hierarchical clustering.

Usage

HierarchicalSparseCluster(x=NULL, dists=NULL,
  method=c("average","complete","single","centroid"),
  wbound=NULL, niter=15,
  dissimilarity=c("squared.distance","absolute.value"),
  uorth=NULL, silent=FALSE, cluster.features=FALSE,
  method.features=c("average","complete","single","centroid"),
  output.cluster.files=FALSE, outputfile.prefix="output",
  genenames=NULL, genedesc=NULL, standardize.arrays=FALSE)
"print"(x,...)
"plot"(x,...)

Arguments

x

An nxp data matrix; n is the number of observations and p the number of features. If NULL, then specify dists instead.

dists

For advanced users, can be entered instead of x. If HierarchicalSparseCluster has already been run on this data, then the dists value of the previous output can be entered here. Under normal circumstances, leave this argument NULL and pass in x instead.

method

The type of linkage to use in the hierarchical clustering - "single", "complete", "centroid", or "average".

wbound

The L1 bound on w to use; this is the tuning parameter for sparse hierarchical clustering. Should be greater than 1.

niter

The number of iterations to perform in the sparse hierarchical clustering algorithm.

dissimilarity

The type of dissimilarity measure to use. One of "squared.distance" or "absolute.value". Only use this if x was passed in (rather than dists).

uorth

If complementary sparse clustering is desired, then this is the nxn dissimilarity matrix obtained in the original sparse clustering.

standardize.arrays

Should the arrays be standardized? Default is FALSE.

silent

Should progress output be suppressed? Default is FALSE, in which case progress is printed.

cluster.features

Not for use.

method.features

Not for use.

output.cluster.files

Not for use.

outputfile.prefix

Not for use.

genenames

Not for use.

genedesc

Not for use.

...

Not used.

Value

hc: The output of a call to "hclust", giving the results of hierarchical sparse clustering.
ws: The p-vector of feature weights.
u: The nxn dissimilarity matrix passed into hclust, of the form $(\sum_j w_j d_{ii'j})_{ii'}$.
dists: The (n*n)xp dissimilarity matrix for the data matrix x. This is useful if additional calls to HierarchicalSparseCluster will be made.
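
The relationship between dists, ws and u can be illustrated with base R alone. The following is a minimal sketch, not the sparcl implementation: the toy data, the weight vector w and the name feature_dists are hypothetical, and the result is unscaled, whereas the u returned by the function may be normalized.

```r
# Toy data: n = 4 observations, p = 3 features (values are illustrative only).
x <- matrix(c(1, 2, 4, 7,
              0, 1, 0, 1,
              5, 5, 6, 6), nrow = 4)
n <- nrow(x)
p <- ncol(x)

# Per-feature squared-distance dissimilarities, one column per feature.
# Each column lists all n*n pairs (i, i'), so the result is (n*n) x p.
feature_dists <- sapply(seq_len(p), function(j) {
  as.vector(outer(x[, j], x[, j], "-")^2)
})

# A hypothetical sparse weight vector: feature 2 receives zero weight.
w <- c(0.8, 0, 0.6)

# Weighted sum over features, reshaped into an n x n dissimilarity matrix.
u <- matrix(feature_dists %*% w, nrow = n)

# hclust() expects a "dist" object, so convert before clustering.
hc <- hclust(as.dist(u), method = "complete")
```

Because the dissimilarities are symmetric in i and i', the reshaped matrix does not depend on whether the n*n pairs are listed row by row or column by column.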

Details

We seek a p-vector of weights w (one per feature) and an nxn matrix U that optimize

$\mathrm{maximize}_{U,w} \sum_j w_j \sum_{i,i'} d_{ii'j} U_{ii'}$ subject to $\|w\|_2 \le 1$, $\|w\|_1 \le \mathrm{wbound}$, $w_j \ge 0$, $\sum_{i,i'} U_{ii'}^2 \le 1$.

Here, $d_{ii'j}$ is the dissimilarity between observations i and i' along feature j. The resulting matrix U is used as a dissimilarity matrix for hierarchical clustering. "wbound" is a tuning parameter for this method: it sets the L1 bound on w and, as a result, controls the number of features with non-zero weights $w_j$. The non-zero elements of w indicate the features that are used in the sparse clustering.
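
To see why wbound controls sparsity, note that a vector with L2 norm at most 1 has L1 norm at most sqrt(p), so the L1 constraint only binds for wbound below sqrt(p). The sketch below uses hypothetical equal-weight vectors (not output from the package) to show how the L1 norm grows with the number of non-zero weights.

```r
p <- 50                       # number of features (illustrative)
k <- c(1, 4, 25, 50)          # number of non-zero weights to try

# Equal weights on k features, scaled so that the L2 norm is exactly 1.
l1_norm <- sapply(k, function(m) {
  w <- c(rep(1 / sqrt(m), m), rep(0, p - m))
  sum(abs(w))
})

# For these vectors the L1 norm equals sqrt(k).
cbind(nonzero = k, l1 = l1_norm)
```

A small wbound therefore leaves room for only a few features to carry weight, while a wbound near sqrt(p) allows all of them.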

We optimize the above criterion with an iterative approach: hold U fixed and optimize with respect to w. Then, hold w fixed and optimize with respect to U.
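
The alternating scheme can be sketched in base R. This is an illustrative reading of the criterion above, not the sparcl source: the helper names l2, soft and update_w are hypothetical, the threshold is found by a simple bisection, the complementary (uorth) case is not handled, and the sketch assumes the largest entry of a is unique.

```r
l2 <- function(v) sqrt(sum(v^2))
soft <- function(a, delta) pmax(a - delta, 0)   # soft-thresholding for a >= 0

# U fixed: maximize sum_j w_j a_j under the L2, L1 and non-negativity constraints.
update_w <- function(a, wbound) {
  w <- a / l2(a)
  if (sum(w) <= wbound) return(w)               # L1 constraint is inactive
  lo <- 0
  hi <- max(a)
  for (k in 1:50) {                             # bisection on the threshold
    mid <- (lo + hi) / 2
    w <- soft(a, mid)
    w <- w / l2(w)
    if (sum(w) > wbound) lo <- mid else hi <- mid
  }
  w <- soft(a, hi)
  w / l2(w)
}

set.seed(1)
n <- 20
p <- 10
x <- matrix(rnorm(n * p), ncol = p)
x[1:10, 1:3] <- x[1:10, 1:3] + 3                # only features 1 to 3 separate two groups

# (n*n) x p matrix of per-feature squared-distance dissimilarities.
D <- sapply(seq_len(p), function(j) as.vector(outer(x[, j], x[, j], "-")^2))

wbound <- 2
w <- rep(1 / sqrt(p), p)                        # start from equal weights
for (iter in 1:15) {
  u <- as.vector(D %*% w)                       # w fixed: U proportional to sum_j w_j d_ii'j
  u <- u / l2(u)
  a <- as.vector(crossprod(D, u))               # a_j = sum_ii' d_ii'j U_ii'
  w <- update_w(a, wbound)                      # U fixed: thresholded, rescaled a
}
U <- matrix(u, nrow = n)                        # n x n dissimilarity matrix
```

Each half-step has a closed form up to the choice of threshold, which is why a fixed, small number of iterations (niter) is typically enough in practice.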

Note that the arguments described as "Not for use" are included for the sparcl package to function with GenePattern but should be ignored by the R user.

References

Witten and Tibshirani (2009) A framework for feature selection in clustering.

Examples

  # Generate 2-class data
  set.seed(1)
  x <- matrix(rnorm(100*50),ncol=50)
  y <- c(rep(1,50),rep(2,50))
  x[y==1,1:25] <- x[y==1,1:25]+2
  # Do tuning parameter selection for sparse hierarchical clustering
  perm.out <- HierarchicalSparseCluster.permute(x, wbounds=c(1.5,2:6),
nperms=5)
  print(perm.out)
  plot(perm.out)
  # Perform sparse hierarchical clustering
  sparsehc <- HierarchicalSparseCluster(dists=perm.out$dists,
wbound=perm.out$bestw, method="complete")
  # faster than   sparsehc <- HierarchicalSparseCluster(x=x,wbound=perm.out$bestw, method="complete")
  par(mfrow=c(1,2))
  plot(sparsehc)
  plot(sparsehc$hc, labels=rep("", length(y)))
  print(sparsehc)
  # Plot using knowledge of class labels in order to compare true class
  #   labels to clustering obtained
  par(mfrow=c(1,1))
  ColorDendrogram(sparsehc$hc,y=y,main="My Simulated Data",branchlength=.007)
  # Now, what if we want to see whether our data contains a *secondary*
  #   clustering after accounting for the first one obtained? We
  #   look for a complementary sparse clustering:
  sparsehc.comp <- HierarchicalSparseCluster(x,wbound=perm.out$bestw,
     method="complete",uorth=sparsehc$u)
  # Redo the analysis, but this time use "absolute value" dissimilarity:
  perm.out <- HierarchicalSparseCluster.permute(x, wbounds=c(1.5,2:6),
    nperms=5, dissimilarity="absolute.value")
  print(perm.out)
  plot(perm.out)
  # Perform sparse hierarchical clustering
  sparsehc <- HierarchicalSparseCluster(dists=perm.out$dists, wbound=perm.out$bestw, method="complete", dissimilarity="absolute.value")
  par(mfrow=c(1,2))
  plot(sparsehc)
