projectiveKMeans: Projective K-means (pre-)clustering of expression data

Description

Implementation of a variant of K-means clustering for expression data.

Usage

projectiveKMeans(
  datExpr, 
  preferredSize = 5000, 
  nCenters = as.integer(min(ncol(datExpr)/20, preferredSize^2/ncol(datExpr))),
  sizePenaltyPower = 4, 
  networkType = "unsigned",  
  randomSeed = 54321,
  checkData = TRUE,
  imputeMissing = TRUE,
  maxIterations = 1000, 
  verbose = 0, indent = 0)

Arguments

datExpr

expression data. A data frame in which columns are genes and rows are samples. Missing values (NAs) are allowed, but only in limited numbers.

preferredSize

preferred maximum size of clusters.

nCenters

number of initial clusters. Empirical evidence suggests that more centers will give a better preclustering; the default is an attempt to arrive at a reasonable number.
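As a quick illustration of how the default behaves, the expression from the Usage section can be evaluated for a few gene counts. The helper name defaultNCenters is hypothetical and exists only for this sketch; the formula itself is copied from the default argument above.

```r
# Hypothetical helper wrapping the default nCenters expression from Usage.
defaultNCenters <- function(nGenes, preferredSize = 5000) {
  as.integer(min(nGenes / 20, preferredSize^2 / nGenes))
}

# With 20000 genes: min(20000/20, 5000^2/20000) = min(1000, 1250)
defaultNCenters(20000)

# With 100000 genes: min(100000/20, 5000^2/100000) = min(5000, 250)
defaultNCenters(100000)
```

For moderately sized data the first term (one center per 20 genes) is the smaller one; for very large data the second term takes over and the default number of centers decreases as the number of genes grows.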

sizePenaltyPower

parameter specifying how severely clusters that exceed preferredSize are penalized.

networkType

network type. Allowed values are (unique abbreviations of) "unsigned", "signed", "signed hybrid". See adjacency.

randomSeed

integer to be used as seed for the random number generator before the function starts. If a current seed exists, it is saved and restored upon exit.

checkData

logical: should data be checked for genes with zero variance and for genes and samples with excessive numbers of missing values? Bad samples are ignored; the returned cluster assignment for bad genes will be NA.

imputeMissing

logical: should missing values in datExpr be imputed before the calculations start? The early imputation makes the code run faster but may produce slightly different results when re-running older calculations.

maxIterations

maximum number of iterations to be attempted.

verbose

integer level of verbosity. Zero means silent; higher values make the output progressively more verbose.

indent

indentation for diagnostic messages. Zero means no indentation, each unit adds two spaces.

Value

A list with the following components:

clusters

a numerical vector with one component per input gene, giving the number of the cluster to which the gene is assigned.

centers

cluster centers, that is, the first principal component of each cluster.

Details

The principal aim of this function within WGCNA is to pre-cluster a large number of genes into smaller blocks that can be handled using standard WGCNA techniques.

This function implements a variant of K-means clustering that is suitable for co-expression analysis. Cluster centers are defined by the first principal component, and distances by correlation (more precisely, 1-correlation). The distance between a gene and a cluster is multiplied by a factor of \(max(clusterSize/preferredSize, 1)^{sizePenaltyPower}\), thus penalizing clusters whose size exceeds preferredSize. The function starts with a randomly generated cluster assignment (hence the need to set the random seed for repeatability) and executes iterations of calculating new centers and reassigning genes to the nearest center until the clustering becomes stable. Before returning, nearby clusters are iteratively combined if their combined size is below preferredSize.
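The size penalty described above can be sketched in a few lines of base R. This is a minimal illustration, not the package's internal code: the helper names sizePenalty and penalizedDistance are hypothetical, the distance is taken as plain 1-correlation as stated in the text, and any handling of networkType is omitted.

```r
# Hypothetical helper: penalty factor max(clusterSize/preferredSize, 1)^sizePenaltyPower.
sizePenalty <- function(clusterSize, preferredSize = 5000, sizePenaltyPower = 4) {
  pmax(clusterSize / preferredSize, 1)^sizePenaltyPower
}

# Hypothetical helper: penalized distance from one gene to every cluster center.
# geneProfile: numeric vector over samples; centers: samples x clusters matrix.
penalizedDistance <- function(geneProfile, centers, clusterSizes,
                              preferredSize = 5000, sizePenaltyPower = 4) {
  r <- as.vector(cor(geneProfile, centers))
  (1 - r) * sizePenalty(clusterSizes, preferredSize, sizePenaltyPower)
}

# Clusters at or below preferredSize are not penalized; a cluster of twice
# the preferred size has its distances multiplied by 2^4 = 16.
sizePenalty(c(2500, 5000, 10000))

# Toy data: two centers over 10 samples and a gene close to the first center.
set.seed(1)
centers <- matrix(rnorm(20), nrow = 10, ncol = 2)
gene <- centers[, 1] + rnorm(10, sd = 0.1)
d <- penalizedDistance(gene, centers, clusterSizes = c(10000, 100))
which.min(d)  # index of the cluster the gene would be assigned to
```

Because the penalty multiplies the distance, a gene that is well correlated with an oversized cluster can still end up assigned to a smaller cluster whose center is somewhat further away.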

The standard principal component calculation via the function svd fails from time to time (likely a convergence problem of the underlying LAPACK functions). Such errors are trapped and the principal component is approximated by a weighted average of expression profiles in the cluster. If verbose is set above 2, an informational message is printed whenever this approximation is used.
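The error trapping described above can be sketched with tryCatch. This is only an illustration under stated assumptions: the helper clusterCenter and its svdFun argument are hypothetical, and the weights used in the fallback (correlation of each gene with the mean profile) are an assumption, since the text does not specify how the weighted average is formed.

```r
# Hypothetical helper: first principal component of one cluster, with a
# weighted-average fallback when the SVD fails.
# expr: samples x genes matrix without missing values or constant columns.
# svdFun: function used for the decomposition (injectable to simulate failures).
clusterCenter <- function(expr, svdFun = svd, verbose = 0) {
  scaled <- scale(expr)
  tryCatch(
    svdFun(scaled, nu = 1, nv = 0)$u[, 1],
    error = function(e) {
      if (verbose > 2) message("svd failed; using weighted average instead.")
      # Assumed weights: correlation of each gene with the mean profile.
      w <- as.vector(cor(scaled, rowMeans(scaled)))
      as.vector(scaled %*% w) / sum(abs(w))
    }
  )
}

set.seed(2)
expr <- matrix(rnorm(50), nrow = 10, ncol = 5)

# Normal path: first left singular vector of the scaled data.
pc <- clusterCenter(expr)

# Simulated failure: the fallback returns an approximate center instead.
approxCenter <- clusterCenter(expr, svdFun = function(...) stop("simulated failure"))
```

Both paths return one value per sample, so the rest of a clustering loop could use either result as the cluster center without special handling.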