These functions provide an interface to several clustering methods
implemented in R, for use together with the cluster stability
assessment in clusterboot (as parameter clustermethod; "CBI" stands
for "clusterboot interface"). In some situations it can make sense to
use them to compute a clustering even if you don't want to run
clusterboot, because some of the functions offer additional features
(e.g., normal mixture model based clustering of dissimilarity matrices
projected into Euclidean space by MDS, partitioning around medoids
with an estimated number of clusters, and noise/outlier identification
in hierarchical clustering).
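The two ways of using an interface function described above can be sketched as follows. This is only an illustration under assumptions: it assumes the fpc package is installed, the simulated data and the choices B = 20 and krange = 2 are hypothetical, and the bootmean component is taken from the clusterboot documentation rather than from this page.

```r
library(fpc)

# Hypothetical toy data: two well separated groups of 30 points each.
set.seed(1)
x <- rbind(matrix(rnorm(60), ncol = 2),
           matrix(rnorm(60, mean = 5), ncol = 2))

# Use 1: pass the interface function to clusterboot as clustermethod.
# Extra arguments (here krange) are handed on to kmeansCBI.
cb <- clusterboot(x, B = 20, clustermethod = kmeansCBI,
                  krange = 2, seed = 1, count = FALSE)
cb$bootmean   # one mean stability value per cluster

# Use 2: call the interface function directly, without clusterboot.
km <- kmeansCBI(x, krange = 2)
table(km$partition)
```

A small B keeps the sketch fast; stability values from so few bootstrap replicates are only indicative.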
kmeansCBI(data,krange,k,scaling=FALSE,runs=1,criterion="ch",...)
hclustCBI(data,k,cut="number",method,scaling=TRUE,noisecut=0,...)
hclusttreeCBI(data,minlevel=2,method,scaling=TRUE,...)
disthclustCBI(dmatrix,k,cut="number",method,noisecut=0,...)
noisemclustCBI(data,G,k,emModelNames,nnk,hcmodel=NULL,Vinv=NULL,
summary.out=FALSE,...)
distnoisemclustCBI(dmatrix,G,k,emModelNames,nnk,
hcmodel=NULL,Vinv=NULL,mdsmethod="classical",
mdsdim=4, summary.out=FALSE, points.out=FALSE,...)
claraCBI(data,k,usepam=TRUE,diss=inherits(data,"dist"),...)
pamkCBI(data,krange=2:10,k=NULL,criterion="asw", usepam=TRUE,
scaling=TRUE,diss=inherits(data,"dist"),...)
trimkmeansCBI(data,k,scaling=TRUE,trim=0.1,...)
disttrimkmeansCBI(dmatrix,k,scaling=TRUE,trim=0.1,
mdsmethod="classical",
mdsdim=4,...)
dbscanCBI(data,eps,MinPts,diss=inherits(data,"dist"),...)
mahalCBI(data,clustercut=0.5,...)
mergenormCBI(data, G=NULL, k=NULL, emModelNames=NULL, nnk=0,
hcmodel = NULL,
Vinv = NULL, mergemethod="bhat",
cutoff=0.1,...)
speccCBI(data,k,...)
data: a numeric matrix, usually a cases*variables data matrix.
claraCBI, pamkCBI and dbscanCBI also work with an n*n dissimilarity
matrix, see parameter diss.
dmatrix: a square numerical dissimilarity matrix or a dist object.
k: numeric, usually integer. In most cases, this is the number of
clusters for methods where this is fixed. For hclustCBI and
disthclustCBI see parameter cut below. Some methods have a k parameter
in addition to a G or krange parameter for compatibility; in these
cases k does not have to be specified, but if it is, it is always a
single number of clusters and overrides G and krange.
scaling: either a logical value or a numeric vector of length equal to
the number of variables. If scaling is a numeric vector, each variable
is divided by the corresponding value from scaling. If scaling is
TRUE, the (centered) variables are divided by their root-mean-square.
If scaling is FALSE, no scaling is done before execution.
runs: integer. Number of random initializations from which the k-means
algorithm is started.
criterion: "ch" or "asw". Decides whether the number of clusters is
estimated by the Calinski-Harabasz criterion or by the average
silhouette width.
cut: either "level" or "number". This determines how cutree is used to
obtain a partition from a hierarchy tree. cut="level" means that the
tree is cut at a particular dissimilarity level; cut="number" means
that the tree is cut in order to obtain a fixed number of clusters.
The parameter k specifies the number of clusters or the dissimilarity
level, depending on cut.
method: method for hierarchical clustering, see the documentation of
hclust.
noisecut: numeric. All clusters of size <=noisecut in the
disthclustCBI/hclustCBI partition are joined and declared as
noise/outliers.
minlevel: integer. minlevel=1 means that all clusters in the tree are
given out by hclusttreeCBI or disthclusttreeCBI, including one-point
clusters (but excluding the cluster with all points). minlevel=2
excludes the one-point clusters. minlevel=3 additionally excludes the
two-point cluster which has been merged first, and each further
increase of minlevel by 1 excludes the earliest formed cluster among
those remaining.
G: vector of integers. Number of clusters or numbers of clusters used
by mclustBIC. If G has more than one entry, the number of clusters is
estimated by the BIC.
emModelNames: vector of strings. Models for covariance matrices, see
the documentation of mclustBIC.
Vinv: numeric. See documentation of mclustBIC.
summary.out: logical. If TRUE, the result of summary.mclustBIC is
added as component mclustsummary to the output of noisemclustCBI and
distnoisemclustCBI.
mdsdim: integer. Dimensionality of MDS solution.
points.out: logical. If TRUE, the matrix of MDS points is added as
component points to the output of distnoisemclustCBI.
diss: logical. If TRUE, data will be considered as a dissimilarity
matrix. In claraCBI, this requires usepam=TRUE.
krange: vector of integers. Numbers of clusters to be compared.
trim: numeric between 0 and 1. Proportion of data points trimmed,
i.e., assigned to noise. See tclust in the tclust package and
trimkmeans.
eps: numeric. The radius of the neighborhoods to be considered by
dbscan.
MinPts: integer. The number of points that have to be in a
neighborhood for a point to be considered a cluster seed. See the
documentation of dbscan.
clustercut: numeric between 0 and 1. If fixmahal is used for fuzzy
clustering, a crisp partition is generated and points with cluster
membership values above clustercut are considered as members of the
corresponding cluster.
mergemethod: method for merging Gaussians, passed on as method to
mergenormals.
cutoff: numeric between 0 and 1, tuning constant for mergenormals.
...: further parameters to be transferred to the original clustering
functions (not required).
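The two non-trivial settings of the scaling parameter described above can be illustrated with base R's scale(). This is a sketch on hypothetical toy data, not the code of the interface functions. The source does not say whether variables are centered when scaling is a numeric vector, so centering is switched off in that case only to isolate the division.

```r
# Hypothetical toy cases*variables matrix.
x <- cbind(a = c(1, 2, 3, 4), b = c(10, 20, 30, 50))

# Analogue of scaling = TRUE: center each variable, then divide it by
# its root-mean-square (for centered columns, the standard deviation).
x_std <- scale(x, center = TRUE, scale = TRUE)

# Analogue of a numeric scaling vector: divide each variable by the
# corresponding value (here 2 for a, 10 for b).
x_div <- scale(x, center = FALSE, scale = c(2, 10))

apply(x_std, 2, sd)   # both standard deviations are 1 after scaling
x_div[, "a"]
```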
All interface functions return a list with the following components
(there may be some more, see summary.out and points.out above):
result: clustering result, usually a list with the full output of the
clustering method (the precise format doesn't matter); whatever you
want to use later.
nc: number of clusters. If some points don't belong to any cluster but
are declared as "noise", nc includes the noise component, and there
should be another component nccl, being the number of clusters not
including the noise component.
clusterlist: a list of logical vectors, one for each cluster, each of
length n (the number of data points), indicating whether a point is a
member of this cluster (TRUE) or not. If a noise component is
included, it should always be the last vector in this list.
partition: an integer vector of length n, partitioning the data. If
the method produces a partition, it should be the clustering. This
component is only used for plots, so you could do something like
rep(1,n) for non-partitioning methods.
clustermethod: a string indicating the clustering method.
nccl: see nc above.
nnk: by noisemclustCBI and distnoisemclustCBI, see above.
initnoise: logical vector, indicating initially estimated noise by
NNclean, called by noisemclustCBI and distnoisemclustCBI.
noise: logical. TRUE if points were classified as noise/outliers by
disthclustCBI.
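The components listed above are all that a user-written interface function needs to return. The following is a minimal sketch of such a function built on base R's kmeans; the name mykmeansCBI, the toy data and the clustermethod string are hypothetical and are not part of fpc.

```r
# Hypothetical user-defined interface function (not part of fpc).
mykmeansCBI <- function(data, k, ...) {
  data <- as.matrix(data)
  fit <- kmeans(data, centers = k, ...)
  partition <- fit$cluster
  nc <- k   # k-means has no noise component, so no nccl is needed
  # One logical membership vector of length n per cluster.
  clusterlist <- lapply(seq_len(nc), function(i) partition == i)
  list(result = fit, nc = nc, clusterlist = clusterlist,
       partition = partition, clustermethod = "kmeans (sketch)")
}

# Hypothetical toy data: two groups of 20 points each.
set.seed(1)
x <- rbind(matrix(rnorm(40), ncol = 2),
           matrix(rnorm(40, mean = 6), ncol = 2))
out <- mykmeansCBI(x, k = 2, nstart = 5)
str(out[c("nc", "partition", "clustermethod")])
```

A method with a noise component would additionally return nccl and put the noise vector last in clusterlist, as described above.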
All these functions call clustering methods implemented in R to
cluster data and to provide output in the format required by
clusterboot. Here is a brief overview. For further details see the
help pages of the involved clustering methods.
kmeansCBI: an interface to the function kmeansruns calling kmeans for
k-means clustering. (kmeansruns allows the specification of several
random initializations of the k-means algorithm and estimation of k by
the Calinski-Harabasz index or the average silhouette width.)
hclustCBI: an interface to the function hclust for agglomerative
hierarchical clustering with noise component (see parameter noisecut
above). This function produces a partition and assumes a
cases*variables matrix as input.
hclusttreeCBI: an interface to the function hclust for agglomerative
hierarchical clustering. This function gives out all clusters
belonging to the hierarchy (upward from a certain level, see parameter
minlevel above).
disthclustCBI: an interface to the function hclust for agglomerative
hierarchical clustering with noise component (see parameter noisecut
above). This function produces a partition and assumes a dissimilarity
matrix as input.
noisemclustCBI: an interface to the function mclustBIC for normal
mixture model based clustering. Warning: mclustBIC often has problems
with multiple (i.e., duplicated) points. In clusterboot, it is
recommended to use this together with multipleboot=FALSE.
distnoisemclustCBI: an interface to the function mclustBIC for normal
mixture model based clustering. This assumes a dissimilarity matrix as
input and generates a data matrix by multidimensional scaling first.
Warning: mclustBIC often has problems with multiple (i.e., duplicated)
points. In clusterboot, it is recommended to use this together with
multipleboot=FALSE.
claraCBI: an interface to the functions pam and clara for partitioning
around medoids.
pamkCBI: an interface to the function pamk calling pam for
partitioning around medoids. The number of clusters is estimated by
the Calinski-Harabasz index or by the average silhouette width.
trimkmeansCBI: an interface to the function trimkmeans for trimmed
k-means clustering. This assumes a cases*variables matrix as input.
Note that for most applications, tclustCBI with parameter
restr.fact=1 will give about the same result but faster.
tclustCBI: an interface to the function tclust in the tclust package
for trimmed Gaussian clustering. This assumes a cases*variables matrix
as input. NOTE: The tclust package is currently only available on CRAN
as an archived version. Therefore I cannot currently offer the
tclustCBI function in fpc. The code for the function is given in the
Examples section below, so if you need it, run that code first.
disttrimkmeansCBI: an interface to the function trimkmeans for trimmed
k-means clustering. This assumes a dissimilarity matrix as input and
generates a data matrix by multidimensional scaling first.
dbscanCBI: an interface to the function dbscan for density based
clustering.
mahalCBI: an interface to the function fixmahal for fixed point
clustering. This assumes a cases*variables matrix as input.
mergenormCBI: an interface to the function mergenormals for clustering
by merging Gaussian mixture components. Unlike mergenormals,
mergenormCBI includes the computation of the initial Gaussian mixture.
This assumes a cases*variables matrix as input.
speccCBI: an interface to the function specc for spectral clustering.
See the specc help page for additional tuning parameters. This assumes
a cases*variables matrix as input.
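For the hierarchical interfaces, the effect of the cut and noisecut parameters can be illustrated with base R's hclust and cutree. This is only a sketch of the idea on hypothetical toy data, not the code used inside hclustCBI or disthclustCBI; the variable names are chosen for illustration.

```r
# Hypothetical toy data: two tight groups of three points each and
# one isolated point.
x <- cbind(c(0, 0.1, 0.2, 5, 5.1, 5.2, 20),
           c(0, 0.1, 0.0, 5, 5.1, 5.0, 20))
tree <- hclust(dist(x), method = "average")

# Analogue of cut = "number": k is a number of clusters.
by_number <- cutree(tree, k = 3)
# Analogue of cut = "level": k is a dissimilarity level.
by_level <- cutree(tree, h = 1)

# Analogue of noisecut: clusters of size <= noisecut are joined into
# a single noise component, which is numbered last.
noisecut <- 1
sizes <- table(by_level)
small <- as.integer(names(sizes)[sizes <= noisecut])
is_noise <- by_level %in% small
kept <- sort(unique(by_level[!is_noise]))
partition <- match(by_level, kept)       # NA for noise points
nccl <- length(kept)                     # clusters without noise
nc <- nccl + as.integer(any(is_noise))   # clusters including noise
partition[is_noise] <- nc

partition
```

Here the isolated point forms a one-point cluster, so with noisecut set to 1 it becomes the noise component and nc exceeds nccl by one.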
clusterboot, dist, kmeans, kmeansruns, hclust, mclustBIC, pam, pamk,
clara, trimkmeans, dbscan, fixmahal
library(fpc)
options(digits=3)
set.seed(20000)
face <- rFace(50,dMoNo=2,dNoEy=0,p=2)
dbs <- dbscanCBI(face,eps=1.5,MinPts=4)
dhc <- disthclustCBI(dist(face),method="average",k=1.5,noisecut=2)
table(dbs$partition,dhc$partition)
dm <- mergenormCBI(face,G=10,emModelNames="EEE",nnk=2)
# Not run:
# Here is the tclustCBI-code:
# tclustCBI <- function(data,k,trim=0.05,...){
#   if(require(tclust)){
#     data <- as.matrix(data)
#     c1 <- tclust(data,k=k,alpha=trim,...)
#     sc1c <- c1$cluster
#     cl <- list()
#     nc <- nccl <- max(sc1c)
#     if (sum(sc1c==0)>0){
#       nc <- nccl+1
#       sc1c[sc1c==0] <- nc
#     }
#     for (i in 1:nc)
#       cl[[i]] <- sc1c == i
#     out <- list(result=c1,nc=nc,nccl=nccl,clusterlist=cl,partition=sc1c,
#                 clustermethod="tclust")
#     out
#   }
#   else
#     warning("tclust could not be loaded")
# }