sjc.qclus: Compute quick cluster analysis

Description

Computes a quick k-means or hierarchical cluster analysis and displays the "cluster characteristics" as a plot.

Usage

sjc.qclus(data, groupcount = NULL, groups = NULL, method = c("kmeans",
  "hclust"), distance = c("euclidean", "maximum", "manhattan", "canberra",
  "binary", "minkowski"), agglomeration = c("ward", "ward.D", "ward.D2",
  "single", "complete", "average", "mcquitty", "median", "centroid"),
  iter.max = 20, algorithm = c("Hartigan-Wong", "Lloyd", "MacQueen"),
  show.accuracy = FALSE, title = NULL, axis.labels = NULL,
  wrap.title = 40, wrap.labels = 20, wrap.legend.title = 20,
  wrap.legend.labels = 20, facet.grid = FALSE, geom.colors = "Paired",
  geom.size = 0.5, geom.spacing = 0.1, show.legend = TRUE,
  show.grpcnt = TRUE, legend.title = NULL, legend.labels = NULL,
  coord.flip = FALSE, reverse.axis = FALSE, prnt.plot = TRUE)

Arguments

data

A data frame with variables that should be used for the cluster analysis.

groupcount

Number of groups (clusters) used for the cluster solution. May also be a set of initial (distinct) cluster centres if method = "kmeans" (see kmeans for details on the centers argument). If groupcount = NULL and method = "kmeans", the optimal number of clusters is calculated using the gap statistic (see sjc.kgap). For method = "hclust", groupcount must be specified. The following functions may be helpful for estimating the number of clusters:

Use sjc.elbow to determine the group-count depending on the elbow-criterion.
If method = "kmeans", use sjc.kgap to determine the group-count according to the gap-statistic.
If method = "hclust" (hierarchical clustering), use sjc.dend to inspect different cluster group solutions.
Use sjc.grpdisc to inspect the goodness of grouping (accuracy of classification).
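
As a rough illustration of the elbow criterion mentioned above, the following base R sketch plots the total within-cluster sum of squares against the number of clusters. It does not call sjc.elbow and is not its implementation; the range of candidate cluster counts (1 to 8) and the use of nstart = 10 are assumptions made for this example.

```r
# Base R sketch of the elbow criterion (not the sjc.elbow implementation).
set.seed(123)            # kmeans uses random starting centres
x <- scale(mtcars)       # standardize all variables
ks <- 1:8                # candidate cluster counts (assumption for this example)

# total within-cluster sum of squares for each candidate k
wss <- sapply(ks, function(k) {
  kmeans(x, centers = k, nstart = 10, iter.max = 20)$tot.withinss
})

# look for the "elbow" where adding clusters stops paying off
plot(ks, wss, type = "b",
     xlab = "Number of clusters",
     ylab = "Total within-cluster sum of squares")
```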

groups

Optional. By default, this argument is NULL and will be ignored. However, to plot existing cluster groups, specify both groupcount and groups. groups is a vector of the same length as nrow(data) and indicates the group classification of the cluster analysis. The group classification can be computed with the sjc.cluster function. See 'Examples'.

method

Method for computing the cluster analysis. By default ("kmeans"), a kmeans cluster analysis will be computed. Use "hclust" to compute a hierarchical cluster analysis. You can specify the initial letters only.

distance

Distance measure to be used when method = "hclust" (for hierarchical clustering). Must be one of "euclidean", "maximum", "manhattan", "canberra", "binary" or "minkowski". See dist. If method = "kmeans", this argument will be ignored.

agglomeration

Agglomeration method to be used when method = "hclust" (for hierarchical clustering). This should be one of "ward", "ward.D", "ward.D2", "single", "complete", "average", "mcquitty", "median" or "centroid". Default is "ward" (see hclust). If method = "kmeans", this argument will be ignored. See 'Note'.
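
To show how the distance and agglomeration arguments relate to the underlying base R functions, here is a minimal sketch using dist, hclust and cutree directly. It is not the internal code of sjc.qclus; the choice of three groups and of "ward.D2" is an assumption for illustration. In current versions of R, hclust treats "ward" as an older name for "ward.D" and offers "ward.D2" separately.

```r
# Base R sketch of a hierarchical cluster analysis (not the sjc.qclus internals).
x <- scale(mtcars)                       # standardize all variables
d <- dist(x, method = "euclidean")       # corresponds to the 'distance' argument
hc <- hclust(d, method = "ward.D2")      # corresponds to the 'agglomeration' argument

# cut the tree into a fixed number of groups (3 is an assumption for this example)
groups <- cutree(hc, k = 3)
table(groups)
```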

iter.max

Maximum number of iterations allowed. Only applies if method = "kmeans". See kmeans for details on this argument.

algorithm

Algorithm used for calculating the k-means clusters. Only applies if method = "kmeans". May be one of "Hartigan-Wong" (default), "Lloyd" (used by SPSS), or "MacQueen". See kmeans for details on this argument.
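
The iter.max and algorithm arguments have the same names as arguments of the base R kmeans function. The sketch below calls kmeans directly to show their effect; the data set, the seed and the choice of three clusters are assumptions for this example, and whether sjc.qclus standardizes the data before clustering is not stated here.

```r
# Base R sketch: k-means with an explicit algorithm and iteration limit.
set.seed(42)                 # kmeans picks random starting centres
x <- scale(mtcars)           # standardizing is an assumption for this example

km <- kmeans(x, centers = 3, iter.max = 20, algorithm = "Lloyd")

km$size                      # number of cases per cluster
km$iter                      # iterations actually used (at most iter.max)
```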

show.accuracy

Logical, if TRUE, the sjc.grpdisc function will be called, which computes a linear discriminant analysis on the classified cluster groups and plots a bar graph indicating the goodness of classification for each group.
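
The idea behind this accuracy check can be sketched with lda from the MASS package: fit a linear discriminant analysis on the cluster groups and see how many cases of each group are assigned back to their own group. This is only an approximation of what sjc.grpdisc reports, under the assumption that "goodness of classification" means the share of correctly re-classified cases per group; the data set and cluster count are also assumptions.

```r
# Sketch of a discriminant-analysis check on cluster groups
# (an approximation, not the sjc.grpdisc implementation).
library(MASS)                # provides lda()

set.seed(42)
dat <- na.omit(airquality)   # drop incomplete cases
z <- scale(dat)              # standardize all variables

# cluster groups from k-means (3 clusters is an assumption for this example)
grp <- factor(kmeans(z, centers = 3, nstart = 10)$cluster)

# linear discriminant analysis with the cluster groups as classes
fit <- lda(z, grouping = grp)
pred <- predict(fit, z)$class

# share of cases per group that the LDA assigns back to the same group
accuracy <- tapply(pred == grp, grp, mean)
round(accuracy, 2)
```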

title

character vector, used as plot title. Depending on plot type and function, will be set automatically. If title = "", no title is printed. For effect-plots, may also be a character vector of length > 1, to define titles for each sub-plot or facet.

axis.labels

character vector with labels used as axis labels. Optional argument, since in most cases, axis labels are set automatically.

wrap.title

numeric, determines how many chars of the plot title are displayed in one line and when a line break is inserted.

wrap.labels

numeric, determines how many chars of the value, variable or axis labels are displayed in one line and when a line break is inserted.

wrap.legend.title

numeric, determines how many chars of the legend's title are displayed in one line and when a line break is inserted.

wrap.legend.labels

numeric, determines how many chars of the legend labels are displayed in one line and when a line break is inserted.

facet.grid

TRUE to arrange the layout of multiple plots in a grid of an integrated single plot. This argument calls facet_wrap or facet_grid to arrange plots. Use plot_grid to plot multiple plot-objects as an arranged grid with grid.arrange.

geom.colors

user defined color for geoms. See 'Details' in sjp.grpfrq.

geom.size

size or width of the geoms (bar width, line thickness or point size, depending on plot type and function). Note that bar and bin widths mostly need smaller values than dot sizes.

geom.spacing

the spacing between geoms (e.g. bar spacing).

show.legend

logical, if TRUE, and depending on plot type and function, a legend is added to the plot.

show.grpcnt

Logical, if TRUE (default), the count within each cluster group is added to the legend labels (e.g. "Group 1 (n=87)").

legend.title

character vector, used as title for the plot legend.

legend.labels

character vector with labels for the guide/legend.

coord.flip

logical, if TRUE, the x and y axis are swapped.

reverse.axis

Logical, if TRUE, the values on the x-axis are reversed.

prnt.plot

logical, if TRUE (default), plots the results as a graph. Use FALSE if you don't want to plot any graphs. In either case, the ggplot-object will be returned as value.

Value

(Invisibly) returns an object with

data: the used data frame for plotting,
plot: the ggplot object,
groupcount: the number of clusters found (as calculated by sjc.kgap),
classification: the group classification (as calculated by sjc.cluster), including missing values, so this vector can be appended to the original data frame.
accuracy: the accuracy of group classification (as calculated by sjc.grpdisc).
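
The remark that classification includes missing values can be illustrated in base R: cluster only the complete cases, then write the group numbers back into a vector that has one entry per original row, leaving NA where a case was dropped. This is a sketch of the idea under that assumption, not the code used by sjc.cluster; the names ok, classification and airquality_grp are made up for this example.

```r
# Sketch: a classification vector padded with NA so it fits the original data.
set.seed(42)
ok <- complete.cases(airquality)                 # rows without missing values

# cluster the complete cases only (3 clusters is an assumption for this example)
km <- kmeans(scale(airquality[ok, ]), centers = 3, nstart = 10)

# one entry per original row; rows with missing values stay NA
classification <- rep(NA_integer_, nrow(airquality))
classification[ok] <- km$cluster

# the vector can now be appended to the original data frame
airquality_grp <- cbind(airquality, cluster = classification)
head(airquality_grp)
```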

Details

The following steps are computed in this function:

If method = "kmeans", this function first determines the optimal group count via the gap statistic (unless argument groupcount is specified), using the sjc.kgap function.
A cluster analysis is performed by running the sjc.cluster function to determine the cluster groups.
Then, all variables in data are scaled and centered. The mean value of these z-scores within each cluster group is calculated to see how certain characteristics (variables) in a cluster group differ in relation to other cluster groups.
These results are plotted as a graph.
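
The scaling and averaging step described above can be sketched in base R as follows. The sketch uses kmeans directly instead of sjc.cluster, and the data set, seed and cluster count are assumptions for this example; it only illustrates how mean z-scores per cluster group are obtained.

```r
# Sketch: mean z-scores of each variable within each cluster group.
set.seed(42)
dat <- na.omit(airquality)       # drop incomplete cases
z <- scale(dat)                  # centre and scale all variables (z-scores)

# cluster groups (3 clusters is an assumption for this example)
km <- kmeans(z, centers = 3, nstart = 10)

# mean z-score per variable and cluster group
profile <- aggregate(as.data.frame(z),
                     by = list(cluster = km$cluster),
                     FUN = mean)
print(round(profile, 2))
```

Positive values mark variables on which a cluster group lies above the overall mean, negative values those on which it lies below; these are the "cluster characteristics" that the plot displays.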

This function can also be used to plot an existing cluster solution as a graph without computing a new cluster analysis. See argument groups for more details.

References

Maechler M, Rousseeuw P, Struyf A, Hubert M, Hornik K (2014) cluster: Cluster Analysis Basics and Extensions. R package.

Examples

# NOT RUN {
# k-means clustering of mtcars-dataset
sjc.qclus(mtcars)

# k-means clustering of airquality-dataset with 4 pre-defined
# groups in a faceted panel
sjc.qclus(airquality, groupcount = 4, facet.grid = TRUE)
# }
# NOT RUN {
# k-means clustering of airquality data
# and saving the results. most likely, 3 cluster
# groups have been found (see below).
airgrp <- sjc.qclus(airquality)

# "re-plot" cluster groups, without computing
# new k-means cluster analysis.
sjc.qclus(airquality, groupcount = 3, groups = airgrp$classification)

# }