sjPlot (version 2.1.0)

sjc.qclus: Compute quick cluster analysis

Description

Computes a quick k-means or hierarchical cluster analysis and displays the "cluster characteristics" as a plot.

Usage

sjc.qclus(data, groupcount = NULL, groups = NULL, method = c("kmeans", "hclust"), distance = c("euclidean", "maximum", "manhattan", "canberra", "binary", "minkowski"), agglomeration = c("ward", "ward.D", "ward.D2", "single", "complete", "average", "mcquitty", "median", "centroid"), iter.max = 20, algorithm = c("Hartigan-Wong", "Lloyd", "MacQueen"), show.accuracy = FALSE, title = NULL, axis.labels = NULL, wrap.title = 40, wrap.labels = 20, wrap.legend.title = 20, wrap.legend.labels = 20, facet.grid = FALSE, geom.colors = "Paired", geom.size = 0.5, geom.spacing = 0.1, show.legend = TRUE, show.grpcnt = TRUE, legend.title = NULL, legend.labels = NULL, coord.flip = FALSE, reverse.axis = FALSE, prnt.plot = TRUE)

Arguments

data
data.frame with variables that should be used for the cluster analysis.
groupcount
number of groups (clusters) used for the cluster solution. May also be a set of initial (distinct) cluster centres if method = "kmeans" (see kmeans for details on the centers argument). If groupcount = NULL and method = "kmeans", the optimal number of clusters is calculated using the gap statistic (see sjc.kgap). For method = "hclust", groupcount must be specified. The following functions may help in estimating the number of clusters:
  • Use sjc.elbow to determine the group-count depending on the elbow-criterion.
  • If method = "kmeans", use sjc.kgap to determine the group-count according to the gap-statistic.
  • If method = "hclust" (hierarchical clustering), use sjc.dend to inspect different cluster group solutions.
  • Use sjc.grpdisc to inspect the goodness of grouping (accuracy of classification).
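As a rough illustration of the gap-statistic approach, the following sketch estimates a group count with clusGap and maxSE from the cluster package. Whether sjc.kgap uses the same settings is an assumption; the data set, K.max, B, nstart and the selection rule are chosen here only for illustration.

```r
# Sketch only: estimate the number of k-means clusters via the gap statistic.
# Settings (K.max, B, nstart, selection rule) are illustrative assumptions,
# not necessarily what sjc.kgap uses internally.
library(cluster)

set.seed(123)                 # clusGap draws random reference data sets
x <- scale(mtcars)            # standardise all variables

gap <- clusGap(x, FUNcluster = kmeans, K.max = 6, B = 50, nstart = 10)

# Choose k from the gap values and their simulation standard errors
k <- maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"], method = "Tibs2001SEmax")
k
```

The resulting k could then be passed as groupcount.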
groups
optional; NULL by default, in which case it is ignored. To plot existing cluster groups, specify both groupcount and groups. groups is a vector of the same length as nrow(data) that gives the group classification of the cluster analysis. The group classification can be computed with the sjc.cluster function. See 'Examples'.
method
method for computing the cluster analysis. By default ("kmeans"), a kmeans cluster analysis will be computed. Use "hclust" to compute a hierarchical cluster analysis. You can specify the initial letters only.
distance
distance measure to be used when method = "hclust" (for hierarchical clustering). Must be one of "euclidean", "maximum", "manhattan", "canberra", "binary" or "minkowski". See dist. If method = "kmeans", this argument will be ignored.
agglomeration
agglomeration method to be used when method = "hclust" (for hierarchical clustering). This should be one of "ward", "ward.D", "ward.D2", "single", "complete", "average", "mcquitty", "median" or "centroid". Default is "ward" (see hclust). If method = "kmeans", this argument will be ignored. See 'Note'.
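To show how the distance and agglomeration arguments relate to base R, here is a minimal sketch of a hierarchical cluster analysis using dist, hclust and cutree directly. That sjc.qclus passes these arguments through in exactly this way is an assumption; the data set and the group count of 3 are illustrative.

```r
# Sketch only: hierarchical clustering with base R, mirroring the
# 'distance' and 'agglomeration' arguments (assumed mapping to dist/hclust).
x <- scale(mtcars)                      # standardise all variables

d  <- dist(x, method = "euclidean")     # corresponds to 'distance'
hc <- hclust(d, method = "ward.D2")     # corresponds to 'agglomeration'
groups <- cutree(hc, k = 3)             # corresponds to 'groupcount'

table(groups)                           # cluster sizes
```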
iter.max
maximum number of iterations allowed. Only applies if method = "kmeans". See kmeans for details on this argument.
algorithm
algorithm used for computing the k-means clusters. Only applies if method = "kmeans". May be one of "Hartigan-Wong" (default), "Lloyd" (used by SPSS), or "MacQueen". See kmeans for details on this argument.
show.accuracy
logical, if TRUE, the sjc.grpdisc function will be called, which computes a linear discriminant analysis on the classified cluster groups and plots a bar graph indicating the goodness of classification for each group.
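The idea behind this accuracy check can be sketched with MASS::lda: fit a linear discriminant analysis on the cluster groups and compute, per group, the share of cases that are assigned back to their own cluster. How sjc.grpdisc computes and plots this exactly is an assumption; the data set and group count are illustrative.

```r
# Sketch only: per-group classification accuracy via linear discriminant
# analysis. The exact computation in sjc.grpdisc is an assumption.
library(MASS)

set.seed(123)
x <- scale(iris[, 1:4])
groups <- kmeans(x, centers = 3, nstart = 10)$cluster

fit  <- lda(x, grouping = groups)
pred <- predict(fit, newdata = x)$class

# Share of cases in each cluster that the LDA assigns back to that cluster
accuracy <- tapply(as.character(pred) == as.character(groups), groups, mean)
accuracy
```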
title
character vector, used as plot title. Depending on plot type and function, will be set automatically. If title = "", no title is printed.
axis.labels
character vector with labels used as axis labels. Optional argument, since in most cases, axis labels are set automatically.
wrap.title
numeric, determines how many chars of the plot title are displayed in one line and when a line break is inserted.
wrap.labels
numeric, determines how many chars of the value, variable or axis labels are displayed in one line and when a line break is inserted.
wrap.legend.title
numeric, determines how many chars of the legend's title are displayed in one line and when a line break is inserted.
wrap.legend.labels
numeric, determines how many chars of the legend labels are displayed in one line and when a line break is inserted.
facet.grid
TRUE to arrange the layout of multiple plots in a grid within a single integrated plot. This argument calls facet_wrap or facet_grid to arrange plots. Use plot_grid to plot multiple plot objects as an arranged grid with grid.arrange.
geom.colors
user defined color for geoms. See 'Details' in sjp.grpfrq.
geom.size
size or width of the geoms (bar width, line thickness or point size, depending on plot type and function). Note that bar and bin widths mostly need smaller values than dot sizes.
geom.spacing
the spacing between geoms (e.g. bar spacing).
show.legend
logical, if TRUE, and depending on plot type and function, a legend is added to the plot.
show.grpcnt
if TRUE (default), the count within each cluster group is added to the legend labels (e.g. "Group 1 (n=87)").
legend.title
character vector, used as title for the plot legend.
legend.labels
character vector with labels for the guide/legend.
coord.flip
logical, if TRUE, the x and y axis are swapped.
reverse.axis
if TRUE, the values on the x-axis are reversed.
prnt.plot
logical, if TRUE (default), plots the results as a graph. Use FALSE if you don't want to plot any graphs. In either case, the ggplot object will be returned as value.

Value

(Invisibly) returns an object with
  • data: the data frame used for plotting,
  • plot: the ggplot object,
  • groupcount: the number of clusters found (as calculated by sjc.kgap),
  • classification: the group classification (as calculated by sjc.cluster), including missing values, so this vector can be appended to the original data frame.
  • accuracy: the accuracy of group classification (as calculated by sjc.grpdisc).
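The following sketch illustrates what a classification vector "including missing values" looks like: rows with missing data get NA, so the vector has one entry per row of the original data frame. It uses base R kmeans on complete cases; the scaling step, the group count of 3 and the column name cluster are assumptions made for illustration, not the internals of sjc.cluster.

```r
# Sketch only: build a classification vector that keeps missing values,
# so it lines up with the rows of the original data frame.
set.seed(123)
complete <- complete.cases(airquality)         # rows without any NA
x <- scale(airquality[complete, ])             # scaling is an assumption

km <- kmeans(x, centers = 3, nstart = 10)

classification <- rep(NA_integer_, nrow(airquality))
classification[complete] <- km$cluster         # NA for incomplete rows

# Append the classification to the original data frame
airquality_grp <- cbind(airquality, cluster = classification)
head(airquality_grp)
```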

Details

The following steps are performed in this function:
  1. If method = "kmeans", this function first determines the optimal group count via the gap statistic (unless argument groupcount is specified), using the sjc.kgap function.
  2. A cluster analysis is performed by running the sjc.cluster function to determine the cluster groups.
  3. Then, all variables in data are scaled and centered. The mean value of these z-scores within each cluster group is calculated to see how certain characteristics (variables) in a cluster group differ in relation to other cluster groups.
  4. These results are plotted as a graph.
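Steps 2 to 4 above can be approximated with base R as follows. This is a sketch under stated assumptions, not the function's actual implementation: the group count is fixed at 3 rather than estimated, kmeans is called directly instead of sjc.cluster, and a base barplot stands in for the ggplot output.

```r
# Sketch of steps 2 to 4 with base R; the group count is fixed at 3 here
# (an assumption) instead of being estimated via the gap statistic.
set.seed(123)
dat <- mtcars

# Step 2: cluster analysis (k-means on the raw data in this sketch)
groups <- kmeans(dat, centers = 3, iter.max = 20, nstart = 10)$cluster

# Step 3: z-standardise all variables, then average the z-scores per cluster
z <- as.data.frame(scale(dat))
zmeans <- aggregate(z, by = list(cluster = groups), FUN = mean)

# Step 4: plot the cluster characteristics as grouped bars
barplot(as.matrix(zmeans[, -1]), beside = TRUE,
        legend.text = paste("Group", zmeans$cluster),
        las = 2, ylab = "Mean z-score")
```

Bars above zero indicate that a cluster scores above the overall mean on that variable, bars below zero the opposite.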

This function can also be used to plot an existing cluster solution as a graph without computing a new cluster analysis. See argument groups for more details.

References

Maechler M, Rousseeuw P, Struyf A, Hubert M, Hornik K (2014) cluster: Cluster Analysis Basics and Extensions. R package.

Examples

## Not run: 
# # k-means clustering of mtcars-dataset
# sjc.qclus(mtcars)
# 
# # k-means clustering of airquality-dataset with 4 pre-defined
# # groups in a faceted panel
# sjc.qclus(airquality, groupcount = 4, facet.grid = TRUE)
## End(Not run)

# k-means clustering of airquality data
# and saving the results. most likely, 3 cluster
# groups have been found (see below).
airgrp <- sjc.qclus(airquality)

# "re-plot" cluster groups, without computing
# new k-means cluster analysis.
sjc.qclus(airquality, groupcount = 3, groups = airgrp$classification)