Computes a quick k-means or hierarchical cluster analysis and displays the "cluster characteristics" as a plot.
sjc.qclus(data, groupcount = NULL, groups = NULL,
method = c("kmeans", "hclust"), distance = c("euclidean", "maximum",
"manhattan", "canberra", "binary", "minkowski"),
agglomeration = c("ward", "ward.D", "ward.D2", "single", "complete",
"average", "mcquitty", "median", "centroid"), iter.max = 20,
algorithm = c("Hartigan-Wong", "Lloyd", "MacQueen"),
show.accuracy = FALSE, title = NULL, axis.labels = NULL,
wrap.title = 40, wrap.labels = 20, wrap.legend.title = 20,
wrap.legend.labels = 20, facet.grid = FALSE,
geom.colors = "Paired", geom.size = 0.5, geom.spacing = 0.1,
show.legend = TRUE, show.grpcnt = TRUE, legend.title = NULL,
legend.labels = NULL, coord.flip = FALSE, reverse.axis = FALSE)
data: A data frame with the variables that should be used for the cluster analysis.
groupcount: Number of groups (clusters) used for the cluster solution. May also be a set of initial (distinct) cluster centres if method = "kmeans" (see kmeans for details on the centers argument). If groupcount = NULL and method = "kmeans", the optimal number of clusters is calculated using the gap statistic (see sjc.kgap). For method = "hclust", groupcount must be specified. The following functions may help to estimate the number of clusters:
Use sjc.elbow to determine the group count according to the elbow criterion.
If method = "kmeans", use sjc.kgap to determine the group count according to the gap statistic.
If method = "hclust" (hierarchical clustering), use sjc.dend to inspect different cluster group solutions.
Use sjc.grpdisc to inspect the goodness of grouping (accuracy of classification).
groups: Optional. By default, this argument is NULL and will be ignored. However, to plot existing cluster groups, specify both groupcount and groups. groups is a vector of the same length as nrow(data) and indicates the group classification of the cluster analysis. The group classification can be computed with the sjc.cluster function. See 'Examples'.
method: Method for computing the cluster analysis. By default ("kmeans"), a k-means cluster analysis is computed. Use "hclust" to compute a hierarchical cluster analysis. You can specify the initial letters only.
distance: Distance measure to be used when method = "hclust" (for hierarchical clustering). Must be one of "euclidean", "maximum", "manhattan", "canberra", "binary" or "minkowski". See dist. If method = "kmeans", this argument is ignored.
agglomeration: Agglomeration method to be used when method = "hclust" (for hierarchical clustering). This should be one of "ward", "ward.D", "ward.D2", "single", "complete", "average", "mcquitty", "median" or "centroid". Default is "ward" (see hclust). If method = "kmeans", this argument is ignored. See 'Note'.
iter.max: Maximum number of iterations allowed. Only applies if method = "kmeans". See kmeans for details on this argument.
algorithm: Algorithm used for calculating the k-means clusters. Only applies if method = "kmeans". May be one of "Hartigan-Wong" (default), "Lloyd" (used by SPSS), or "MacQueen". See kmeans for details on this argument.
show.accuracy: Logical, if TRUE, the sjc.grpdisc function will be called, which computes a linear discriminant analysis on the classified cluster groups and plots a bar graph indicating the goodness of classification for each group.
title: Character vector, used as the plot title. Depending on plot type and function, it is set automatically. If title = "", no title is printed. For effect plots, may also be a character vector of length > 1 to define titles for each sub-plot or facet.
axis.labels: Character vector with labels used as axis labels. Optional, since in most cases axis labels are set automatically.
wrap.title: Numeric, determines how many characters of the plot title are displayed on one line before a line break is inserted.
wrap.labels: Numeric, determines how many characters of the value, variable or axis labels are displayed on one line before a line break is inserted.
wrap.legend.title: Numeric, determines how many characters of the legend's title are displayed on one line before a line break is inserted.
wrap.legend.labels: Numeric, determines how many characters of the legend labels are displayed on one line before a line break is inserted.
facet.grid: TRUE to arrange the layout of multiple plots in a grid within a single integrated plot. This argument calls facet_wrap or facet_grid to arrange plots. Use plot_grid to plot multiple plot objects as an arranged grid with grid.arrange.
geom.colors: User-defined color for geoms. See 'Details' in sjp.grpfrq.
geom.size: Size or width of the geoms (bar width, line thickness or point size, depending on plot type and function). Note that bar and bin widths mostly need smaller values than dot sizes.
geom.spacing: The spacing between geoms (i.e. bar spacing).
show.legend: Logical, if TRUE, and depending on plot type and function, a legend is added to the plot.
show.grpcnt: Logical, if TRUE (default), the count within each cluster group is added to the legend labels (e.g. "Group 1 (n=87)").
legend.title: Character vector, used as title for the plot legend.
legend.labels: Character vector with labels for the guide/legend.
coord.flip: Logical, if TRUE, the x and y axes are swapped.
reverse.axis: Logical, if TRUE, the values on the x-axis are reversed.
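The clustering arguments above are described in terms of kmeans, dist and hclust. As a rough illustration only (this is not the internal code of sjc.qclus), the following base R sketch shows the kind of calls those arguments correspond to. The data set, the standardisation step, the seed and the choice of three groups are assumptions made for the example.

```r
# Illustrative base R sketch; not the internals of sjc.qclus.
# Assumptions: mtcars as data, z-standardised input, 3 groups.
x <- scale(mtcars)

# method = "kmeans": iter.max and algorithm correspond to stats::kmeans()
set.seed(123)  # k-means uses random starting centres
km <- kmeans(x, centers = 3, iter.max = 20, algorithm = "Lloyd")
table(km$cluster)

# method = "hclust": distance corresponds to stats::dist(),
# agglomeration corresponds to stats::hclust()
d  <- dist(x, method = "euclidean")
hc <- hclust(d, method = "ward.D2")
hc_groups <- cutree(hc, k = 3)  # cut the tree into 3 groups
table(hc_groups)
```

Note that hclust requires a group count to cut the tree, which matches the requirement that groupcount be specified for method = "hclust".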
(Invisibly) returns an object with
data: the data frame used for plotting,
plot: the ggplot object,
groupcount: the number of clusters found (as calculated by sjc.kgap),
classification: the group classification (as calculated by sjc.cluster), including missing values, so this vector can be appended to the original data frame,
accuracy: the accuracy of group classification (as calculated by sjc.grpdisc).
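The classification element is described as including missing values so that it can be appended to the original data frame. A minimal base R sketch of that idea follows. It assumes that rows with missing values are left out of the clustering and receive NA as their group; how sjc.cluster actually treats such rows is not shown here, and the object names are hypothetical.

```r
# Illustrative sketch; the NA handling is an assumption, not taken from sjc.cluster.
complete <- complete.cases(airquality)

set.seed(123)
km <- kmeans(scale(airquality[complete, ]), centers = 3)

# Build a vector as long as the original data; incomplete rows stay NA
classification <- rep(NA_integer_, nrow(airquality))
classification[complete] <- km$cluster

# Same length as nrow(data), so it can be appended as a new column
airquality_grouped <- cbind(airquality, cluster = classification)
head(airquality_grouped)
```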
This function performs the following steps:
If method = "kmeans", the function first determines the optimal group count via the gap statistic (unless the argument groupcount is specified), using the sjc.kgap function.
A cluster analysis is performed by running the sjc.cluster function to determine the cluster groups.
Then, all variables in data are scaled and centered. The mean of these z-scores within each cluster group is calculated to show how certain characteristics (variables) of a cluster group differ from those of the other cluster groups.
These results are plotted as a graph.
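The scaling and averaging steps above can be sketched in base R as follows. This is a hedged illustration rather than the package's own code: it assumes the mtcars data, three groups, and clustering on the standardised values, and it stops short of drawing the plot.

```r
# Illustrative sketch of the "cluster characteristics" computation.
# Assumptions: mtcars, 3 groups, clustering on the standardised values.
z <- scale(mtcars)  # centre and scale every variable (z-scores)

set.seed(123)
groups <- kmeans(z, centers = 3)$cluster

# Mean z-score of each variable within each cluster group
cluster_means <- aggregate(as.data.frame(z), by = list(group = groups), FUN = mean)
print(round(cluster_means[, 1:5], 2))

# Values above 0 mean the group lies above the overall mean on that variable
```

Because every variable is on the same z-score scale, the group means can be compared across variables in a single plot.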
This function can also be used to plot an existing cluster solution as a graph without computing a new cluster analysis. See argument groups for more details.
Maechler M, Rousseeuw P, Struyf A, Hubert M, Hornik K (2014) cluster: Cluster Analysis Basics and Extensions. R package.
# NOT RUN {
# k-means clustering of mtcars-dataset
sjc.qclus(mtcars)
# k-means clustering of airquality dataset with 4 pre-defined
# groups in a faceted panel
sjc.qclus(airquality, groupcount = 4, facet.grid = TRUE)
# }
# NOT RUN {
# k-means clustering of airquality data
# and saving the results. most likely, 3 cluster
# groups have been found (see below).
airgrp <- sjc.qclus(airquality)
# "re-plot" cluster groups, without computing
# new k-means cluster analysis.
sjc.qclus(airquality, groupcount = 3, groups = airgrp$classification)
# }