
dtwclust (version 2.1.2)

dtwclust: Time series clustering

Description

This is the main function to perform time series clustering. It supports partitional, hierarchical, fuzzy, k-Shape and TADPole clustering. See the details and the examples for more information.

Usage

dtwclust(data = NULL, type = "partitional", k = 2L, method = "average",
         distance = "dtw", centroid = "pam", preproc = NULL, dc = NULL,
         control = NULL, seed = NULL, distmat = NULL, ...)

Arguments

data
A list of series, a numeric matrix or a data frame. Matrices are coerced row-wise and data frames column-wise.
type
What type of clustering method to use: "partitional", "hierarchical", "tadpole" or "fuzzy".
k
Number of desired clusters. It may be a numeric vector with different values.
method
One or more linkage methods to use in hierarchical procedures. See hclust. You can provide a character vector to compute different hierarchical cluster structures in one go, or specify method = "all" to use all the available ones.
distance
A supported distance from proxy's dist (see Distance section). Ignored for type = "tadpole".
centroid
Either a supported string or an appropriate function to calculate centroids when using partitional or prototypes for hierarchical/tadpole methods. See Centroid section.
preproc
Function to preprocess data. Defaults to zscore only if centroid = "shape", but will be replaced by a custom function if provided. See Preprocessing section.
dc
Cutoff distance for the TADPole algorithm.
control
Named list of parameters or dtwclustControl object for clustering algorithms. See dtwclustControl. NULL means defaults.
seed
Random seed for reproducibility of partitional and fuzzy algorithms.
distmat
If a cross-distance matrix is already available, it can be provided here so it's re-used. Only relevant if centroid = "pam" or type = "hierarchical". See examples.
...
Additional arguments to pass to dist or a custom function.

Value

An object with formal class dtwclust-class. If control@nrep > 1 and a partitional procedure is used, if length(method) > 1 and hierarchical procedures are used, or if length(k) > 1, a list of such objects is returned.

Partitional Clustering

Stochastic algorithm that creates a hard partition of the data into k clusters, where each cluster has a center/centroid. In the case of time series clustering, the centroids are themselves time series. The cluster centroids are first initialized by randomly selecting some of the series in the data. The distance between each series and each centroid is calculated, and each series is assigned to the cluster whose centroid is closest. The centroids are then updated by using a given rule, and the procedure is repeated until no series changes from one cluster to another, or until the maximum number of iterations* has been reached. The distance and centroid definitions can be specified through the corresponding parameters of this function; see their respective sections below.

Note that it is possible for a cluster to become empty, in which case a new cluster is reinitialized randomly. However, if the algorithm doesn't converge or the overall distance sum increases, it could mean that the chosen value of k is too large, or that the chosen distance measure is not able to assess similarity effectively. The random reinitialization attempts to enforce a certain number of clusters, but can result in instability in the aforementioned cases.
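For illustration, a minimal partitional run could look like the following sketch (the subset, k and seed are arbitrary choices for this example):

# Sketch: partitional clustering with DTW distance and PAM centroids
data(uciCT)  # loads CharTraj and CharTrajLabels
pc <- dtwclust(CharTraj[1:20], type = "partitional", k = 4L,
               distance = "dtw", centroid = "pam", seed = 938)
pc@cluster  # hard cluster membership of each series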

Fuzzy Clustering

This procedure is very similar to partitional clustering, except that each series no longer belongs exclusively to one cluster, but rather belongs to each cluster to a certain degree. For each series, the degrees of membership across clusters must sum to 1. The default implementation uses the fuzzy c-means algorithm, which minimizes an objective function defined in terms of a squared distance, usually the Euclidean (L2) distance, although the definition can be modified; the distance parameter of this function controls which one is utilized. The fuzziness of the clustering can be controlled by means of the fuzziness exponent*.

Bear in mind that the centroid definition of fuzzy c-means requires equal dimensions, which means that all series must have the same length. This problem can be circumvented by applying transformations to the series (see for example D'Urso and Maharaj, 2009). Note that the fuzzy clustering can be transformed to a crisp one by finding the highest membership coefficient for each series. Some of the slots of the object returned by this function assume this, so be careful with interpretation (see dtwclust-class).
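As a sketch, a fuzzy run and its crisp counterpart could look as follows (L2 requires equal lengths, hence the reinterpolation; the subset, k and seed are arbitrary):

data(uciCT)
series <- lapply(CharTraj[1:20], reinterpolate, newLength = 180)
fc <- dtwclust(series, type = "fuzzy", k = 4L,
               distance = "L2", seed = 123)
# Crisp partition via the maximum membership coefficient per row
apply(fc@fcluster, 1L, which.max)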

Hierarchical Clustering

This is a deterministic algorithm that creates a hierarchy of groups by using different linkage methods (see hclust). The linkage method is controlled through the method parameter of this function, with the additional option "all" that uses all of the available methods. The distance to be used can be controlled with the distance parameter. The hierarchy does not imply a specific number of clusters, but one can be induced by cutting the resulting dendrogram (see cutree). As in the fuzzy case, some of the slots of the returned object correspond to a crisp partition, obtained here by cutting the dendrogram so that k clusters are created.
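A minimal sketch (SBD handles the varying series lengths; the linkage method and k are arbitrary here):

data(uciCT)
hc <- dtwclust(CharTraj[1:20], type = "hierarchical", k = 4L,
               distance = "sbd", method = "average")
hc@cluster  # crisp partition induced by cutting the dendrogram at k = 4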

TADPole Clustering

TADPole clustering adopts a relatively new clustering framework and adapts it to time series clustering with DTW. Because of the way it works, it can be considered a kind of Partitioning Around Medoids (PAM). This means that the cluster centers are always elements of the data. However, this algorithm is deterministic, depending on the value of the cutoff distance dc, which can be controlled with the corresponding parameter of this function. The algorithm relies on the DTW lower bounds, which are only defined for time series of equal length. Additionally, it requires a window constraint* for DTW. See the Sakoe-Chiba constraint section below. Unlike the other algorithms, TADPole always uses DTW2 as distance (with a symmetric1 step pattern).
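A sketch of a TADPole call follows; the cutoff dc is data-dependent and the value below is purely illustrative, while the window constraint is mandatory:

data(uciCT)
series <- lapply(CharTraj[1:20], reinterpolate, newLength = 180)
tp <- dtwclust(series, type = "tadpole", k = 4L, dc = 1.5,
               control = list(window.size = 20L))
tp@cluster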

Centroid Calculation

In the case of partitional/fuzzy algorithms, a suitable function should calculate the cluster centers at every iteration. In this case, the centers are themselves time series. Fuzzy clustering uses the standard fuzzy c-means centroid by default. In either case, a custom function can be provided. If one is provided, it will receive the following inputs in the shown order (examples for partitional clustering are shown in parentheses):
  • The whole data list (list(ts1, ts2, ts3))
  • A numeric vector with length equal to the number of series in data, indicating which cluster a series belongs to (c(1L, 2L, 2L))
  • The desired number of total clusters (2L)
  • The current centers in order, in a list (list(center1, center2))
  • The membership vector of the previous iteration (c(1L, 1L, 2L))
  • The elements of ...
Therefore, the function should always include the ellipsis ... in its definition. In the case of fuzzy clustering, the membership vectors (2nd and 5th elements above) are matrices with number of rows equal to the number of series in the data, and number of columns equal to the number of desired clusters; each row must sum to 1. A sketch of such a custom function is shown at the end of this section. The other option is to provide a character string for one of the included implementations. The following options are available:
  • "mean": The average along each dimension. In other words, the average of all $x^j_i$ among the $j$ series that belong to the same cluster for all time points $t_i$.
  • "median": The median along each dimension. Similar to mean.
  • "shape": Shape averaging. By default, all series are z-normalized in this case, since the resulting centroids will also have this normalization. See shape_extraction for more details.
  • "dba": DTW Barycenter Averaging. See DBA for more details.
  • "pam": Partition around medoids (PAM). This basically means that the cluster centers are always one of the time series in the data. In this case, the distance matrix can be pre-computed once using all time series in the data and then re-used at each iteration. It usually saves overhead overall.
  • "fcm": Fuzzy c-means. Only supported for fuzzy clustering and always used for that type of clustering if a string is provided in centroid.
These included implementations check for the special cases where parallelization might be desired. Note that only shape, dba and pam support series of different length. Also note that, for shape and dba, this support has a caveat: the final centroids' length will depend on the length of the series that were randomly chosen at the beginning of the clustering algorithm. For example, if the series in the dataset have a length of either 10 or 15, 2 clusters are desired, and the initial choice selects two series with length of 10, the final centroids will have this same length.

As special cases, if hierarchical or tadpole clustering is used, you can provide a centroid function that takes a list of series as its only input and returns a single centroid series. These centroids are returned in the centers slot. By default, a type of PAM centroid function is used.
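As mentioned above, a hedged sketch of a custom centroid for partitional clustering could look like the following. The argument names are illustrative (the inputs arrive in the order listed earlier), and the element-wise mean assumes equal-length series and non-empty clusters:

# Custom centroid: element-wise mean of the members of each cluster
mean_cent <- function(data, cl_id, k, cent, cl_old, ...) {
    lapply(seq_len(k), function(i) {
        members <- data[cl_id == i]
        Reduce(`+`, members) / length(members)
    })
}

data(uciCT)
series <- lapply(CharTraj[1:20], reinterpolate, newLength = 180)
pc <- dtwclust(series, k = 4L, distance = "L2",
               centroid = mean_cent, seed = 938)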

Distance Measures

The distance measure to be used with partitional, hierarchical and fuzzy clustering can be modified with the distance parameter. It must be a string that represents a compatible distance registered with proxy's dist. Registration is done via pr_DB, and extra parameters can be provided in .... Note that you are free to create your own distance functions and register them. Optionally, you can use one of the following custom implementations (all registered with proxy):
  • "dtw": DTW, optionally with a Sakoe-Chiba/Slanted-band constraint*.
  • "dtw2": DTW with L2 norm and optionally a Sakoe-Chiba/Slanted-band constraint*. Read details below.
  • "dtw_lb": DTW with L1 or L2 norm* and optionally a Sakoe-Chiba constraint*. Some computations are avoided by first estimating the distance matrix with Lemire's lower bound and then iteratively refining with DTW. See dtw_lb. Not suitable for pam.precompute* = TRUE.
  • "lbk": Keogh's lower bound with either L1 or L2 norm* for the Sakoe-Chiba constraint*.
  • "lbi": Lemire's lower bound with either L1 or L2 norm* for the Sakoe-Chiba constraint*.
  • "sbd": Shape-based distance. See SBD for more details.
DTW2 is done with dtw, but it differs from the result you would obtain if you specify L2 as dist.method: with DTW2, pointwise distances (the local cost matrix) are calculated with the L1 norm, each element of the matrix is squared and the result is fed into dtw, which finds the optimum warping path. The square root of the resulting distance is then computed.

Only dtw, dtw2 and sbd support series of different length. The lower bounds are probably unsuitable for direct clustering unless series are very easily distinguishable.

If you create your own distance, register it with proxy, and include the ellipsis (...) in its definition, it will receive the following parameters* (a sketch that consumes them is shown after this list):
  • window.type: Either "none" for a NULL window.size, or "slantedband" otherwise
  • window.size: The provided window size
  • norm: The provided desired norm
  • ...: Any additional parameters provided in the original call's ellipsis
Whether your function makes use of them or not is up to you. If you know that the distance function is symmetric, and you use a hierarchical algorithm, or a partitional algorithm with PAM centroids and pam.precompute* = TRUE, some time can be saved by calculating only half the distance matrix. Therefore, consider setting the symmetric* control parameter to TRUE if this is the case.
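For instance, the following sketch registers a hypothetical "cDTW" entry whose ellipsis picks up the forwarded window.size (all names are illustrative assumptions):

cdtw <- function(x, y, ...) {
    dots <- list(...)
    ws <- dots$window.size  # NULL when no window constraint was set
    if (is.null(ws))
        dtw::dtw(x, y, distance.only = TRUE)$distance
    else
        dtw::dtw(x, y, window.type = "slantedband",
                 window.size = ws, distance.only = TRUE)$distance
}

if (!proxy::pr_DB$entry_exists("cDTW"))
    proxy::pr_DB$set_entry(FUN = cdtw, names = "cDTW",
                           loop = TRUE, type = "metric", distance = TRUE,
                           description = "DTW that consumes window.size")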

Sakoe-Chiba Constraint

A global constraint to speed up the DTW calculations is the Sakoe-Chiba band (Sakoe and Chiba, 1978). To use it, a window width* must be defined. The windowing constraint uses a centered window: the function expects a value in window.size that represents the distance between the point considered and one of the edges of the window. Therefore, if, for example, window.size = 10, the warping for an observation $x_i$ considers the points between $x_{i-10}$ and $x_{i+10}$, resulting in $2(10) + 1 = 21$ observations falling within the window. The computations actually use a slanted-band window, which is equivalent to the Sakoe-Chiba one if series have equal length, and stays along the diagonal of the local cost matrix if series have different length.
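The arithmetic can be verified directly with a trivial illustration:

window.size <- 10L
i <- 50L  # an arbitrary observation index
covered <- (i - window.size):(i + window.size)
length(covered)  # 2(10) + 1 = 21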

Preprocessing

It is strongly advised to use z-normalization in case of centroid = "shape", because the resulting series have this normalization (see shape_extraction). Therefore, zscore is the default in this case. The user can, however, specify a custom function that performs any transformation on the data, but the user must make sure that the format stays consistent, i.e. a list of time series. Setting to NULL means no preprocessing (except for centroid = "shape"). A provided function will receive the data as first and only argument. It is convenient to provide this function if you're planning on using the predict generic.
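As a sketch, any function that maps a list of series to a list of series can be used; the centering function below is only an illustrative assumption:

center_fun <- function(dat) {
    lapply(dat, function(x) x - mean(x))
}

data(uciCT)
pc <- dtwclust(CharTraj[1:20], k = 4L, distance = "sbd",
               centroid = "pam", preproc = center_fun, seed = 123)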

Repetitions

Due to its stochastic nature, partitional clustering is usually repeated* several times with different random seeds to allow for different starting points. This function uses RNGseq to obtain different seed streams for each repetition, utilizing the seed parameter (if provided) to initialize them. If more than one repetition is made, the streams are returned in an attribute called rng.

Technically, you can also perform random repetitions for fuzzy clustering, although it might be difficult to evaluate the results, since they are usually evaluated relative to each other and not in an absolute way. Ideally, the groups wouldn't change too much once the algorithm converges. Multiple values of k can also be provided to get different partitions using any type of clustering.

Repetitions are greatly optimized when PAM centroids are used and the whole distance matrix is precomputed*, since said matrix is re-used for every repetition, and it can be computed in parallel (see the Parallel section).
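A hedged sketch of repeated runs follows (nrep, k and seed are arbitrary); with PAM centroids the distance matrix is computed once and re-used by every repetition:

data(uciCT)
series <- lapply(CharTraj[1:20], reinterpolate, newLength = 180)
reps <- dtwclust(series, k = 4L, distance = "L2", centroid = "pam",
                 seed = 123, control = list(nrep = 5L))
# One object per repetition
sapply(reps, randIndex, y = CharTrajLabels[1:20])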

Parallel Computing

Please note that running tasks in parallel does not guarantee faster computations. The overhead introduced is sometimes too large, and it's better to run tasks sequentially. The user can register a parallel backend, for example with the doParallel package, in order to do the repetitions in parallel, as well as distance and some centroid calculations. Unless each repetition requires at least a few seconds, parallel computing probably isn't worth it. As such, I would only use this feature with shape and DBA centroids, or an expensive distance function like DTW.

If you register a parallel backend, the function will also try to do the calculation of the distance matrices in parallel. This should work with any function registered with dist via pr_DB whose loop flag is set to TRUE. If the function requires special packages to be loaded, provide their names in the packages* slot of control. Note that "dtwclust" is always loaded in each parallel worker, so that doesn't need to be included. Alternatively, you may want to pre-load dtwclust in each worker with clusterEvalQ.

In case of multiple repetitions, each worker gets a repetition task. Otherwise, the tasks (which can be a distance matrix or a centroid calculation) are usually divided into chunks according to the number of workers available.

Notes

The lower bounds are defined only for time series of equal length. DTW and DTW2 don't require this, but they are much slower to compute. The lower bounds are not symmetric, and DTW is only symmetric if series have equal lengths.

Details

Partitional and fuzzy clustering procedures use a custom implementation. Hierarchical clustering is done with hclust. TADPole clustering uses the TADPole function. Specifying type = "partitional", distance = "sbd" and centroid = "shape" is equivalent to the k-Shape algorithm (Paparrizos and Gravano, 2015).
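In other words, the following sketch runs k-Shape (z-normalization is applied by default because centroid = "shape"; the subset, k and seed are arbitrary):

data(uciCT)
ks <- dtwclust(CharTraj[1:20], type = "partitional", k = 4L,
               distance = "sbd", centroid = "shape", seed = 938)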

The data may be a matrix, a data frame or a list. Matrices and data frames are coerced to a list, the former row-wise and the latter column-wise. Only lists can have series with different lengths. Most of the optimizations require series to have the same length, so consider reinterpolating them to save some time (see Ratanamahatana and Keogh, 2004; reinterpolate). No missing values are allowed.

Several parameters can be adjusted with the control argument. See dtwclustControl. In the following sections, elements marked with an asterisk (*) are those that can be adjusted with this argument.

References

Sakoe H and Chiba S (1978). ``Dynamic programming algorithm optimization for spoken word recognition.'' Acoustics, Speech and Signal Processing, IEEE Transactions on, 26(1), pp. 43-49. ISSN 0096-3518, http://doi.org/10.1109/TASSP.1978.1163055.

Ratanamahatana A and Keogh E (2004). ``Everything you know about dynamic time warping is wrong.'' In 3rd Workshop on Mining Temporal and Sequential Data, in conjunction with 10th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining (KDD-2004), Seattle, WA.

Bezdek JC (1981). Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum Press, New York.

D'Urso, P., & Maharaj, E. A. (2009). Autocorrelation-based fuzzy clustering of time series. Fuzzy Sets and Systems, 160(24), 3565-3589.

Paparrizos J and Gravano L (2015). ``k-Shape: Efficient and Accurate Clustering of Time Series.'' In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, series SIGMOD '15, pp. 1855-1870. ISBN 978-1-4503-2758-9, http://doi.org/10.1145/2723372.2737793.

See Also

dtwclust-methods, dtwclust-class, dtwclustControl, dtwclustFamily.

Examples

## NOTE: All examples should work, some are simply set to not run automatically

# Load data
data(uciCT)

# Controls
ctrl <- new("dtwclustControl", trace = TRUE)

# ====================================================================================
# Simple partitional clustering with Euclidean distance and PAM centroids
# ====================================================================================

# Reinterpolate to same length and coerce as matrix
data <- t(sapply(CharTraj, reinterpolate, newLength = 180))

# Subset for speed
data <- data[1:20, ]
labels <- CharTrajLabels[1:20]

# Testing several values of k
kc.l2 <- dtwclust(data, k = 2:4,
                  distance = "L2", centroid = "pam",
                  seed = 3247, control = ctrl)

cat("Rand indices for L2+PAM:\n")
sapply(kc.l2, randIndex, y = labels)

plot(kc.l2[[3L]])

# ====================================================================================
# Hierarchical clustering with Euclidean distance
# ====================================================================================

# Re-use the distance matrix from the previous example
hc.l2 <- update(kc.l2[[3L]], k = 4L,
                type = "hierarchical", method = "all",
                distmat = kc.l2[[3L]]@distmat)

cat("Rand indices for L2+HC:\n")
sapply(hc.l2, randIndex, y = labels)

# ====================================================================================
# Registering a custom distance with the 'proxy' package and using it
# ====================================================================================

# Normalized DTW distance
ndtw <- function(x, y, ...) {
     dtw::dtw(x, y, step.pattern = symmetric2,
              distance.only = TRUE, ...)$normalizedDistance
}

# Registering the function with 'proxy'
if (!proxy::pr_DB$entry_exists("nDTW"))
     proxy::pr_DB$set_entry(FUN = ndtw, names=c("nDTW"),
                            loop = TRUE, type = "metric", distance = TRUE,
                            description = "Normalized DTW")

# Subset of (original) data for speed
kc.ndtw <- dtwclust(CharTraj[31:39], k = 2L,
                    distance = "nDTW",
                    control = ctrl, seed = 8319)

# Where would series 40 go?
predict(kc.ndtw, newdata = CharTraj[40])

# ====================================================================================
# Hierarchical clustering based on shape-based distance
# ====================================================================================

# Original data
hc.sbd <- dtwclust(CharTraj, type = "hierarchical",
                   k = 20, method = "all",
                   distance = "sbd", control = ctrl)

# Plot dendrogram of the best result according to the Rand index
best.sbd <- which.max(sapply(hc.sbd, randIndex, y = CharTrajLabels))
plot(hc.sbd[[best.sbd]])

# Plot subset of crisp partition with k = 20
plot(hc.sbd[[best.sbd]], clus = 1, type = "sc")

# ====================================================================================
# Autocorrelation-based fuzzy clustering (see D'Urso and Maharaj 2009)
# ====================================================================================

# Calculate autocorrelation up to 50th lag, considering a list of time series as input
acf_fun <- function(dat) {
     lapply(dat, function(x) as.numeric(acf(x, lag.max = 50, plot = FALSE)$acf))
}

# Fuzzy c-means
fc <- dtwclust(CharTraj[1:25], type = "fuzzy", k = 5,
               preproc = acf_fun, distance = "L2",
               seed = 123, control = ctrl)

# Fuzzy memberships
print(fc@fcluster)

# Plot crisp partition in the original space
plot(fc, data = CharTraj[1:25], type = "series")

# Membership for new series
# (It's important that 'preproc' was specified in the original call)
predict(fc, newdata = CharTraj[26:28])

## Not run: 
# # ====================================================================================
# # TADPole clustering
# # ====================================================================================
# 
# ctrl@window.size <- 20L
# 
# kc.tadp <- dtwclust(data, type = "tadpole", k = 4L,
#                     dc = 1.5, control = ctrl)
# 
# cat("Rand index for TADPole:", randIndex(kc.tadp, labels), "\n\n")
# 
# # Modify plot
# plot(kc.tadp, clus = 3:4,
#      labs.arg = list(title = "TADPole, clusters 3 and 4",
#                      x = "time", y = "series"))
# 
# # ====================================================================================
# # Saving and modifying the ggplot object with custom time labels
# # ====================================================================================
# 
# t <- seq(Sys.Date(), len = 180, by = "day")
# gkc <- plot(kc.tadp, time = t, plot = FALSE)
# 
# require(scales)
# gkc + scale_x_date(labels = date_format("%b-%Y"),
#                    breaks = date_breaks("2 months"))
# 
# # ====================================================================================
# # Specifying a centroid function for prototype extraction in hierarchical clustering
# # ====================================================================================
# 
# # Seed is due to possible randomness in shape_extraction when selecting a basis series
# hc <- dtwclust(CharTraj, type = "hierarchical",
#                k = 20, distance = "sbd",
#                preproc = zscore, centroid = shape_extraction,
#                seed = 320)
# 
# plot(hc, type = "sc")
# 
# # ====================================================================================
# # Using parallel computation to optimize several random repetitions
# # and distance matrix calculation
# # ====================================================================================
# require(doParallel)
# 
# # Create parallel workers
# cl <- makeCluster(detectCores())
# invisible(clusterEvalQ(cl, library(dtwclust)))
# registerDoParallel(cl)
# 
# ## Use constrained DTW and PAM (~20 seconds with 8 cores)
# kc.dtw <- dtwclust(CharTraj, k = 20, seed = 3251, control = ctrl)
# 
# ## Use constrained DTW with DBA centroids (~20 seconds with 8 cores)
# kc.dba <- dtwclust(CharTraj, k = 20, centroid = "dba", seed = 3251, control = ctrl)
# 
# ## This uses the "nDTW" function registered in another example above
# # For reference, this took around 2.25 minutes with 8 cores (all 8 repetitions).
# kc.ndtw.reps <- dtwclust(CharTraj, k = 20, distance = "nDTW", centroid = "dba",
#                          preproc = zscore, seed = 8319,
#                          control = list(window.size = 10L, nrep = 8L))
# 
# # See Rand index for each repetition
# sapply(kc.ndtw.reps, randIndex, y = CharTrajLabels)
# 
# # Stop parallel workers
# stopCluster(cl)
# 
# # Return to sequential computations
# registerDoSEQ()
# ## End(Not run)
