dtwclust (version 1.3.0)

dtwclust: Time series clustering under DTW

Description

This function uses the DTW distance and related lower bounds to cluster time series. For now, all series must be univariate.

Usage

dtwclust(data = NULL, type = "partitional", k = 2L, method = "average",
  distance = "dtw", centroid = "pam", preproc = NULL,
  window.size = NULL, norm = "L1", dc = NULL, dba.iter = 15L,
  pam.precompute = TRUE, control = NULL, save.data = TRUE, seed = NULL,
  trace = FALSE, reps = 1L, packages = character(0), ...)

Arguments

data
A list where each element is a time series, or a numerical matrix where each row is a time series. All series must have equal lengths in case of type = "tadpole".
type
What type of clustering method to use, partitional, hierarchical or tadpole.
k
Number of desired clusters in partitional methods.
method
Which linkage method to use in hierarchical methods. See hclust.
distance
One of the supported distance definitions (see Distance section). Ignored for type = "tadpole".
centroid
Either a supported string or an appropriate function to calculate centroids when using partitional methods (see Centroid section).
preproc
Function to preprocess data. Defaults to zscore only if centroid = "shape", but will be replaced by a custom function if provided. See Preprocessing section.
window.size
Window constraint for DTW and LB calculations. See Sakoe-Chiba section.
norm
Pointwise distance for DTW, DBA and the LB. Either L1 for Manhattan distance or L2 for Euclidean. Ignored for distance = "DTW" (which always uses L1) and distance = "DTW2" (which always uses L2).
dc
Cutoff distance for TADPole algorithm.
dba.iter
Maximum number of iterations for DBA centroids.
pam.precompute
Precompute the whole distance matrix once and reuse it at each iteration if using PAM centroids. Otherwise calculate distances at every iteration.
control
Parameters for partitional clustering algorithms. See flexclustControl.
save.data
Return a copy of the data in the returned object? Ignored for hierarchical clustering.
seed
Random seed for reproducibility of partitional algorithms.
trace
Logical flag. If TRUE, more output regarding the progress is printed to screen.
reps
How many times to repeat partitional clustering with different starting points. See section Repetitions.
packages
Character vector with the names of any packages required for custom proxy functions. See Parallel Computing section.
...
Additional arguments to pass to dist or a custom function.

Value

  • An object with formal class dtwclust-class if type = "partitional" | "tadpole". Otherwise an object with class hclust as returned by hclust. If reps > 1 and a partitional procedure is used, a list of objects is returned.

Distance

If a custom distance function is provided, it will receive the data as the first argument.

For partitional algorithms, the second argument will be the cluster centers (i.e. other time series). If data is a matrix, it will be coerced to a list of series, and the centers will also be provided in the form of a list.

If hierarchical algorithms are used, the function will also receive the elements of ....

The function should return a distance matrix, ideally of class crossdist. The time series in the data should be along the rows, and the cluster centers along the columns of the distance matrix.
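A minimal sketch of such a function follows. The name l2Cross is hypothetical, and it assumes x is a list of series and centers a list of centroids, all of equal length; it returns a matrix with the series of x along the rows and the centers along the columns, as described above:

```r
# Hypothetical custom distance for partitional clustering: plain
# Euclidean distance between each series and each cluster center.
l2Cross <- function(x, centers, ...) {
    # sapply over the centers builds the matrix column by column,
    # giving length(x) rows and length(centers) columns.
    sapply(centers, function(cen) {
        sapply(x, function(series) sqrt(sum((series - cen)^2)))
    })
}
```

Such a function could then be passed directly in the distance argument.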

The other option is to provide a string. The string can represent a compatible registered distance of proxy's dist. Extra parameters can be provided in .... See the examples but please finish reading this section before that.

Additionally, with either type of algorithm, it can be a string of one of the following custom implementations (all registered with proxy):

  • "dtw": DTW with L1 norm and optionally a Sakoe-Chiba/Slanted-band constraint.
  • "dtw2": DTW with L2 norm and optionally a Sakoe-Chiba/Slanted-band constraint.
  • "dtw_lb": DTW with L1 or L2 norm and optionally a Sakoe-Chiba constraint. Some computations are avoided by first estimating the distance matrix with Lemire's lower bound and then iteratively refining with DTW. See dtw_lb.
  • "lbk": Keogh's lower bound with either L1 or L2 norm for the Sakoe-Chiba constraint.
  • "lbi": Lemire's lower bound with either L1 or L2 norm for the Sakoe-Chiba constraint.
  • "sbd": Shape-based distance. Each series is z-normalized in this case. As a result, the cluster centers (for partitional methods) are also z-normalized. See SBD for more details.

Note that only dtw, dtw2 and sbd support series of different lengths.

If users want to create their own distance and register it with proxy, the function should include a ... in its definition so that it is called correctly. Such functions will always receive the following parameters:

  • window.type: Either "none" for a NULL window.size, or "slantedband" otherwise
  • window.size: The provided window size
  • norm: The norm provided
  • ...: Any additional parameters provided in the original call's ellipsis

Whether the function makes use of them or not is up to the user.
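As a sketch, a proxy-compatible function could look like the following. The name wdtw is hypothetical, the body simply delegates to dtw::dtw, and the mapping of the forwarded parameters onto dtw::dtw's arguments is an assumption based on the list above (norm is captured but unused here):

```r
# Hypothetical proxy-compatible distance. The '...' ensures any extra
# parameters forwarded by dtwclust can be captured without error.
wdtw <- function(x, y, window.type = "none", window.size = NULL,
                 norm = "L1", ...) {
    if (window.type == "none") {
        d <- dtw::dtw(x, y, distance.only = TRUE)
    } else {
        # dtwclust passes "slantedband" whenever window.size is provided
        d <- dtw::dtw(x, y, window.type = "slantedband",
                      window.size = window.size, distance.only = TRUE)
    }
    d$distance
}

# Registering it would make it available via distance = "wDTW":
# proxy::pr_DB$set_entry(FUN = wdtw, names = "wDTW", loop = TRUE,
#                        type = "metric", distance = TRUE)
```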

Centroid

In the case of partitional algorithms, a suitable function should calculate the cluster centers. In this case, the centers are themselves time series.

If a custom function is provided, it will receive different inputs depending on the format of data:

  • For matrix input, it will receive a matrix as single input. Each row will be a series that belongs to a given cluster. The function should return a numeric vector with the centroid time series.
  • For a list input, the function will receive three inputs in the following order: the whole data list; a numeric vector with length equal to the number of series in data, indicating which cluster a series belongs to; the current number of total clusters.
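Minimal sketches of both cases follow. The names are illustrative, the series within a cluster are assumed to have equal lengths, and the return value of the list-input version (one centroid per cluster) is an assumption, since it is not spelled out above:

```r
# Matrix input: receives the rows of one cluster,
# returns a single centroid series as a numeric vector.
matCentroid <- function(cluster_rows) colMeans(cluster_rows)

# List input: receives the whole data list, the membership vector
# and the total number of clusters; returns one centroid per cluster
# by averaging the member series element-wise.
listCentroid <- function(data, cl_id, k) {
    lapply(1:k, function(i) {
        members <- data[cl_id == i]
        Reduce(`+`, members) / length(members)
    })
}
```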

The other option is to provide a character string for the custom implementations. The following options are available:

  • "mean": The average along each dimension. In other words, the average of all $x^j_i$ among the $j$ series that belong to the same cluster for all time points $t_i$.
  • "median": The median along each dimension. Similar to mean.
  • "shape": Shape averaging. See shape_extraction for more details.
  • "dba": DTW Barycenter Averaging. See DBA for more details.
  • "pam": Partition around medoids. This basically means that the cluster centers are always one of the time series in the data. In this case, the distance matrix can be pre-computed once using all time series in the data and then re-used at each iteration. It usually saves overhead overall.

Note that only dba and pam support series of different lengths.

Sakoe-Chiba Constraint

A global constraint to speed up the DTW calculation is the Sakoe-Chiba band (Sakoe and Chiba, 1978). To use it, a window width must be defined via window.size.

The windowing constraint uses a centered window. The calculations expect a value in window.size that represents the distance between the point considered and one of the edges of the window. Therefore, if, for example, window.size = 10, the warping for an observation $x_i$ considers the points between $x_{i-10}$ and $x_{i+10}$, resulting in 10*2 + 1 = 21 observations falling within the window.
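The arithmetic above can be checked directly:

```r
# With window.size = 10, the window around observation i covers the
# indices (i - 10):(i + 10), i.e. 2 * 10 + 1 observations.
window.size <- 10
length((-window.size):window.size)  # 21
```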

The computations actually use a slanted-band window, which is equivalent to the Sakoe-Chiba one if the series have equal lengths, and stays along the diagonal of the local cost matrix if the series have different lengths.

Preprocessing

It is strongly advised to use z-normalization in case of centroid = "shape", because the resulting series have this normalization (see shape_extraction). The user can, however, specify a custom function that performs any transformation on the data, but the user must make sure that the format stays consistent, i.e. a matrix where each row is a series or a list of time series. For example, the z-normalization could be implemented as t(apply(data, 1, zscore)) or lapply(data, zscore) respectively.
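The two inline expressions above can be wrapped as complete preprocessing functions. The names are illustrative; the ... is included because hierarchical methods forward extra arguments to preproc:

```r
# Z-normalization for each supported data format, using dtwclust's zscore.

# Matrix input: series are along the rows, so apply over rows
# and transpose back so each row is still one series.
preproc_matrix <- function(data, ...) t(apply(data, 1, zscore))

# List input: normalize each series independently.
preproc_list <- function(data, ...) lapply(data, zscore)
```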

The function will receive the data as first argument and, in case hierarchical methods are used, the contents of ... as the second argument.

Repetitions

Due to its stochastic nature, partitional clustering is usually repeated several times with different random seeds to allow for different starting points. This function can run several repetitions by means of the doRNG package, which ensures that each repetition uses a statistically independent random sequence. The user only needs to provide an initial seed in the corresponding parameter of this function.

Repetitions are greatly optimized when PAM centroids are used and the whole distance matrix is precomputed, since said matrix is reused for every repetition and can be computed in parallel (see next section).

Parallel Computing

Please note that running tasks in parallel does not guarantee faster computations. The overhead introduced is sometimes too large, and it's better to run tasks sequentially.

The user can register a parallel backend with the doParallel package (and possibly any other package compatible with foreach's %dopar% operator) in order to do the repetitions in parallel, as well as distance calculations (see the examples). TADPole also takes advantage of parallel support.

Unless each repetition takes at least a few seconds, parallel computing probably isn't worth it. As such, I would only use this feature with shape and DBA centroids, or with pam.precompute = FALSE, a relatively small k, and an expensive distance function like DTW.

If you do more than one repetition sequentially, you can safely ignore the warning given by %dopar% about the lack of a registered parallel backend.

If the user registers a parallel backend, the function will also try to do the calculation of the distance matrices in parallel. This should work with any function registered with dist via pr_DB whose loop flag is set to TRUE. If the function requires special packages to be loaded, provide their names in the packages argument. In addition, "dtwclust" is always loaded in each parallel worker, so the user doesn't need to include that.

Note that, by default, if a parallel backend is registered, multiple repetitions are to be performed (partitional clustering with reps > 1) AND centroid != "pam", each parallel worker will get a repetition task, and any distance calculations within each worker will be done sequentially. Load balance in such a scenario should be fine as long as reps is at least the number of parallel workers. If you believe your task would benefit more from parallelization within each repetition, consider registering the parallel backend and calling dtwclust several times sequentially, with reps = 1 and different seeds.

Notes

Notice that the lower bounds are defined only for time series of equal lengths. DTW and DTW2 don't require this, but they are much slower to compute.

The lower bounds are not symmetrical, and DTW is only symmetrical if series are of equal lengths.

Specifying distance = "sbd" and centroid = "shape" is equivalent to the k-Shape algorithm (Paparrizos and Gravano, 2015). See SBD and shape_extraction for more info.

Details

Partitional algorithms are implemented via kcca. Hierarchical algorithms use the hclust function. The tadpole algorithm uses the TADPole function.

The data may be a matrix or a list, but the matrix will be coerced to a list. A matrix input requires that all time series have equal lengths. If the lengths vary slightly between time series, reinterpolating them to a common length is most likely an acceptable approach (Ratanamahatana and Keogh, 2004). If this is not the case, then clustering them directly is probably ill-advised. See the examples.

References

Sakoe H and Chiba S (1978). ``Dynamic programming algorithm optimization for spoken word recognition.'' Acoustics, Speech and Signal Processing, IEEE Transactions on, 26(1), pp. 43-49. ISSN 0096-3518, http://doi.org/10.1109/TASSP.1978.1163055.

Ratanamahatana A and Keogh E (2004). ``Everything you know about dynamic time warping is wrong.'' In 3rd Workshop on Mining Temporal and Sequential Data, in conjunction with 10th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining (KDD-2004), Seattle, WA.

Paparrizos J and Gravano L (2015). ``k-Shape: Efficient and Accurate Clustering of Time Series.'' In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, series SIGMOD '15, pp. 1855-1870. ISBN 978-1-4503-2758-9, http://doi.org/10.1145/2723372.2737793.

See Also

Please check the brief description in dtwclust-package.

Additionally: plot-dtwclust, dtwclust-class.

Examples

#### Load data
data(uciCT)

# Reinterpolate to same length and coerce as matrix
data <- t(sapply(CharTraj, reinterpolate, newLength = 180))

# Subset for speed
data <- data[1:20, ]
labels <- CharTrajLabels[1:20]

#### Simple partitional clustering with L2 distance and PAM
kc.l2 <- dtwclust(data, k = 4, distance = "L2", centroid = "pam",
                  seed = 3247, trace = TRUE)
cat("Rand index for L2+PAM:", randIndex(kc.l2, labels), "\n\n")

#### TADPole clustering
kc.tadp <- dtwclust(data, type = "tadpole", k = 4,
                    window.size = 20, dc = 1.5,
                    trace = TRUE)
cat("Rand index for TADPole:", randIndex(kc.tadp, labels), "\n\n")
plot(kc.tadp)

# Modify plot
plot(kc.tadp, clus = 3:4, labs.arg = list(title = "TADPole, clusters 3 and 4",
                                          x = "time", y = "series"))

#### Registering a custom distance with the 'proxy' package and using it
# Normalized DTW distance
ndtw <- function(x, y, ...) {
  dtw::dtw(x, y, step.pattern = symmetric2,
           distance.only = TRUE, ...)$normalizedDistance
}

# Registering the function with 'proxy'
proxy::pr_DB$set_entry(FUN = ndtw, names=c("nDTW"),
                       loop = TRUE, type = "metric", distance = TRUE,
                       description = "Normalized DTW with L1 norm")

# Subset of (original) data for speed
# Change pam.precompute to FALSE to see time difference
kc.ndtw <- dtwclust(CharTraj[31:40], distance = "nDTW",
                    trace = TRUE, pam.precompute = TRUE,
                    seed = 8319)
cat("Rand index for nDTW (subset):",
    randIndex(kc.ndtw, CharTrajLabels[31:40]), "\n\n")
plot(kc.ndtw)

#### Hierarchical clustering based on shape-based distance (different lengths)
hc.sbd <- dtwclust(CharTraj, type = "hierarchical",
                   distance = "sbd", trace = TRUE)
cl.sbd <- cutree(hc.sbd, 20)
cat("Rand index for HC+SBD:", randIndex(cl.sbd, CharTrajLabels), "\n\n")

#### Saving and modifying the ggplot object with custom time
t <- seq(Sys.Date(), length.out = 180, by = "day")
gkc <- plot(kc.l2, time = t, plot = FALSE)

require(scales)
gkc + scale_x_date(labels = date_format("%b-%Y"),
                   breaks = date_breaks("2 months"))

#### Using parallel computation to optimize several random repetitions
#### and distance matrix calculation
require(doParallel)

# Create parallel workers
cl <- makeCluster(detectCores())
registerDoParallel(cl)

## Use full DTW and PAM
kc.dtw <- dtwclust(CharTraj, k = 20, seed = 3251, trace = TRUE)

## Use full DTW with DBA centroids
kc.dba <- dtwclust(CharTraj, k = 20, centroid = "dba", seed = 3251, trace = TRUE)

## Use constrained DTW with original series of different lengths
kc.cdtw <- dtwclust(CharTraj, k = 20, window.size = 20,
                    seed = 3251, trace = TRUE)

## This uses the "nDTW" function registered in another example above
# For reference, this took around 2.25 minutes with 8 cores (all 8 repetitions).
kc.ndtw.list <- dtwclust(CharTraj, k = 20, distance = "nDTW",
                         centroid = "dba", window.size = 10,
                         preproc = zscore, seed = 8319,
                         save.data = FALSE, reps = 8L)

# Stop parallel workers
stopCluster(cl)

# Return to sequential computations
registerDoSEQ()

# See Rand Index for each repetition
sapply(kc.ndtw.list, randIndex, y = CharTrajLabels)
