Ckmeans.1d.dp-package: Optimal, Fast, and Reproducible Univariate Clustering

Description

This package provides a powerful set of tools for univariate data analysis with guaranteed optimality, efficiency, and reproducibility. Four problems including univariate \(k\)-means, \(k\)-median, \(k\)-segments, and multi-channel weighted \(k\)-means are solved with guaranteed optimality and reproducibility. The core algorithm minimizes the (weighted) sum of within-cluster distances using respective metrics. Its advantage over heuristic clustering in efficiency and accuracy is increasingly pronounced as the number of clusters \(k\) increases. Weighted \(k\)-means can also optimally segment time series to perform peak calling. An auxiliary function generates histograms that are adaptive to patterns in data.

The Ckmeans.1d.dp algorithm clusters (weighted) univariate data given by a numeric vector \(x\) into \(k\) groups by dynamic programming Wang2011Ckmeans,song2020wucCkmeans.1d.dp. It guarantees the optimality of clustering---the total of within-cluster sums of squares is always the minimum given the number of clusters \(k\). In contrast, heuristic univariate clustering algorithms may be non-optimal or inconsistent from run to run. As non-negative weights are supported, the algorithm can partition a time course using time points as input and corresponding values as weight. Based on an optimal clustering, a function can generate histograms adaptive to patterns in data. The implementation of this algorithm is numerically stable.

A linear time solution. Excluding the time for sorting \(x\), the weighted univariate clustering algorithm takes a runtime of \(O(kn)\) song2020wucCkmeans.1d.dp, linear in both sample size \(n\) and the number of clusters \(k\). The approach was first introduced into version 3.4.9 (not publicly released) on July 16, 2016, using a new divide-and-conquer strategy integrating a previous theoretical result on matrix search aggarwal1987geometricCkmeans.1d.dp and a novel in-place search space reduction method song2020wucCkmeans.1d.dp.

A log-linear time solution. Since version 3.4.6, a divide-and-conquer strategy that is simple to code reduces the runtime from \(O(kn^2)\) down to \(O(kn\lg n)\) song2020wucCkmeans.1d.dp.

A quadratic time solution. Before version 3.4.6, Ckmeans.1d.dp uses an algorithm that runs in quadratic runtime \(O(kn^2)\) Wang2011CkmeansCkmeans.1d.dp.

A cubic time solution. Bellman1973;textualCkmeans.1d.dp first described a general dynamic programming strategy for solving univariate clustering problems with additive optimality measures. The strategy, however, did not address any specific characteristics of the \(k\)-means problem and its implied general algorithm will have a time complexity of \(O(kn^3)\) on an input of \(n\) points.

The space complexity in all cases is \(O(kn)\).

This package provides a powerful alternative to heuristic clustering and also new functionality for weighted clustering, segmentation, and peak calling with guaranteed optimality. It is practical for Ckmeans.1d.dp to find a few clusters on millions of sample points with optional weights in seconds using a single core on a typical desktop computer.

Third parties have ported various versions of this package to JavaScript, MATLAB, Python, Ruby, SAS, and Scala.

Arguments

Disclaimer

This program is free software: you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see https://www.gnu.org/licenses/.

Author

Joe Song, Hua Zhong, and Haizhou Wang

Details

The most important function of this package is Ckmeans.1d.dp that implements optimal univariate clustering either weighted or unweighted. It also contains an adaptive histogram function ahist, plotting functions plot.Ckmeans.1d.dp and plotBIC, and a print function print.Ckmeans.1d.dp.