Learn R Programming

R package stream - Infrastructure for Data Stream Mining

The package provides support for modeling and simulating data streams as well as an extensible framework for implementing, interfacing and experimenting with algorithms for various data stream mining tasks. The main advantage of stream is that it seamlessly integrates with the large existing infrastructure provided by R. The package provides:

  • Stream Sources: streaming from files, databases, in-memory data, URLs, pipes, socket connections and several data stream generators including dynamically streams with concept drift.
  • Stream Processing with filters (convolution, scaling, exponential moving average, …)
  • Stream Aggregation: sampling, windowing.
  • Stream Clustering: BICO, BIRCH, D-Stream, DBSTREAM, and evoStream.
  • Stream Outlier Detection based on D-Stream, DBSTREAM.
  • Stream Classification with DecisionStumps, HoeffdingTree, NaiveBayes and Ensembles (streamMOA via RMOA).
  • Stream Regression with Perceptron, FIMTDD, ORTO, … (streamMOA via RMOA).
  • Stream Mining Evaluation with prequential error estimation.

Additional packages in the stream family are:

  • streamMOA: Interface to clustering algorithms implemented in the MOA framework. The package interfaces clustering algorithms like of DenStream, ClusTree, CluStream and MCOD. The package also provides an interface to RMOA for MOA’s stream classifiers and stream regression models.
  • rEMM: Provides implementations of threshold nearest neighbor clustering (tNN) and Extensible Markov Model (EMM) for modelling temporal relationships between clusters.

Installation

Stable CRAN version: Install from within R with

install.packages("stream")

Current development version: Install from r-universe.

install.packages("stream", repos = "https://mhahsler.r-universe.dev")

Usage

Load the package and a random data stream with 3 Gaussian clusters and 10% noise and scale the data to z-scores.

library("stream")
set.seed(2000)

stream <- DSD_Gaussians(k = 3, d = 2, noise = 0.1) %>%
    DSF_Scale()
get_points(stream, n = 5)
##       X1     X2 .class
## 1 -0.267 -0.802      2
## 2  0.531  1.078     NA
## 3 -0.706  1.427      3
## 4 -0.781  1.355      3
## 5  1.170 -0.712      1
plot(stream)

Cluster a stream of 1000 points using D-Stream which estimates point density in grid cells.

dsc <- DSC_DStream(gridsize = 0.1)
update(dsc, stream, 1000)
plot(dsc, stream, grid = TRUE)

evaluate_static(dsc, stream, n = 100)
## Evaluation results for micro-clusters.
## Points were assigned to micro-clusters.
## 
##             numPoints      numMicroClusters      numMacroClusters 
##              100.0000               65.0000                3.0000 
##        noisePredicted                   SSQ            silhouette 
##               23.0000                0.1696                0.0786 
##       average.between        average.within          max.diameter 
##                1.7809                0.5816                3.9368 
##        min.separation ave.within.cluster.ss                    g2 
##                0.0146                0.5217                0.1596 
##          pearsongamma                  dunn                 dunn2 
##                0.0637                0.0037                0.0154 
##               entropy              wb.ratio            numClasses 
##                3.1721                0.3266                4.0000 
##           noiseActual        noisePrecision        outlierJaccard 
##               16.0000                0.6957                0.6957 
##             precision                recall                    F1 
##                0.6170                0.1618                0.2563 
##                purity             Euclidean             Manhattan 
##                0.9920                0.1633                0.3000 
##                  Rand                 cRand                   NMI 
##                0.7620                0.1688                0.5551 
##                    KP                 angle                  diag 
##                0.2651                0.3000                0.3000 
##                    FM               Jaccard                    PS 
##                0.3159                0.1470                0.0541 
##                    vi 
##                2.2264 
## attr(,"type")
## [1] "micro"
## attr(,"assign")
## [1] "micro"

Outlier detection using DBSTREAM which uses micro-clusters with a given radius.

dso <- DSOutlier_DBSTREAM(r = 0.1)
update(dso, stream, 1000)
plot(dso, stream)

evaluate_static(dso, stream, n = 100, measure = c("numPoints", "noiseActual", "noisePredicted",
    "noisePrecision"))
## Evaluation results for micro-clusters.
## Points were assigned to micro-clusters.
## 
##      numPoints    noiseActual noisePredicted noisePrecision 
##            100              7              7              1 
## attr(,"type")
## [1] "micro"
## attr(,"assign")
## [1] "micro"

Preparing complete stream process pipelines that can be run using a single update() call.

pipeline <- DSD_Gaussians(k = 3, d = 2, noise = 0.1) %>%
    DSF_Scale() %>%
    DST_Runner(DSC_DStream(gridsize = 0.1))
pipeline
## DST pipline runner
## DSD: Gaussian Mixture (d = 2, k = 3)
## + scaled
## DST: D-Stream 
## Class: DST_Runner, DST
update(pipeline, n = 500)
##     weight    X1    X2
## 1    0.812 -1.75 -0.65
## 2    0.888 -1.75 -0.55
## 3    1.738 -1.65 -1.05
## 4    0.865 -1.65 -0.75
## 5    3.333 -1.65 -0.65
## 6    1.890 -1.65 -0.55
## 7    1.677 -1.55 -0.95
## 8    1.590 -1.55 -0.85
## 9    1.432 -1.55 -0.55
## 10   0.773 -1.55 -0.45
## 11   3.288 -1.45 -0.95
## 12   1.712 -1.45 -0.85
## 13   3.466 -1.45 -0.75
## 14   2.514 -1.45 -0.65
## 15   1.582 -1.35 -1.15
## 16   1.804 -1.35 -1.05
## 17   4.806 -1.35 -0.95
## 18   5.170 -1.35 -0.85
## 19   2.521 -1.35 -0.65
## 20   0.803 -1.35 -0.55
## 21   0.973 -1.25 -1.15
## 22   0.842 -1.25 -1.05
## 23   4.945 -1.25 -0.95
## 24   4.176 -1.25 -0.85
## 25   4.267 -1.25 -0.75
## 26   3.513 -1.25 -0.65
## 27   1.585 -1.25 -0.55
## 28   0.961 -1.25 -0.45
## 29   0.825 -1.15 -1.15
## 30   1.819 -1.15 -1.05
## 31   0.846 -1.15 -0.95
## 32   3.410 -1.15 -0.85
## 33   1.716 -1.15 -0.75
## 34   0.765 -1.05 -1.15
## 35   4.891 -1.05 -1.05
## 36   2.603 -1.05 -0.95
## 37   6.110 -1.05 -0.85
## 38   0.957 -1.05 -0.65
## 39   1.895 -1.05 -0.55
## 40   0.788 -1.05 -0.45
## 41   2.581 -0.95 -1.15
## 42   2.556 -0.95 -1.05
## 43   3.479 -0.95 -0.95
## 44   4.340 -0.95 -0.85
## 45   2.560 -0.95 -0.75
## 46   0.845 -0.95 -0.55
## 47   2.898 -0.85 -1.05
## 48   2.616 -0.85 -0.95
## 49   2.545 -0.85 -0.85
## 50   1.556 -0.75 -1.05
## 51   0.965 -0.75 -0.95
## 52   0.771 -0.65 -1.05
## 53   0.930 -0.05  1.15
## 54   1.656  0.05  0.95
## 55   3.891  0.05  1.05
## 56   1.738  0.05  1.15
## 57   0.891  0.05  1.35
## 58   0.769  0.15  0.85
## 59   1.886  0.15  0.95
## 60   2.487  0.15  1.05
## 61   3.342  0.15  1.15
## 62   4.257  0.15  1.25
## 63   1.833  0.15  1.35
## 64   0.877  0.15  1.45
## 65   0.886  0.25  0.75
## 66   0.763  0.25  0.85
## 67   3.426  0.25  0.95
## 68   3.979  0.25  1.05
## 69   5.465  0.25  1.15
## 70   1.778  0.25  1.25
## 71   0.776  0.25  1.45
## 72   1.739  0.35 -0.35
## 73   0.749  0.35 -0.25
## 74   1.651  0.35 -0.15
## 75   0.959  0.35  0.65
## 76   2.146  0.35  0.75
## 77   1.510  0.35  0.85
## 78   2.335  0.35  0.95
## 79   5.286  0.35  1.05
## 80   5.187  0.35  1.15
## 81   3.046  0.35  1.25
## 82   3.268  0.35  1.35
## 83   1.489  0.35  1.45
## 84   0.957  0.35  1.55
## 85   0.757  0.45 -0.55
## 86   3.486  0.45 -0.45
## 87   1.584  0.45 -0.35
## 88   1.948  0.45 -0.25
## 89   4.212  0.45 -0.15
## 90   1.520  0.45 -0.05
## 91   0.722  0.45  0.65
## 92   2.474  0.45  0.75
## 93   2.579  0.45  0.85
## 94   3.733  0.45  0.95
## 95   4.920  0.45  1.05
## 96   3.280  0.45  1.15
## 97   1.693  0.45  1.25
## 98   1.735  0.45  1.35
## 99   0.945  0.45  1.45
## 100  0.821  0.55 -0.55
## 101  2.516  0.55 -0.45
## 102  4.479  0.55 -0.35
## 103  1.700  0.55 -0.25
## 104  1.714  0.55  0.85
## 105  4.318  0.55  0.95
## 106  2.532  0.55  1.05
## 107  1.520  0.55  1.15
## 108  2.543  0.55  1.25
## 109  1.651  0.55  1.35
## 110  0.785  0.55  1.45
## 111  0.980  0.65 -0.65
## 112  0.791  0.65 -0.55
## 113  5.131  0.65 -0.45
## 114  7.010  0.65 -0.35
## 115  0.758  0.65 -0.15
## 116  1.573  0.65  0.75
## 117  0.837  0.65  0.85
## 118  1.638  0.65  0.95
## 119  5.988  0.65  1.05
## 120  1.602  0.65  1.15
## 121  1.776  0.65  1.25
## 122  0.861  0.65  1.35
## 123  0.849  0.75 -0.85
## 124  0.893  0.75 -0.75
## 125  2.393  0.75 -0.65
## 126  2.672  0.75 -0.55
## 127  3.460  0.75 -0.45
## 128  5.000  0.75 -0.35
## 129  2.555  0.75 -0.25
## 130  2.245  0.75 -0.15
## 131  1.807  0.75  0.15
## 132  2.679  0.75  0.95
## 133  0.927  0.75  1.05
## 134  2.551  0.75  1.15
## 135  0.746  0.75  1.25
## 136  1.761  0.85 -0.85
## 137  0.798  0.85 -0.65
## 138  1.926  0.85 -0.55
## 139  0.826  0.85 -0.45
## 140  2.570  0.85 -0.25
## 141  0.923  0.85 -0.15
## 142  1.671  0.85 -0.05
## 143  0.818  0.85  0.95
## 144  0.774  0.85  1.15
## 145  0.770  0.95 -0.75
## 146  1.576  0.95 -0.65
## 147  0.733  0.95 -0.55
## 148  1.654  0.95 -0.45
## 149  4.020  0.95 -0.35
## 150  1.855  0.95 -0.25
## 151  1.540  1.05 -0.75
## 152  0.802  1.05 -0.65
## 153  2.753  1.05 -0.55
## 154  1.490  1.05 -0.45
## 155  1.586  1.05 -0.35
## 156  2.697  1.05 -0.25
## 157  0.911  1.05 -0.15
## 158  1.662  1.15 -0.65
## 159  0.781  1.15 -0.45
## 160  0.883  1.15 -0.25
pipeline$dst
## D-Stream 
## Class: DSC_DStream, DSC_Micro, DSC_R, DSC 
## Number of micro-clusters: 160 
## Number of macro-clusters: 13

Acknowledgements

The development of the stream package was supported in part by NSF IIS-0948893, NSF CMMI 1728612, and NIH R21HG005912.

References

Copy Link

Version

Install

install.packages('stream')

Monthly Downloads

899

Version

2.0-1

License

GPL-3

Issues

Pull Requests

Stars

Forks

Maintainer

Michael Hahsler

Last Published

February 28th, 2023

Functions in stream (2.0-1)

DSC

Data Stream Clustering Base Class
DSC_DStream

D-Stream Data Stream Clustering Algorithm
DSC_DBSCAN

DBSCAN Macro-clusterer
DSAggregate_Window

Sliding Window (Data Stream Operator)
DSC_DBSTREAM

DBSTREAM Clustering Algorithm
DSAggregate_Sample

Sampling from a Data Stream (Data Stream Operator)
DSAggregate

Data Stream Aggregator Base Classes
DSC_BIRCH

Balanced Iterative Reducing Clustering using Hierarchies
DSC_BICO

BICO - Fast computation of k-means coresets in a data stream
DSC_EA

Reclustering using an Evolutionary Algorithm
DSC_Sample

Extract a Fixed-size Sample from a Data Stream
DSD_MG

DSD Moving Generator
DSD_Gaussians

Mixture of Gaussians Data Stream Generator
DSD

Data Stream Data Generator Base Classes
DSD_BarsAndGaussians

Data Stream Generator for Bars and Gaussians
DSC_Reachability

Reachability Micro-Cluster Reclusterer
DSC_R

Abstract Class for Implementing R-based Clusterers
DSD_Target

Target Data Stream Generator
DSD_UniformNoise

Uniform Noise Data Stream Generator
DSD_Benchmark

Data Stream Generator for Dynamic Data Stream Benchmarks
DSD_Cubes

Static Cubes Data Stream Generator
DSC_SlidingWindow

DSC_SlidingWindow -- Data Stream Clusterer Using a Sliding Window
DSC_Hierarchical

Hierarchical Micro-Cluster Reclusterer
DSClassifier_SlidingWindow

DSClassifier_SlidingWindow -- Data Stream Classifier Using a Sliding Window
DSClassifier

Abstract Class for Data Stream Classifiers
DSC_Kmeans

Kmeans Macro-clusterer
DSD_Memory

A Data Stream Interface for Data Stored in Memory
DSF_Scale

Scale a Data Stream
DSF_dplyr

Apply a dplyr Transformation to a Data Stream
DSD_Mixture

Mixes Data Points from Several Streams into a Single Stream
DSC_Window

A sliding window from a Data Stream
DSC_Micro

Abstract Class for Micro Clusterers (Online Component)
DSC_evoStream

evoStream - Evolutionary Stream Clustering
DSC_Macro

Abstract Class for Macro Clusterers (Offline Component)
DSD_ReadStream

Read a Data Stream from a File or a Connection
DSD_ScaleStream

Deprecated DSD_ScaleStream
DSC_Static

Create as Static Copy of a Clustering
DSOutlier

Abstract Class for Data Stream Outlier Detectors
DSC_TwoStage

TwoStage Clustering Process
DSD_mlbenchData

Stream Interface for Data Sets From mlbench
DSF_Convolve

Apply a Filter to a Data Stream
animate_data

Animates the Plotting of a Data Streams
close_stream

Close a Data Stream
animate_cluster

Animates Plots of the Clustering Process
agreement

Agreement-based Measures for Clustering
DST_Multi

Apply Multiple Task to the Same Data Stream
DSFP

Abstract Class for Frequent Pattern Mining Algorithms for Data Streams
DST_WriteStream

Task to Write a Stream to a File or a Connection
DSD_mlbenchGenerator

mlbench Data Stream Generator
DST_SlidingWindow

DST_SlidingWindow -- Call R Functions on a Sliding Window
DSRegressor

Abstract Class for Data Stream Regressors
DSF

Data Stream Filter Base Classes
MGC

Moving Generator Cluster
DSF_Downsample

Downsample a Data Stream
predict

Make a Prediction for a Data Stream Mining Task
evaluate.DSC

Evaluate a Stream Clustering Task
prune_clusters

Prune Clusters from a Clustering
stream_pipeline

Create a Data Stream Pipeline
plot.DSC

Plot Results of a Data Stream Clustering
update

Update a Data Stream Mining Task Model with Points from a Stream
plot.DSD

Plot Data Stream Data
write_stream

Write a Data Stream to a File
evaluate

Evaluate a Data Stream Mining Task
DSF_Func

Apply a Function to Transformation to a Data Stream
DSRegressor_SlidingWindow

DSRegressor_SlidingWindow -- Data Stream Regressor Using a Sliding Window
DSD_NULL

Placeholder for a DSD Stream
DSD_ReadDB

Read a Data Stream from an open DB Query
DSF_ExponentialMA

Exponential Moving Average over a Data Stream
get_assignment

DST

Conceptual Base Class for All Data Stream Mining Tasks
get_points

Get Points from a Data Stream Generator
stream-package

stream: Infrastructure for Data Stream Mining
reset_stream

Reset a Data Stream to its Beginning
read_saveDSC

Save and Read DSC Objects
recluster

Re-clustering micro-clusters