ahist: Adaptive Histograms

Description

Generate or plot histograms adaptive to patterns in univariate data. The number and widths of histogram bins are automatically calculated based on an optimal univariate clustering of input data. As a result, the bins are generally not of equal width.

Usage

ahist(x, k = c(1,9), breaks=NULL, data=NULL, weight=1,
      plot = TRUE, xlab = deparse(substitute(x)),
      wlab = deparse(substitute(weight)),
      main = NULL, col = "lavender", border = graphics::par("fg"),
      lwd = graphics::par("lwd"),
      col.stick = "gray", lwd.stick = 1, add.sticks=TRUE,
      style = c("discontinuous", "midpoints"),
      skip.empty.bin.color=TRUE,
      ...)

Value

An object of class histogram defined in hist. It has an S3 plot method plot.histogram.
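Because the return value is an ordinary histogram object, it can be computed once and inspected or plotted later. The following is a minimal sketch, assuming the Ckmeans.1d.dp package is installed; the simulated data and the seed are illustrative only.

```r
library(Ckmeans.1d.dp)

set.seed(1)  # illustrative seed, not part of the original documentation
x <- c(rnorm(30, mean = 0, sd = 0.2),
       rnorm(30, mean = 5, sd = 0.2))

# Compute the adaptive histogram without drawing it
h <- ahist(x, k = 2, plot = FALSE)

class(h)    # expected to include "histogram"
h$breaks    # bin boundaries, generally of unequal width
h$counts    # number of observations falling in each bin

# Draw it later through the S3 plot method for histogram objects
plot(h, main = "Replotted adaptive histogram")
```

The breaks, counts, density and mids components are those documented for hist.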

Arguments

x

a numeric vector of data or an object of class "Ckmeans.1d.dp".

If x is a numeric vector, all NA elements must be removed from x before calling this function.

If x is an object of class "Ckmeans.1d.dp", the clustering information in x will be used and the data argument contains the numeric vector to be plotted.

k

either an exact integer number of bins/clusters, or a vector of length two specifying the minimum and maximum numbers of bins/clusters to be examined. The default is c(1,9). When k is a range, the actual number of clusters is determined by Bayesian information criterion. This argument is ignored if x is an object of class "Ckmeans.1d.dp".

breaks

This argument is defined in hist. If this argument is provided, optimal univariate clustering is not applied to obtain the histogram, but instead the histogram will be generated by the hist function in graphics, except that sticks representing data can still be optionally plotted by specifying the add.sticks=TRUE argument.

data

a numeric vector. If x is an object of class "Ckmeans.1d.dp", the data argument must be provided. If x is a numeric vector, this argument is ignored.

weight

a value of 1 to specify equal weights or a numeric vector of unequal weights for each element. The default weight is one. It is highly recommended to use positive (instead of zero) weights to account for the influence of every element. The weights have a strong impact on the clustering result.

plot

a logical. If TRUE, the histogram will be plotted.

xlab

a character string. The x-axis label for the plot.

wlab

a character string. The weight-axis label for the plot. It is the vertical axis to the right of the plot.

main

a character string. The title for the plot.

col

a character string. The fill color of the histogram bars.

border

a character string. The color of the histogram bar borders.

lwd

a numeric value. The line width of the border of the histogram bars.

col.stick

a character string. The color of the sticks above the x-axis. See Details.

lwd.stick

a numeric value. The line width of the sticks above the x-axis. See Details.

add.sticks

a logical. If TRUE (default), the sticks representing the data will be added to the bottom of the histogram. Otherwise, sticks are not plotted.

style

a character string. The style of the adaptive histogram. See Details.

skip.empty.bin.color

a logical. If TRUE (default), an empty bin (invisible) will be assigned the same bar color as the next bin. This is useful when all provided colors are to be used for non-empty bins. If FALSE, each bin will be assigned a bar color from col. A value of TRUE will coordinate the bar and stick colors.

...

additional arguments to be passed to hist or plot.histogram.

Author

Joe Song

Details

The histogram is by default plotted using the plot.histogram method. The plot can be optionally disabled with the plot=FALSE argument. The original input data are shown as sticks just above the horizontal axis.

If the breaks argument is not specified, the number of histogram bins is the optimal number of clusters estimated using the Bayesian information criterion evaluated on Gaussian mixture models fitted to the input data in x.
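The number of clusters selected this way can be inspected by running the clustering step directly. This is a sketch, assuming the Ckmeans.1d.dp package is installed; the simulated data and the seed are illustrative only.

```r
library(Ckmeans.1d.dp)

set.seed(42)  # illustrative seed
x <- c(rnorm(40, mean = -2, sd = 0.3),
       rnorm(45, mean = 1, sd = 0.1),
       rnorm(70, mean = 3, sd = 0.2))

# Search k over 1..9; the number of clusters is chosen by BIC
res <- Ckmeans.1d.dp(x, k = c(1, 9))
kopt <- length(res$size)   # number of clusters actually selected
kopt

# Passing the clustering object to ahist() reuses it instead of re-clustering;
# the raw data must then be supplied through the data argument
ahist(res, data = x, xlab = "x",
      main = paste("Number of clusters selected by BIC:", kopt))
```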

If the breaks argument is not provided, breaks in the histogram are derived from clusters identified by optimal univariate clustering (Ckmeans.1d.dp) in one of two styles. With the default "discontinuous" style, the bin width of each bar is determined according to a data-adaptive rule; the "midpoints" style uses the midpoints of cluster border points to determine the bin width. For clustered data, the "midpoints" style generates bins that are too wide to capture the cluster patterns. In contrast, the "discontinuous" style is more adaptive to the data because it allows some bins to be empty, which makes the histogram bars discontinuous.
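The idea behind the "midpoints" style can be illustrated with base R alone. The sketch below is not the package's internal code: the toy data and the cluster assignment are hypothetical and hand-written, and the breaks are built by placing each interior break halfway between the largest value of one cluster and the smallest value of the next.

```r
# Toy data with a hypothetical, hand-assigned clustering
x       <- c(1.0, 1.1, 1.2, 5.0, 5.3, 9.0, 9.1, 9.4)
cluster <- c(1,   1,   1,   2,   2,   3,   3,   3)

# Smallest and largest value in each cluster (the cluster border points)
lo <- as.numeric(tapply(x, cluster, min))
hi <- as.numeric(tapply(x, cluster, max))

# Interior breaks: midpoint between the upper border of one cluster
# and the lower border of the next cluster
k <- length(lo)
inner <- (hi[-k] + lo[-1]) / 2

breaks <- c(min(x), inner, max(x))
breaks

# One bin per cluster, with widths that follow the gaps in the data
h <- hist(x, breaks = breaks, plot = FALSE)
h$counts   # 3 2 3
```

Each bin here stretches across half of the empty gap on either side of its cluster, which is why this style can produce bins wider than the clusters themselves.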

Examples


# Example 1: plot an adaptive histogram from data generated by
#   a Gaussian mixture model with three components
x <- c(rnorm(40, mean=-2, sd=0.3),
       rnorm(45, mean=1, sd=0.1),
       rnorm(70, mean=3, sd=0.2))
ahist(x, col="lightblue", sub=paste("n =", length(x)),
      col.stick="salmon", lwd=2,
      main=paste("Example 1. Gaussian mixture model with 3 components",
                 "(one bin per component)", sep="\n"))


# Example 2: plot an adaptive histogram from data generated by
#   a Gaussian mixture model with three components using a given
#   number of bins
ahist(x, k=9, col="lavender", col.stick="salmon",
      sub=paste("n =", length(x)), lwd=2,
      main=paste("Example 2. Gaussian mixture model with 3 components",
                 "(on average 3 bins per component)", sep="\n"))

# Example 3: The DNase data frame has 176 rows and 3 columns of
#   data obtained during development of an ELISA assay for the
#   recombinant protein DNase in rat serum.

data(DNase)
res <- Ckmeans.1d.dp(DNase$density)
kopt <- length(res$size)
ahist(res, data=DNase$density, col=rainbow(kopt),
      col.stick=rainbow(kopt)[res$cluster],
      sub=paste("n =", length(DNase$density)), border="transparent",
      xlab="Optical density of protein DNase",
      main="Example 3. ELISA assay of DNase in rat serum")


# Example 4: Add sticks to histograms with the R provided
#   hist() function.

ahist(DNase$density, breaks="Sturges", col="palegreen",
      add.sticks=TRUE, col.stick="darkgreen",
      main=paste("Example 4. ELISA assay of DNase in rat serum",
                 "(Equal width bins)", sep="\n"),
      xlab="Optical density of protein DNase")

# Example 5: Weighted adaptive histograms

x <- sort(c(rnorm(40, mean=-2, sd=0.3),
       rnorm(45, mean=2, sd=0.1),
       rnorm(70, mean=4, sd=0.2)))

y <- (1 + sin(0.10 * seq_along(x))) * (x-1)^2

ahist(x, weight=y, sub=paste("n =", length(x)),
      col.stick="forestgreen", lwd.stick=0.25, lwd=2,
      main="Example 5. Weighted adaptive histogram")


# Example 6: Cluster data with repetitive elements

ahist(c(1,1,1,1, 3,4,4, 6,6,6), k=c(2,4), col="gray",
      lwd=2, lwd.stick=6, col.stick="chocolate",
      main=paste("Example 6. Adaptive histogram of",
                 "repetitive elements", sep="\n"))
