Last chance! 50% off unlimited learning
Sale ends in
The (S3) generic function density
computes kernel density
estimates. Its default method does so with the given kernel and
bandwidth for univariate observations.
density(x, …)
# S3 method for default
density(x, bw = "nrd0", adjust = 1,
kernel = c("gaussian", "epanechnikov", "rectangular",
"triangular", "biweight",
"cosine", "optcosine"),
weights = NULL, window = kernel, width,
give.Rkern = FALSE,
n = 512, from, to, cut = 3, na.rm = FALSE, …)
the data from which the estimate is to be computed. For the default method a numeric vector: long vectors are not supported.
the smoothing bandwidth to be used. The kernels are scaled such that this is the standard deviation of the smoothing kernel. (Note this differs from the reference books cited below, and from S-PLUS.)
bw
can also be a character string giving a rule to choose the
bandwidth. See bw.nrd
. The default,
"nrd0"
, has remained the default for historical and
compatibility reasons, rather than as a general recommendation,
where e.g., "SJ"
would rather fit, see also Venables and
Ripley (2002).
The specified (or computed) value of bw
is multiplied by
adjust
.
the bandwidth used is actually adjust*bw
.
This makes it easy to specify values like ‘half the default’
bandwidth.
a character string giving the smoothing kernel
to be used. This must partially match one of "gaussian"
,
"rectangular"
, "triangular"
, "epanechnikov"
,
"biweight"
, "cosine"
or "optcosine"
, with default
"gaussian"
, and may be abbreviated to a unique prefix (single
letter).
"cosine"
is smoother than "optcosine"
, which is the
usual ‘cosine’ kernel in the literature and almost MSE-efficient.
However, "cosine"
is the version used by S.
numeric vector of non-negative observation weights,
hence of same length as x
. The default NULL
is
equivalent to weights = rep(1/nx, nx)
where nx
is the
length of (the finite entries of) x[]
.
this exists for compatibility with S; if given, and
bw
is not, will set bw
to width
if this is a
character string, or to a kernel-dependent multiple of width
if this is numeric.
logical; if true, no density is estimated, and
the ‘canonical bandwidth’ of the chosen kernel
is returned
instead.
the left and right-most points of the grid at which the
density is to be estimated; the defaults are cut * bw
outside
of range(x)
.
by default, the values of from
and to
are
cut
bandwidths beyond the extremes of the data. This allows
the estimated density to drop to approximately zero at the extremes.
logical; if TRUE
, missing values are removed
from x
. If FALSE
any missing values cause an error.
further arguments for (non-default) methods.
If give.Rkern
is true, the number "density"
whose
underlying structure is a list containing the following components.
the n
coordinates of the points where the density is
estimated.
the estimated density values. These will be non-negative, but can be zero.
the bandwidth used.
the sample size after elimination of missing values.
the call which produced the result.
the deparsed name of the x
argument.
logical, for compatibility (always FALSE
).
The print method reports summary values on the x and y components.
The algorithm used in density.default
disperses the mass of the
empirical distribution function over a regular grid of at least 512
points and then uses the fast Fourier transform to convolve this
approximation with a discretized version of the kernel and then uses
linear approximation to evaluate the density at the specified points.
The statistical properties of a kernel are determined by
bw
is the standard deviation of the kernel) and
give.Rkern = TRUE
. See the examples for using exact equivalent
bandwidths.
Infinite values in x
are assumed to correspond to a point mass at
+/-Inf
and the density estimate is of the sub-density on
(-Inf, +Inf)
.
Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988). The New S Language. Wadsworth & Brooks/Cole (for S version).
Scott, D. W. (1992). Multivariate Density Estimation. Theory, Practice and Visualization. New York: Wiley.
Sheather, S. J. and Jones, M. C. (1991). A reliable data-based bandwidth selection method for kernel density estimation. Journal of the Royal Statistical Society series B, 53, 683--690. http://www.jstor.org/stable/2345597.
Silverman, B. W. (1986). Density Estimation. London: Chapman and Hall.
Venables, W. N. and Ripley, B. D. (2002). Modern Applied Statistics with S. New York: Springer.
# NOT RUN {
require(graphics)
plot(density(c(-20, rep(0,98), 20)), xlim = c(-4, 4)) # IQR = 0
# The Old Faithful geyser data
d <- density(faithful$eruptions, bw = "sj")
d
plot(d)
plot(d, type = "n")
polygon(d, col = "wheat")
## Missing values:
x <- xx <- faithful$eruptions
x[i.out <- sample(length(x), 10)] <- NA
doR <- density(x, bw = 0.15, na.rm = TRUE)
lines(doR, col = "blue")
points(xx[i.out], rep(0.01, 10))
## Weighted observations:
fe <- sort(faithful$eruptions) # has quite a few non-unique values
## use 'counts / n' as weights:
dw <- density(unique(fe), weights = table(fe)/length(fe), bw = d$bw)
utils::str(dw) ## smaller n: only 126, but identical estimate:
stopifnot(all.equal(d[1:3], dw[1:3]))
## simulation from a density() fit:
# a kernel density fit is an equally-weighted mixture.
fit <- density(xx)
N <- 1e6
x.new <- rnorm(N, sample(xx, size = N, replace = TRUE), fit$bw)
plot(fit)
lines(density(x.new), col = "blue")
(kernels <- eval(formals(density.default)$kernel))
## show the kernels in the R parametrization
plot (density(0, bw = 1), xlab = "",
main = "R's density() kernels with bw = 1")
for(i in 2:length(kernels))
lines(density(0, bw = 1, kernel = kernels[i]), col = i)
legend(1.5,.4, legend = kernels, col = seq(kernels),
lty = 1, cex = .8, y.intersp = 1)
## show the kernels in the S parametrization
plot(density(0, from = -1.2, to = 1.2, width = 2, kernel = "gaussian"),
type = "l", ylim = c(0, 1), xlab = "",
main = "R's density() kernels with width = 1")
for(i in 2:length(kernels))
lines(density(0, width = 2, kernel = kernels[i]), col = i)
legend(0.6, 1.0, legend = kernels, col = seq(kernels), lty = 1)
##-------- Semi-advanced theoretic from here on -------------
# }
# NOT RUN {
<!-- %% i.e. "secondary example" in a new help system ... -->
# }
# NOT RUN {
(RKs <- cbind(sapply(kernels,
function(k) density(kernel = k, give.Rkern = TRUE))))
100*round(RKs["epanechnikov",]/RKs, 4) ## Efficiencies
bw <- bw.SJ(precip) ## sensible automatic choice
plot(density(precip, bw = bw),
main = "same sd bandwidths, 7 different kernels")
for(i in 2:length(kernels))
lines(density(precip, bw = bw, kernel = kernels[i]), col = i)
## Bandwidth Adjustment for "Exactly Equivalent Kernels"
h.f <- sapply(kernels, function(k)density(kernel = k, give.Rkern = TRUE))
(h.f <- (h.f["gaussian"] / h.f)^ .2)
## -> 1, 1.01, .995, 1.007,... close to 1 => adjustment barely visible..
plot(density(precip, bw = bw),
main = "equivalent bandwidths, 7 different kernels")
for(i in 2:length(kernels))
lines(density(precip, bw = bw, adjust = h.f[i], kernel = kernels[i]),
col = i)
legend(55, 0.035, legend = kernels, col = seq(kernels), lty = 1)
# }
Run the code above in your browser using DataLab