These functions compute the 'coverage coefficient' \(R_C\) for local principal curves, local principal points (i.e., kernel density estimates obtained through iterated mean shift), and other principal objects.
Rc(x,...)
# S3 method for lpc
Rc(x,...)
# S3 method for lpc.spline
Rc(x,...)
# S3 method for ms
Rc(x,...)
base.Rc(data, closest.coords, type="curve")
x: An object used to select a method.
...: Further arguments passed to or from other methods (not needed yet).
data: A data matrix.
closest.coords: A matrix of coordinates of the projected data.
type: For principal curves, leave at the default "curve". For principal points, set to "points".
J. Einbeck.
Rc computes the coverage coefficient \(R_C\), a quantity which
estimates the goodness-of-fit of a fitted principal object. This
quantity can be interpreted similarly to the coefficient of determination in
regression analysis: values close to 1 indicate a good fit, while values
close to 0 indicate a 'bad' fit (corresponding to linear PCA).
For objects of type lpc, lpc.spline, and ms, S3 methods are available
which use the generic function Rc. This, in turn, calls the base function
base.Rc, which can also be used manually if the fitted object is of
another class.
In principle, function base.Rc can be used for assessing
goodness-of-fit of any principal object provided that
the coordinates (closest.coords) of the projected data are
available. For instance, for HS principal curves fitted via
princurve, this information is contained in component $s,
and for a k-means object, say fitk, this information can be
obtained via fitk$centers[fitk$cluster,]. Set type="points" in
the latter case.
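The k-means case described above could be sketched as follows. This is an illustrative example only: the simulated data and the names X, closest, and rc_kmeans are made up for the illustration, and the package name LPCM is assumed. The call to base.Rc is skipped if that package is not installed.

```r
# Simulated two-cluster data (illustrative only, not from the package)
set.seed(1)
X <- rbind(matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 5, sd = 0.3), ncol = 2))

# Fit k-means with base R; nstart guards against a poor random start
fitk <- kmeans(X, centers = 2, nstart = 5)

# Each observation's "projection" is the centre of its assigned cluster
closest <- fitk$centers[fitk$cluster, ]

# base.Rc() is the function documented on this page (package assumed: LPCM)
if (requireNamespace("LPCM", quietly = TRUE)) {
  rc_kmeans <- LPCM::base.Rc(X, closest, type = "points")
  print(rc_kmeans)
}
```

Here closest has one row per observation, which is the shape expected for closest.coords.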
The function Rc attempts to compute all missing information, so
the less informative the given object x is, the longer the
computation will take. Note also that Rc looks up the option scaled
in the fitted object and accounts for the scaling automatically.
Important: If the data were scaled, then do NOT unscale the results by
hand in order to feed the unscaled version into base.Rc; this will
give a wrong result.
In terms of methodology, these functions compute \(R_C\) directly through the mean reduction of absolute residual length, rather than through the area above the coverage curve.
These functions currently do not account for observation weights, i.e., \(R_C\) is computed through the unweighted mean reduction in absolute residual length (even if weights have been used for the curve fitting).
In the clustering context, a value of \(R_C=0.8\) means that, after the clustering, the mean absolute residual length has been reduced by \(80\%\) (compared to the distances to the overall mean).
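The clustering interpretation can be illustrated with a minimal base-R sketch. It is not the package's implementation: the helpers mean_length and rc_points_sketch are hypothetical, residual length is assumed to mean Euclidean distance, and the data are simulated. It only mirrors the description above, i.e., one minus the ratio of the mean residual length after clustering to the mean distance from the overall mean.

```r
# Hypothetical helper: mean Euclidean length of the rows of a matrix
mean_length <- function(m) mean(sqrt(rowSums(m^2)))

# Hypothetical sketch of the "points" case; NOT the package's base.Rc()
rc_points_sketch <- function(data, closest.coords) {
  data <- as.matrix(data)
  closest.coords <- as.matrix(closest.coords)
  # Baseline: every observation is represented by the overall mean
  centre <- matrix(colMeans(data), nrow(data), ncol(data), byrow = TRUE)
  # Relative reduction in mean residual length
  1 - mean_length(data - closest.coords) / mean_length(data - centre)
}

# Simulated, well-separated clusters (illustrative only)
set.seed(1)
X <- rbind(matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 5, sd = 0.3), ncol = 2))
fitk <- kmeans(X, centers = 2, nstart = 5)

rc_val <- rc_points_sketch(X, fitk$centers[fitk$cluster, ])
print(rc_val)
```

Under these assumptions, representing every observation by the overall mean gives a value of 0, and tight, well-separated clusters give a value close to 1.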
Einbeck, Tutz, and Evers (2005). Local principal curves. Statistics and Computing 15, 301-313.
Einbeck (2011). Bandwidth selection for nonparametric unsupervised learning techniques -- a unified approach via self-coverage. Journal of Pattern Recognition Research 6, 175-192.
lpc.spline, ms, coverage.
data(calspeedflow)
lpc1 <- lpc.spline(lpc(calspeedflow[,3:4]), project=TRUE)
Rc(lpc1)
# is the same as:
base.Rc(lpc1$lpcobject$data, lpc1$closest.coords)
ms1 <- ms(calspeedflow[,3:4], plot=FALSE)
Rc(ms1)
# is the same as:
base.Rc(ms1$data, ms1$cluster.center[ms1$closest.label,], type="points")