approxQuantile: Calculates the approximate quantiles of a numerical column of a SparkDataFrame
Description
Calculates the approximate quantiles of a numerical column of a SparkDataFrame.
The result of this algorithm has the following deterministic bound:
If the SparkDataFrame has N elements and if we request the quantile at probability p up to
error err, then the algorithm will return a sample x from the SparkDataFrame so that the
*exact* rank of x is close to (p * N). More precisely,
floor((p - err) * N) <= rank(x) <= ceil((p + err) * N).
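As a worked illustration of the bound above, the following base R sketch computes the admissible rank interval for given N, p and err, and lists which elements of a small vector fall inside it. The helper name `rank_bounds` and the sample values are hypothetical. Rank is taken here as the 1-based position in sorted order, which is an assumption. The inputs were chosen to be exactly representable in binary floating point so the arithmetic is exact.

```r
# Hypothetical helper: admissible rank interval from the documented bound
#   floor((p - err) * N) <= rank(x) <= ceil((p + err) * N)
rank_bounds <- function(n, p, err) {
  c(lower = floor((p - err) * n), upper = ceiling((p + err) * n))
}

# Illustrative data: 1000 distinct values, so the value i has rank i
# (assuming 1-based rank in sorted order).
values <- 1:1000
n <- length(values)

# Median (p = 0.5) with a deliberately large error of 0.125.
b <- rank_bounds(n, p = 0.5, err = 0.125)

# Any element whose rank lies in [lower, upper] satisfies the bound.
r <- rank(values)
acceptable <- values[r >= b[["lower"]] & r <= b[["upper"]]]

print(b)                  # lower = 375, upper = 625
print(range(acceptable))  # 375 625
```

With a smaller err the interval narrows around p * N, and with err = 0 it collapses to the ranks adjacent to p * N.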
This method implements a variation of the Greenwald-Khanna algorithm (with some speed
optimizations). The algorithm was first presented in "Space-efficient Online Computation of
Quantile Summaries" by Greenwald and Khanna (http://dx.doi.org/10.1145/375663.375670).
Usage
# S4 method for SparkDataFrame,character,numeric,numeric
approxQuantile(x, col, probabilities, relativeError)
Arguments
x
A SparkDataFrame.
col
The name of the numerical column.
probabilities
A numeric vector of quantile probabilities. Each value must lie in [0, 1].
For example, 0 is the minimum, 0.5 is the median, and 1 is the maximum.
relativeError
The relative target precision to achieve (>= 0). If set to zero,
the exact quantiles are computed, which could be very expensive.
Note that values greater than 1 are accepted but give the same result as 1.
Value
The approximate quantiles at the given probabilities.
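Examples
A minimal usage sketch, assuming a local Spark installation with the SparkR package available. It uses R's built-in `faithful` dataset purely for illustration, and the variable names are arbitrary.

```r
library(SparkR)

# Start (or reuse) a Spark session; assumes Spark is installed locally.
sparkR.session()

# Turn a built-in R data.frame into a SparkDataFrame (illustrative data only).
df <- createDataFrame(faithful)

# Request the quartiles of the "waiting" column with a 1% relative error.
quartiles <- approxQuantile(df, "waiting", c(0.25, 0.5, 0.75), 0.01)

# With relativeError = 0 the exact quantile is computed, at higher cost.
exact_median <- approxQuantile(df, "waiting", 0.5, 0)

sparkR.session.stop()
```

On large data a small positive relativeError is usually the cheaper choice; zero is best kept for cases where the exact quantile is required.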