Extremes: Kth Smallest/Largest Values

Description

Find the k smallest or k largest values of a vector x and return them, optionally together with their frequencies.

Usage

Small(x, k = 5, unique = FALSE, na.last = NA)
Large(x, k = 5, unique = FALSE, na.last = NA)
HighLow(x, nlow = 5, nhigh = nlow, na.last = NA)

Value

If unique is FALSE: a vector with the k most extreme values.

If unique is TRUE: a list containing the k most extreme values and their frequencies.

Arguments

x: a numeric vector
k: an integer > 0 defining how many extreme values should be returned. Default is k = 5. If k > length(x), all values are returned.
unique: logical, defining whether only unique values should be considered. If TRUE, a list with the k extreme values and their frequencies is returned. Default is FALSE (as determining unique values is rather expensive).
na.last: for controlling the treatment of NAs. If TRUE, missing values in the data are put last; if FALSE, they are put first; if NA, they are removed.
nlow: a single integer. The number of smallest elements of the vector to be printed. Defaults to 5.
nhigh: a single integer. The number of largest elements of the vector to be printed. Defaults to nlow.
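
The wording of na.last matches the argument of the same name in base R's sort(). Assuming the functions here follow the same convention (an assumption, not something this page confirms), the three settings can be illustrated with base R alone:

```r
# Base R only; illustrates the na.last convention as used by sort().
x <- c(3, NA, 1, 2)

sort(x, na.last = TRUE)   # missing values are put last
sort(x, na.last = FALSE)  # missing values are put first
sort(x, na.last = NA)     # missing values are removed
```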

Author

Andri Signorell <andri@signorell.net>
C++ parts by Nathan Russell and Romain Francois

Details

At first sight this does not seem to be a difficult problem. We could simply tabulate and sort the vector and then take the first or last k values. However, sorting and tabulating the whole vector when we are only interested in a few of the smallest or largest values is a considerable waste of resources. This approach already becomes impractical for medium vector lengths (~10⁵). Several approaches to this problem have been discussed elsewhere (see References). The present implementation is based on highly efficient C++ code and has proved to be very fast.
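
To make the contrast concrete, here is a minimal base R sketch of the two strategies for the non-unique case. It is not the C++ code the package uses; the helper names naive_small, partial_small and partial_large are hypothetical, and the sketch assumes a non-empty x without missing values. The partial variants rely on sort(x, partial = ...), which only guarantees that the element at the given position is in its final place, so the full vector is never completely sorted.

```r
# Hypothetical helpers for illustration; assume length(x) > 0 and no NAs.

# Naive: fully sort the vector, then take the first k values.
naive_small <- function(x, k) {
  k <- min(k, length(x))
  sort(x)[seq_len(k)]
}

# Partial sort: only position k is guaranteed to be in place, with all
# smaller values before it; the k candidates are then sorted on their own.
partial_small <- function(x, k) {
  k <- min(k, length(x))
  sort(sort(x, partial = k)[seq_len(k)])
}

# Same idea for the upper end: fix position n - k + 1 and keep the tail.
partial_large <- function(x, k) {
  n <- length(x)
  k <- min(k, n)
  sort(sort(x, partial = n - k + 1)[(n - k + 1):n])
}

set.seed(1)
x <- runif(1e5)
partial_small(x, 3)
partial_large(x, 3)
```

Both variants return the selected values in ascending order, so the results can be compared directly with head(sort(x), k) and tail(sort(x), k).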

HighLow combines the two functions above and reports the nlow smallest and nhigh largest values together with their frequencies in parentheses. It is used for describing univariate variables and is useful for checking the ends of the vector, where wrong values often accumulate in real data. In essence, it is a printing routine for the lowest and the highest values of x.
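
The idea of reporting both ends with frequencies in parentheses can be sketched in base R as follows. The function high_low_sketch and its output layout are hypothetical and only meant to illustrate the concept; the actual formatting of HighLow may differ, and this sketch uses a full sort rather than the fast approach described above.

```r
# Hypothetical helper for illustration; not the package's implementation.
high_low_sketch <- function(x, nlow = 5, nhigh = nlow) {
  x <- x[!is.na(x)]                       # drop missing values
  vals <- sort(unique(x))                 # distinct values, ascending
  freq <- tabulate(match(x, vals), nbins = length(vals))
  fmt <- paste0(vals, " (", freq, ")")    # "value (frequency)"
  paste0("lowest : ", paste(head(fmt, nlow), collapse = ", "), "\n",
         "highest: ", paste(tail(fmt, nhigh), collapse = ", "), "\n")
}

cat(high_low_sketch(c(1, 1, 2, 3, 3, 3, 9), nlow = 2))
```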

References

https://stackoverflow.com/questions/36993935/find-the-largest-n-unique-values-and-their-frequencies-in-r-and-rcpp/

https://gallery.rcpp.org/articles/top-elements-from-vectors-using-priority-queue/

Examples

library(DescTools)  # provides Small(), Large(), HighLow() and the d.pizza data

x <- sample(1:10, 1000, replace = TRUE)
Large(x, 3)
Large(x, k = 3, unique = TRUE)

# works fine for vectors up to length ~ 1e6
x <- runif(1000000)
Small(x, 3, unique = TRUE)
Small(x, 3, unique = FALSE)

# Both ends
cat(HighLow(d.pizza$temperature, na.last=NA))
