collapse (version 1.8.9)

descr: Detailed Statistical Description of Data Frame

Description

descr offers a concise description of each variable in a data frame. It is built as a wrapper around qsu, but also computes frequency tables for categorical variables, and quantiles and the number of distinct values for numeric variables.

Usage

descr(X, Ndistinct = TRUE, higher = TRUE, table = TRUE, sort.table = "freq",
      Qprobs = c(0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.99),
      cols = NULL, label.attr = "label", stepwise = FALSE, ...)

# S3 method for class 'descr'
x[...]

# S3 method for class 'descr'
as.data.frame(x, ...)

# S3 method for class 'descr'
print(x, n = 14, perc = TRUE, digits = 2, t.table = TRUE, summary = TRUE,
      reverse = FALSE, stepwise = FALSE, ...)

Value

A 2-level nested list. The top level holds one element per variable; each element is itself a list containing the variable's class, its label, the basic statistics, and the quantiles or frequency table computed for it. The object has class 'descr' and carries three attributes: 'N', the number of observations in the dataset; 'arstat', indicating whether the object contains arrays of statistics; and 'table', indicating whether table = TRUE (i.e. the object could contain tables for categorical variables).
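A minimal sketch of how the returned object might be inspected, assuming collapse is installed. The element names inside each variable's list are printed with str rather than assumed, since they may differ between package versions.

```r
library(collapse)

d <- descr(iris)

class(d)            # the object carries class 'descr'
attr(d, "N")        # number of observations in the dataset
attr(d, "arstat")   # whether arrays of statistics are stored
attr(d, "table")    # whether table = TRUE was used

names(d)                        # one top-level element per variable
str(d$Species, max.level = 1)   # class, label, statistics and table for one variable

d_sub <- d["Species"]           # `[` method: select variables by name or index
names(d_sub)
```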

Arguments

X

a data frame or list of atomic vectors. Atomic vectors, matrices or arrays can be passed but will first be coerced to data frame using qDF.

Ndistinct

logical. TRUE (default) computes the number of distinct values on all variables using fndistinct.

higher

logical. Argument is passed down to qsu: TRUE (default) computes the skewness and the kurtosis.

table

logical. TRUE (default) computes a (sorted) frequency table for all categorical variables (excluding Date variables).

sort.table

an integer or character string specifying how the frequency table should be presented:

Int.   String    Description
1      "value"   sort table by values.
2      "freq"    sort table by frequencies.
3      "none"    return table in first-appearance order of values, or levels for factors (most efficient).

Qprobs

double. Probabilities for quantiles to compute on numeric variables, passed down to quantile. If something non-numeric is passed (e.g. NULL, FALSE, NA, ""), no quantiles are computed.

cols

select columns to describe using column names, indices, a logical vector or a function (e.g. is.numeric).

label.attr

character. The name of a label attribute to display for each variable (if variables are labeled).

...

for descr: other arguments passed to qsu.default. For [.descr: variable names or indices passed to [.list. The argument is unused in the print and as.data.frame methods.

x

an object of class 'descr'.

n

integer. The maximum number of table elements to print for categorical variables. If the number of distinct elements is <= n, the whole table is printed. Otherwise the remaining items are grouped into an '... %s Others' category.

perc

logical. TRUE (default) adds percentages to the frequencies in the table for categorical variables.

digits

integer. The number of decimals to print in statistics and percentage tables.

t.table

logical. TRUE (default) prints a transposed table.

summary

logical. TRUE (default) computes and displays a summary of the frequencies, if the size of the table for a categorical variable exceeds n.

reverse

logical. TRUE prints contents in reverse order, starting with the last column, so that the dataset can be analyzed by scrolling up the console after calling descr.

stepwise

logical. TRUE prints one variable at a time. The user needs to press [enter] to see the printout for the next variable. If called from descr, the computation is also done one variable at a time, and the finished 'descr' object is returned invisibly. This is recommended for larger datasets, where precomputing the statistics for all variables can be time consuming.
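The arguments above can be combined. The following is a small sketch on the iris data, not an exhaustive tour; the object names (d_num, d_noq, d_cat) are chosen only for illustration.

```r
library(collapse)

# Describe only the numeric columns, with quartiles and without skewness/kurtosis
d_num <- descr(iris, cols = is.numeric, Qprobs = c(0.25, 0.5, 0.75), higher = FALSE)
names(d_num)
print(d_num, digits = 1)

# A non-numeric Qprobs (here NULL) skips the quantiles
d_noq <- descr(iris, cols = is.numeric, Qprobs = NULL)

# Describe one categorical column, with its frequency table sorted by value
d_cat <- descr(iris, cols = "Species", sort.table = "value")

# Print at most 2 table elements and use a non-transposed table
print(d_cat, n = 2, t.table = FALSE)
```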

Details

descr was heavily inspired by Hmisc::describe, but computes about 10x faster. The performance is comparable to summary. descr was built as a wrapper around qsu, to enrich the set of statistics computed by qsu for both numeric and categorical variables.

qsu itself is about 10x faster again than descr, and is optimized for grouped, panel-data and weighted statistics. descr can also compute grouped, panel-data and/or weighted statistics: pass group ids to g, panel ids to pid, or a weight vector to w. These arguments are handed down to qsu.default and only affect the statistics natively computed by qsu. For example, passing a weight vector produces a weighted mean, sd, skewness and kurtosis, but not weighted quantiles.
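A hedged sketch of the weighting behaviour described above. The weights are random and purely illustrative, and the element names Stats and Quant are an assumption about how version 1.8.9 names the entries of each variable's list.

```r
library(collapse)

set.seed(1)
w <- runif(nrow(iris))   # hypothetical weights, for illustration only

d_u <- descr(iris, cols = "Sepal.Length")          # unweighted
d_w <- descr(iris, cols = "Sepal.Length", w = w)   # w is handed down to qsu.default

d_w$Sepal.Length$Stats   # mean, sd, skewness and kurtosis are weighted
d_w$Sepal.Length$Quant   # quantiles are computed without weights

weighted.mean(iris$Sepal.Length, w)   # reference value for the weighted mean
```

Comparing d_u and d_w side by side shows which statistics respond to the weights.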

The list object returned by descr can be converted to a tidy data frame using as.data.frame. This representation does not include the frequency tables computed for categorical variables. The method also cannot handle arrays of statistics, which arise when the g or pid arguments are passed to descr; in that case as.data.frame.descr throws an appropriate error.
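The conversion and its limitation can be sketched as follows, using datasets shipped with R and collapse. The tryCatch wrapper is only there to show the error message without stopping the script.

```r
library(collapse)

# Without frequency tables: a tidy data frame of statistics
df <- as.data.frame(descr(wlddev, table = FALSE))
head(df)

# Grouped statistics are stored as arrays, so the conversion is expected to fail
res <- tryCatch(as.data.frame(descr(iris, g = iris$Species)),
                error = function(e) conditionMessage(e))
res
```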

See Also

qsu, pwcor, Summary Statistics, Fast Statistical Functions, Collapse Overview

Examples

## Standard Use
descr(iris)
descr(wlddev)
descr(GGDC10S)

# Some useful print options (also try stepwise argument)
print(descr(GGDC10S), reverse = TRUE, t.table = FALSE)
# For bigger data consider: descr(big_data, stepwise = TRUE)

# Generating a data frame
as.data.frame(descr(wlddev, table = FALSE))

## Passing Arguments down to qsu.default: For Panel Data Statistics
descr(iris, pid = iris$Species)
descr(wlddev, pid = wlddev$iso3c)

## Grouped Statistics
descr(iris, g = iris$Species)
descr(GGDC10S, g = GGDC10S$Region)
