describe: Concise Statistical Description of a Vector, Matrix, Data Frame, or Formula

Description

describe is a generic method that invokes describe.data.frame, describe.matrix, describe.vector, or describe.formula. describe.vector is the basic function for handling a single variable. This function determines whether the variable is character, factor, category, binary, discrete numeric, and continuous numeric, and prints a concise statistical summary according to each. A numeric variable is deemed discrete if it has <= 10 distinct values. In this case, quantiles are not printed. A frequency table is printed for any non-binary variable if it has no more than 20 distinct values. For any variable for which the frequency table is not printed, the 5 lowest and highest values are printed. This behavior can be overriden for long character variables with many levels using the listunique parameter, to get a complete tabulation.

describe is especially useful for describing data frames created by *.get, as labels, formats, value labels, and (in the case of sas.get) frequencies of special missing values are printed.

For a binary variable, the sum (number of 1's) and mean (proportion of 1's) are printed. If the first argument is a formula, a model frame is created and passed to describe.data.frame. If a variable is of class "impute", a count of the number of imputed values is printed. If a date variable has an attribute partial.date (this is set up by sas.get), counts of how many partial dates are actually present (missing month, missing day, missing both) are also presented. If a variable was created by the special-purpose function substi (which substitutes values of a second variable if the first variable is NA), the frequency table of substitutions is also printed.

For numeric variables, describe adds an item called Info which is a relative information measure using the relative efficiency of a proportional odds/Wilcoxon test on the variable relative to the same test on a variable that has no ties. Info is related to how continuous the variable is, and ties are less harmful the more untied values there are. The formula for Info is one minus the sum of the cubes of relative frequencies of values divided by one minus the square of the reciprocal of the sample size. The lowest information comes from a variable having only one distinct value following by a highly skewed binary variable. Info is reported to two decimal places.

A latex method exists for converting the describe object to a LaTeX file. For numeric variables having at least 20 distinct values, describe saves in its returned object the frequencies of 100 evenly spaced bins running from minimum observed value to the maximum. latex inserts a spike histogram displaying these frequency counts in the tabular material using the LaTeX picture environment. For example output see http://biostat.mc.vanderbilt.edu/wiki/pub/Main/Hmisc/counties.pdf. Note that the latex method assumes you have the following styles installed in your latex installation: setspace and relsize.

The html method mimics the LaTeX output except for not including the spike histogram. This is useful in the context of Rmarkdown html and html notebook output.

The plot method is for describe objects run on data frames. It produces spike histograms for a graphic of continuous variables and a dot chart for categorical variables, showing category proportions. The graphic format is ggplot2 if the user has not set options(grType='plotly') or has set the grType option to something other than 'plotly'. Otherwise plotly graphics that are interactive are produced, and these can be placed into an Rmarkdown html notebook. The user must install the plotly package for this to work. When the use hovers the mouse over a bin for a raw data value, the actual value will pop-up (formatted using digits). When the user hovers over the minimum data value, most of the information calculated by describe will pop up. For each variable, the number of missing values is used to assign the color to the histogram or dot chart, and a legend is drawn. Color is not used if there are no missing values in any variable. For categorical variables, hovering over the leftmost point for a variable displays details, and for all points proportions, numerators, and denominators are displayed in the popup. If both continuous and categorical variables are present and which='both' is specified, the plot method returns an unclassed list containing two objects, named 'Categorical' and 'Continuous', in that order.

Sample weights may be specified to any of the functions, resulting in weighted means, quantiles, and frequency tables.

Note: As discussed in Cox and Longton (2008), Stata Technical Bulletin 8(4) pp. 557, the term "unique" has been replaced with "distinct" in the output (but not in parameter names).

When weights are not used, Gini's mean difference is computed for numeric variables. This is a robust measure of dispersion that is the mean absolute difference between any pairs of observations. In the output Gini's difference is labeled Gmd.

formatdescribeSingle is a service function for latex, html, and print methods for single variables that is not intended to be called by the user.

Usage

# S3 method for vector
describe(x, descript, exclude.missing=TRUE, digits=4,
         listunique=0, listnchar=12,
         weights=NULL, normwt=FALSE, minlength=NULL, …)
# S3 method for matrix
describe(x, descript, exclude.missing=TRUE, digits=4, …)
# S3 method for data.frame
describe(x, descript, exclude.missing=TRUE,
    digits=4, …)
# S3 method for formula
describe(x, descript, data, subset, na.action,
    digits=4, weights, …)
# S3 method for describe
print(x, …)
# S3 method for describe
latex(object, title=NULL,
      file=paste('describe',first.word(expr=attr(object,'descript')),'tex',sep='.'),
      append=FALSE, size='small', tabular=TRUE, greek=TRUE,
      spacing=0.7, lspace=c(0,0), …)
# S3 method for describe.single
latex(object, title=NULL, vname,
      file, append=FALSE, size='small', tabular=TRUE, greek=TRUE,
      lspace=c(0,0), …)
# S3 method for describe
html(object, size=85, tabular=TRUE,
      greek=TRUE, scroll=FALSE, rows=25, cols=100, …)
# S3 method for describe.single
html(object, size=85,
      tabular=TRUE, greek=TRUE, …)
formatdescribeSingle(x, condense=c('extremes', 'frequencies', 'both', 'none'),
           lang=c('plain', 'latex', 'html'), verb=0, lspace=c(0, 0),
           size=85, …)
# S3 method for describe
plot(x, which=c('both', 'continuous', 'categorical'),
                          what=NULL,
                          sort=c('ascending', 'descending', 'none'),
                          n.unique=10, digits=5, …)

Arguments

a data frame, matrix, vector, or formula. For a data frame, the describe.data.frame function is automatically invoked. For a matrix, describe.matrix is called. For a formula, describe.data.frame(model.frame(x)) is invoked. The formula may or may not have a response variable. For print, latex, html, or formatdescribeSingle, x is an object created by describe.

descript

optional title to print for x. The default is the name of the argument or the "label" attributes of individual variables. When the first argument is a formula, descript defaults to a character representation of the formula.

exclude.missing

set toTRUE to print the names of variables that contain only missing values. This list appears at the bottom of the printout, and no space is taken up for such variables in the main listing.

digits

number of significant digits to print. For plot.describe is the number of significant digits to put in hover text for plotly when showing raw variable values.

listunique

For a character variable that is not an mChoice variable, that has its longest string length greater than listnchar, and that has no more than listunique distinct values, all values are listed in alphabetic order. Any value having more than one occurrence has the frequency of occurrence after it, in parentheses. Specify listunique equal to some value at least as large as the number of observations to ensure that all character variables will have all their values listed. For purposes of tabulating character strings, multiple white spaces of any kind are translated to a single space, leading and trailing white space are ignored, and case is ignored.

listnchar

see listunique

weights

a numeric vector of frequencies or sample weights. Each observation will be treated as if it were sampled weights times.

minlength

value passed to summary.mChoice.

normwt

The default, normwt=FALSE results in the use of weights as weights in computing various statistics. In this case the sample size is assumed to be equal to the sum of weights. Specify normwt=TRUE to divide weights by a constant so that weights sum to the number of observations (length of vectors specified to describe). In this case the number of observations is taken to be the actual number of records given to describe.

object

a result of describe

title

unused

data

subset

na.action

These are used if a formula is specified. na.action defaults to na.retain which does not delete any NAs from the data frame. Use na.action=na.omit or na.delete to drop any observation with any NA before processing.

…

arguments passed to describe.default which are passed to calls to format for numeric variables. For example if using R POSIXct or Date date/time formats, specifying describe(d,format='%d%b%y') will print date/time variables as "01Jan2000". This is useful for omitting the time component. See the help file for format.POSIXct or format.Date for more information. For plot methods, … is ignored. For html and latex methods, … is used to pass optional arguments to formatdescribeSingle, especially the condense argument.

file

name of output file (should have a suffix of .tex). Default name is formed from the first word of the descript element of the describe object, prefixed by "describe". Set file="" to send LaTeX code to standard output instead of a file.

append

set to TRUE to have latex append text to an existing file named file

size

LaTeX text size ("small", the default, or "normalsize", "tiny", "scriptsize", etc.) for the describe output in LaTeX. For html is the percent of the prevailing font size to use for the output.

tabular

set to FALSE to use verbatim rather than tabular (or html table) environment for the summary statistics output. By default, tabular is used if the output is not too wide.

greek

By default, the latex and html methods will change names of greek letters that appear in variable labels to appropriate LaTeX symbols in math mode, or html symbols, unless greek=FALSE.

spacing

By default, the latex method for describe run on a matrix or data frame uses the setspace LaTeX package with a line spacing of 0.7 so as to no waste space. Specify spacing=0 to suppress the use of the setspace's spacing environment, or specify another positive value to use this environment with a different spacing.

lspace

extra vertical scape, in character size units (i.e., "ex" as appended to the space). When using certain font sizes, there is too much space left around LaTeX verbatim environments. This two-vector specifies space to remove (i.e., the values are negated in forming the vspace command) before (first element) and after (second element of lspace) verbatims

scroll

set to TRUE to create an html scrollable box for the html output

rows, cols

the number of rows or columns to allocate for the scrollable box

vname

unused argument in latex.describe.single

which

specifies whether to plot numeric continuous or binary/categorical variables, or both. When "both" a list with two elements is created. Each element is a ggplot2 or plotly object. If there are no variables of a given type, a single ggplot2 or plotly object is returned, ready to print.

what

character or numeric vector specifying which variables to plot; default is to plot all

sort

specifies how and whether variables are sorted in order of the proportion of positives when which="categorical". Specify sort="none" to leave variables in the order they appear in the original data.

n.unique

the minimum number of distinct values a numeric variable must have before plot.describe uses it in a continuous variable plot

condense

specifies whether to condense the output with regard to the 5 lowest and highest values ("extremes") and the frequency table

lang

specifies the markup language

verb

set to 1 if a verbatim environment is already in effect for LaTeX

Value

a list containing elements descript, counts, values. The list is of class describe. If the input object was a matrix or a data frame, the list is a list of lists, one list for each variable analyzed. latex returns a standard latex object. For numeric variables having at least 20 distinct values, an additional component intervalFreq. This component is a list with two elements, range (containing two values) and count, a vector of 100 integer frequency counts.

Details

If options(na.detail.response=TRUE) has been set and na.action is "na.delete" or "na.keep", summary statistics on the response variable are printed separately for missing and non-missing values of each predictor. The default summary function returns the number of non-missing response values and the mean of the last column of the response values, with a names attribute of c("N","Mean"). When the response is a Surv object and the mean is used, this will result in the crude proportion of events being used to summarize the response. The actual summary function can be designated through options(na.fun.response = "function name").

If you are modifying LaTex parskip or certain other parameters, you may need to shrink the area around tabular and verbatim environments produced by latex.describe. You can do this using for example \usepackage{etoolbox}\makeatletter\preto{\@verbatim}{\topsep=-1.4pt \partopsep=0pt}\preto{\@tabular}{\parskip=2pt \parsep=0pt}\makeatother in the LaTeX preamble.

Examples

Run this code

# NOT RUN {
set.seed(1)
describe(runif(200),dig=2)    #single variable, continuous
                              #get quantiles .05,.10,\dots

dfr <- data.frame(x=rnorm(400),y=sample(c('male','female'),400,TRUE))
describe(dfr)

# }
# NOT RUN {
options(grType='plotly')
d <- describe(mydata)
p <- plot(d)   # create plots for both types of variables
p[[1]]; p[[2]] # or p$Categorical; p$Continuous
plotly::subplot(p[[1]], p[[2]], nrows=2)  # plot both in one
plot(d, which='categorical')    # categorical ones

d <- sas.get(".","mydata",special.miss=TRUE,recode=TRUE)
describe(d)      #describe entire data frame
attach(d, 1)
describe(relig)  #Has special missing values .D .F .M .R .T
                 #attr(relig,"label") is "Religious preference"

#relig : Religious preference  Format:relig
#    n missing  D  F M R T distinct 
# 4038     263 45 33 7 2 1        8
#
#0:none (251, 6%), 1:Jewish (372, 9%), 2:Catholic (1230, 30%) 
#3:Jehovah's Witnes (25, 1%), 4:Christ Scientist (7, 0%) 
#5:Seventh Day Adv (17, 0%), 6:Protestant (2025, 50%), 7:other (111, 3%) 


# Method for describing part of a data frame:
 describe(death.time ~ age*sex + rcs(blood.pressure))
 describe(~ age+sex)
 describe(~ age+sex, weights=freqs)  # weighted analysis

 fit <- lrm(y ~ age*sex + log(height))
 describe(formula(fit))
 describe(y ~ age*sex, na.action=na.delete)   
# report on number deleted for each variable
 options(na.detail.response=TRUE)  
# keep missings separately for each x, report on dist of y by x=NA
 describe(y ~ age*sex)
 options(na.fun.response="quantile")
 describe(y ~ age*sex)   # same but use quantiles of y by x=NA

 d <- describe(my.data.frame)
 d$age                   # print description for just age
 d[c('age','sex')]       # print description for two variables
 d[sort(names(d))]       # print in alphabetic order by var. names
 d2 <- d[20:30]          # keep variables 20-30
 page(d2)                # pop-up window for these variables

# Test date/time formats and suppression of times when they don't vary
 library(chron)
 d <- data.frame(a=chron((1:20)+.1),
                 b=chron((1:20)+(1:20)/100),
                 d=ISOdatetime(year=rep(2003,20),month=rep(4,20),day=1:20,
                               hour=rep(11,20),min=rep(17,20),sec=rep(11,20)),
                 f=ISOdatetime(year=rep(2003,20),month=rep(4,20),day=1:20,
                               hour=1:20,min=1:20,sec=1:20),
                 g=ISOdate(year=2001:2020,month=rep(3,20),day=1:20))
 describe(d)

# Make a function to run describe, latex.describe, and use the kdvi
# previewer in Linux to view the result and easily make a pdf file

 ldesc <- function(data) {
  options(xdvicmd='kdvi')
  d <- describe(data, desc=deparse(substitute(data)))
  dvi(latex(d, file='/tmp/z.tex'), nomargins=FALSE, width=8.5, height=11)
 }

 ldesc(d)
# }

Run the code above in your browser using DataLab