describe is a generic method that invokes describe.data.frame,
describe.matrix, describe.vector, or describe.formula.
describe.vector is the basic function for handling a single variable.
This function determines whether the variable is character, factor,
category, binary, discrete numeric, or continuous numeric, and prints
a concise statistical summary suited to that type. A numeric variable is
deemed discrete if it has <= 10 distinct values. In this case,
quantiles are not printed. A frequency table is printed
for any non-binary variable that has no more than 20 distinct
values. For any variable for which the frequency table is not printed,
the 5 lowest and highest values are printed. This behavior can be
overridden for long character variables with many levels by using the
listunique parameter to get a complete tabulation.
describe is especially useful for describing data frames created by
*.get, as labels, formats, value labels, and (in the case of sas.get)
frequencies of special missing values are printed.
For a binary variable, the sum (number of 1's) and mean (proportion of
1's) are printed. If the first argument is a formula, a model frame
is created and passed to describe.data.frame. If a variable
is of class "impute", a count of the number of imputed values is
printed. If a date variable has an attribute partial.date
(this is set up by sas.get), counts of how many partial dates are
actually present (missing month, missing day, missing both) are also
printed. If a variable was created by the special-purpose function
substi (which substitutes values of a second variable if the first
variable is NA), the frequency table of substitutions is also printed.
For numeric variables, describe adds an item called Info, which is a
relative information measure: the relative efficiency of
a proportional odds/Wilcoxon test on the variable relative to the same
test on a variable that has no ties. Info is related to how
continuous the variable is, and ties are less harmful the more untied
values there are. The formula for Info is one minus the sum of
the cubes of relative frequencies of values, divided by one minus the
square of the reciprocal of the sample size. The lowest information
comes from a variable having only one distinct value, followed by a
highly skewed binary variable. Info is reported to two decimal places.
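The verbal formula for Info can be sketched in base R as follows. The
helper name info_measure is hypothetical and is not part of the package;
this is only an illustration of the formula as worded above, not the
package's own code.

```r
# Hypothetical helper (not an Hmisc function): relative information
# measure computed as (1 - sum(p^3)) / (1 - 1/n^2), where p holds the
# relative frequencies of the distinct values and n is the sample size.
info_measure <- function(x) {
  x <- x[!is.na(x)]            # drop missing values
  n <- length(x)               # sample size (result is NaN when n == 1)
  p <- as.numeric(table(x)) / n
  (1 - sum(p^3)) / (1 - 1 / n^2)
}

no_ties  <- info_measure(1:10)              # every value distinct
constant <- info_measure(rep(1, 10))        # a single distinct value
skewed   <- info_measure(c(rep(0, 99), 1))  # highly skewed binary
```

Under this sketch a variable with no ties scores 1, a constant
variable scores 0, and the skewed binary variable falls just above 0.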
A latex method exists for converting the describe object to a
LaTeX file. For numeric variables having more than 20 distinct values,
describe saves in its returned object the frequencies of 100
evenly spaced bins running from the minimum observed value to the
maximum. When there are 20 or fewer distinct values, the original
values are maintained. latex and html insert a spike histogram
displaying these frequency counts in the tabular material using the
LaTeX picture environment. For example output see
https://hbiostat.org/doc/rms/book/chapter7edition1.pdf.
Note that the latex method assumes you have the following styles
installed in your LaTeX installation: setspace and relsize.
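The binning rule for spike histograms can be illustrated with a small
base R sketch. The function spike_counts and its arguments are
hypothetical, and the exact bin boundaries used by describe may
differ; the sketch only assumes 100 equal-width bins between the
observed minimum and maximum.

```r
# Hypothetical helper (not an Hmisc function): frequencies used for a
# spike histogram. With more than max_distinct distinct values, counts
# are taken over nbins equal-width bins from min(x) to max(x);
# otherwise the original values are tabulated as they are.
spike_counts <- function(x, nbins = 100, max_distinct = 20) {
  x <- x[!is.na(x)]
  if (length(unique(x)) <= max_distinct) return(table(x))
  lo <- min(x)
  hi <- max(x)
  # Bin index in 1..nbins; the maximum is folded into the last bin
  idx <- pmin(floor((x - lo) / (hi - lo) * nbins) + 1, nbins)
  tabulate(idx, nbins = nbins)
}

set.seed(1)
binned <- spike_counts(runif(1000))           # 100 bin counts
raw    <- spike_counts(c(1, 1, 2, 3, 3, 3))   # table of original values
```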
The html method mimics the LaTeX output. This is useful in the
context of Quarto/Rmarkdown html and html notebook output.
If options(prType='html') is in effect, calling print on
an object that is the result of running describe on a data frame
will render the HTML version. If run from the console, a
browser window will open. When which is specified to print,
whether or not prType='html' is in effect, a gt package html table
will be produced containing only the types of variables requested.
When which='both', a list with element names Continuous and
Categorical is produced, making it convenient for the user to print
as desired, or to pass the list directly to the qreport maketabs
function when using Quarto.
The plot method is for describe objects run on data
frames. It produces spike histograms for continuous variables and a
dot chart for categorical variables, showing category proportions.
The graphic format is ggplot2 if the user has not set
options(grType='plotly') or has set the grType option to something
other than 'plotly'. Otherwise interactive plotly graphics are
produced, and these can be placed into an Rmarkdown html notebook.
The user must install the plotly package for this to work. When the
user hovers the mouse over a bin for a raw data value, the actual
value will pop up (formatted using digits). When the user hovers over
the minimum data value, most of the information calculated by
describe will pop up. For each variable, the number of missing values
is used to assign the color to the histogram or dot chart, and a
legend is drawn. Color is not used if there are no missing values in
any variable. For categorical variables, hovering over the leftmost
point for a variable displays details, and for all points
proportions, numerators, and denominators are displayed in the popup.
If both continuous and categorical variables are present and
which='both' is specified, the plot method returns an unclassed list
containing two objects, named 'Categorical' and 'Continuous', in that
order.
Sample weights may be specified to any of the functions, resulting
in weighted means, quantiles, and frequency tables.
Note: As discussed in Cox and Longton (2008), Stata Journal 8(4),
pp. 557-568, the term "unique" has been replaced with "distinct" in the
output (but not in parameter names).
When weights are not used, Gini's mean difference is computed for
numeric variables. This is a robust measure of dispersion that is the
mean absolute difference between all pairs of observations. In simple
output Gini's mean difference is labeled Gmd.
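The definition of Gini's mean difference can be written out directly
in base R. The function gini_md below is a hypothetical illustration
of that definition (an O(n^2) computation over all pairs), not the
routine the package itself uses.

```r
# Hypothetical helper (not an Hmisc function): Gini's mean difference,
# the mean of |x_i - x_j| over all pairs of distinct observations.
gini_md <- function(x) {
  x <- x[!is.na(x)]
  n <- length(x)
  d <- abs(outer(x, x, "-"))   # n x n matrix of absolute differences
  # Each unordered pair appears twice and the diagonal is zero, so
  # dividing by n * (n - 1) gives the mean over pairs.
  sum(d) / (n * (n - 1))
}

gmd_example <- gini_md(c(1, 2, 4))   # pairs differ by 1, 3, and 2
```

For the three values above the pairwise differences are 1, 3, and 2,
so the mean difference is 2.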
formatdescribeSingle is a service function for the latex, html, and
print methods for single variables, and is not intended to be called
by the user.