describe is a generic method that invokes describe.data.frame,
describe.matrix, describe.vector, or describe.formula. describe.vector
is the basic function for handling a single variable.
This function determines whether the variable is character, factor,
category, binary, discrete numeric, or continuous numeric, and prints
a concise statistical summary according to each. A numeric variable is
deemed discrete if it has <= 10 distinct values. In this case,
quantiles are not printed. A frequency table is printed for any
non-binary variable if it has no more than 20 distinct values. For any
variable for which the frequency table is not printed, the 5 lowest and
highest values are printed. This behavior can be overridden for long
character variables with many levels using the listunique parameter, to
get a complete tabulation.
describe is especially useful for describing data frames created by
*.get, as labels, formats, value labels, and (in the case of sas.get)
frequencies of special missing values are printed.
For a binary variable, the sum (number of 1's) and mean (proportion of
1's) are printed. If the first argument is a formula, a model frame
is created and passed to describe.data.frame. If a variable is of class
"impute", a count of the number of imputed values is printed. If a date
variable has an attribute partial.date (this is set up by sas.get),
counts of how many partial dates are actually present (missing month,
missing day, missing both) are also presented. If a variable was created
by the special-purpose function substi (which substitutes values of a
second variable if the first variable is NA), the frequency table of
substitutions is also printed.
For numeric variables, describe adds an item called Info, which is a
relative information measure using the relative efficiency of a
proportional odds/Wilcoxon test on the variable relative to the same
test on a variable that has no ties. Info is related to how continuous
the variable is, and ties are less harmful the more untied values there
are. The formula for Info is one minus the sum of the cubes of relative
frequencies of values, all divided by one minus the square of the
reciprocal of the sample size. The lowest information comes from a
variable having only one distinct value, followed by a highly skewed
binary variable. Info is reported to two decimal places.
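The verbal formula for Info can be written out in a few lines of base R. This is only a sketch of the stated formula, not the package's own code; info_measure is a hypothetical helper name, and it assumes at least two non-missing observations.

```r
# Relative information measure, following the verbal formula:
# (1 - sum of cubed relative frequencies) / (1 - 1/n^2).
# info_measure is a hypothetical helper name, not an Hmisc function.
info_measure <- function(x) {
  x <- x[!is.na(x)]               # drop missing values
  n <- length(x)                  # assumes n >= 2
  p <- as.numeric(table(x)) / n   # relative frequency of each distinct value
  (1 - sum(p^3)) / (1 - 1 / n^2)
}

no_ties  <- info_measure(1:100)                       # 1, up to floating point
constant <- info_measure(rep(7, 100))                 # 0
skewed   <- info_measure(c(rep(0, 95), rep(1, 5)))    # small but positive
```

With no ties every relative frequency is 1/n, so the numerator equals the denominator and the measure is 1. A single distinct value gives 0, and a highly skewed binary variable lands just above that.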
A latex method exists for converting the describe object to a LaTeX
file. For numeric variables having at least 20 distinct values, describe
saves in its returned object the frequencies of 100 evenly spaced bins
running from the minimum observed value to the maximum. latex inserts a
spike histogram displaying these frequency counts in the tabular
material using the LaTeX picture environment. For example output see
http://biostat.mc.vanderbilt.edu/wiki/pub/Main/Hmisc/counties.pdf.
Note that the latex method assumes you have the following styles
installed in your LaTeX installation: setspace and relsize.
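The 100-bin frequency counts behind the spike histogram can be illustrated in base R. The sketch below is one plausible way to form evenly spaced bins from the minimum to the maximum; bin_counts is a hypothetical helper, and the package's internal binning may differ in detail.

```r
# Count observations in nbins evenly spaced bins between min(x) and max(x).
# bin_counts is a hypothetical helper; it assumes max(x) > min(x).
bin_counts <- function(x, nbins = 100) {
  x <- x[!is.na(x)]
  r <- range(x)
  # Map each value to a bin index in 1..nbins; the maximum goes in the last bin.
  idx <- pmin(1 + floor(nbins * (x - r[1]) / (r[2] - r[1])), nbins)
  list(range = r, count = tabulate(idx, nbins = nbins))
}

set.seed(1)
x <- rnorm(1000)
b <- bin_counts(x)
length(b$count)   # 100 bins
sum(b$count)      # every non-missing observation is counted once
```

The returned list mirrors the two pieces of information the text mentions: the observed range and a vector of 100 counts.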
The html method mimics the LaTeX output except for not including
the spike histogram. This is useful in the context of Rmarkdown html
and html notebook output.
The plot method is for describe objects run on data frames. It produces
spike histograms for a graphic of continuous variables and a dot chart
for categorical variables, showing category proportions. The graphic
format is ggplot2 if the user has not set options(grType='plotly') or
has set the grType option to something other than 'plotly'. Otherwise
plotly graphics that are interactive are produced, and these can be
placed into an Rmarkdown html notebook. The user must install the plotly
package for this to work. When the user hovers the mouse over a bin for
a raw data value, the actual value will pop up (formatted using digits).
When the user hovers over the minimum data value, most of the
information calculated by describe will pop up. For each variable, the
number of missing values is used to assign the color to the histogram or
dot chart, and a legend is drawn. Color is not used if there are no
missing values in any variable. For categorical variables, hovering over
the leftmost point for a variable displays details, and for all points
proportions, numerators, and denominators are displayed in the popup. If
both continuous and categorical variables are present and which='both'
is specified, the plot method returns an unclassed list containing two
objects, named 'Categorical' and 'Continuous', in that order.
Sample weights may be specified to any of the functions, resulting
in weighted means, quantiles, and frequency tables.
Note: As discussed in Cox and Longton (2008), Stata Technical Bulletin 8(4)
pp. 557, the term "unique" has been replaced with "distinct" in the
output (but not in parameter names).
When weights are not used, Gini's mean difference is computed for
numeric variables. This is a robust measure of dispersion that is the
mean absolute difference between all pairs of observations. In the
output Gini's mean difference is labeled Gmd.
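The definition of Gini's mean difference given above translates directly into base R. The helper name gmd is hypothetical and the O(n^2) approach is only meant to make the definition concrete; for real work the GiniMd function listed among the related functions is the natural choice.

```r
# Gini's mean difference: the mean of |x_i - x_j| over all pairs i != j.
# gmd is a hypothetical helper name used only for illustration.
gmd <- function(x) {
  x <- x[!is.na(x)]
  n <- length(x)
  # outer() holds every ordered pair; the diagonal (i == j) contributes 0.
  sum(abs(outer(x, x, "-"))) / (n * (n - 1))
}

g <- gmd(c(1, 2, 3))   # pairwise differences 1, 2, 1, so the mean is 4/3
```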
formatdescribeSingle is a service function for latex, html, and print
methods for single variables that is not
intended to be called by the user.
## S3 method for class 'vector'
describe(x, descript, exclude.missing=TRUE, digits=4, listunique=0, listnchar=12, weights=NULL, normwt=FALSE, minlength=NULL, ...)
## S3 method for class 'matrix'
describe(x, descript, exclude.missing=TRUE, digits=4, ...)
## S3 method for class 'data.frame'
describe(x, descript, exclude.missing=TRUE, digits=4, ...)
## S3 method for class 'formula'
describe(x, descript, data, subset, na.action, digits=4, weights, ...)
## S3 method for class 'describe'
print(x, ...)
## S3 method for class 'describe'
latex(object, title=NULL, file=paste('describe',first.word(expr=attr(object,'descript')),'tex',sep='.'), append=FALSE, size='small', tabular=TRUE, greek=TRUE, spacing=0.7, lspace=c(0,0), ...)
## S3 method for class 'describe.single'
latex(object, title=NULL, vname, file, append=FALSE, size='small', tabular=TRUE, greek=TRUE, lspace=c(0,0), ...)
## S3 method for class 'describe'
html(object, size=85, tabular=TRUE, greek=TRUE, scroll=FALSE, rows=25, cols=100, ...)
## S3 method for class 'describe.single'
html(object, vname, size=85, tabular=TRUE, greek=TRUE, ...)
formatdescribeSingle(x, condense=c('extremes', 'frequencies', 'both', 'none'), lang=c('plain', 'latex', 'html'), verb=0, lspace=c(0, 0), size=85, ...)
## S3 method for class 'describe'
plot(x, which=c('both', 'continuous', 'categorical'), what=NULL, sort=c('ascending', 'descending', 'none'), n.unique=10, digits=5, ...)
x: For a data frame, the describe.data.frame function is automatically
invoked. For a matrix, describe.matrix is called. For a formula,
describe.data.frame(model.frame(x)) is invoked. The formula may or may
not have a response variable. For print, latex, html, or
formatdescribeSingle, x is an object created by describe.
descript: When the first argument is a formula, descript defaults to a
character representation of the formula.
digits: For plot.describe, digits is the number of significant digits to
put in hover text for plotly when showing raw variable values.
listunique: For a character variable that is not an mChoice variable,
that has its longest string length greater than listnchar, and that has
no more than listunique distinct values, all values are listed in
alphabetic order. Any value having more than one occurrence has the
frequency of occurrence after it, in parentheses. Specify listunique
equal to some value at least as large as the number of observations to
ensure that all character variables will have all their values listed.
For purposes of tabulating character strings, multiple white spaces of
any kind are translated to a single space, leading and trailing white
space are ignored, and case is ignored.
listnchar: see listunique.
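The whitespace and case handling described for tabulating character strings can be mimicked in base R. This is a sketch of the stated rules only; normalize_strings is a hypothetical helper, and folding to lower case is an assumption about how "case is ignored" might be realized.

```r
# Normalize strings before tabulation, following the rules in the text.
# normalize_strings is a hypothetical helper, not part of the package.
normalize_strings <- function(x) {
  x <- gsub("[[:space:]]+", " ", x)   # any run of white space -> one space
  x <- trimws(x)                      # ignore leading/trailing white space
  tolower(x)                          # ignore case (assumed: fold to lower)
}

cities <- c("New  York", " new york", "NEW\tYORK ", "Boston")
tab <- table(normalize_strings(cities))
tab   # the three spellings of New York collapse into one entry
```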
weights: Each observation is treated as if it were sampled weights
times.
normwt: The default, normwt=FALSE, results in the use of weights as
weights in computing various statistics. In this case the sample size
is assumed to be equal to the sum of weights. Specify normwt=TRUE to
divide weights by a constant so that weights sum to the number of
observations (length of vectors specified to describe). In this case
the number of observations is taken to be the actual number of records
given to describe.
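The effect of the two normwt settings on the implied sample size can be seen with a little arithmetic in base R. The weights and values below are made-up illustration data, and the rescaling shown is simply the "divide by a constant" step described above, not the package's internal code.

```r
# Hypothetical sample weights and values, for illustration only.
w <- c(2, 1, 1, 4, 2)
x <- c(10, 20, 30, 40, 50)

n_raw <- sum(w)                     # normwt=FALSE: sample size taken as sum(w)

# normwt=TRUE: rescale so that the weights sum to the number of observations.
w_norm <- w * length(w) / sum(w)
n_norm <- sum(w_norm)

# Rescaling by a constant leaves weighted means unchanged.
m_raw  <- weighted.mean(x, w)
m_norm <- weighted.mean(x, w_norm)
```

Here n_raw is 10 while n_norm is 5, the number of records, and both weighted means equal 33.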
object: an object created by describe.
na.action: defaults to na.retain, which does not delete any NAs from the
data frame. Use na.action=na.omit or na.delete to drop any observation
with any NA before processing.
...: arguments passed to describe.default, which are passed to calls to
format for numeric variables. For example, if using R POSIXct or Date
date/time formats, specifying describe(d,format='%d%b%Y') will print
date/time variables as "01Jan2000". This is useful for omitting the time
component. See the help file for format.POSIXct or format.Date for more
information. For plot methods, ... is ignored. For html and latex
methods, ... is used to pass optional arguments to formatdescribeSingle,
especially the condense argument.
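The format strings mentioned for date/time variables are ordinary format.Date and format.POSIXct specifications, so their effect can be checked in base R without describe. The sketch below sets the time locale to "C" so that month abbreviations are in English; that choice, and the example values, are assumptions made for reproducibility.

```r
# Use the C locale so that %b yields English month abbreviations.
invisible(Sys.setlocale("LC_TIME", "C"))

d <- as.Date("2000-01-01")
p <- as.POSIXct("2000-01-01 13:45:00", tz = "UTC")

four_digit <- format(d, "%d%b%Y")   # "01Jan2000" (%Y is the 4-digit year)
two_digit  <- format(d, "%d%b%y")   # "01Jan00"   (%y is the 2-digit year)
no_time    <- format(p, "%d%b%Y")   # "01Jan2000"; the time part is dropped
```

Because the format contains no time fields, a POSIXct value prints as a date only, which is the behavior the text describes as omitting the time component.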
file: The default file name is formed from the first word of the
descript element of the describe object, prefixed by "describe". Set
file="" to send LaTeX code to standard output instead of a file.
append: Set to TRUE to have latex append text to an existing file named
file.
size: LaTeX text size ("small", the default, or "normalsize", "tiny",
"scriptsize", etc.) for the describe output in LaTeX. For html, size is
the percent of the prevailing font size to use for the output.
tabular: Set to FALSE to use verbatim rather than tabular (or html
table) environment for the summary statistics output. By default,
tabular is used if the output is not too wide.
greek: By default, the latex and html methods will change names of greek
letters that appear in variable labels to appropriate LaTeX symbols in
math mode, or html symbols, unless greek=FALSE.
spacing: By default, the latex method for describe run on a matrix or
data frame uses the setspace LaTeX package with a line spacing of 0.7 so
as to not waste space. Specify spacing=0 to suppress the use of
setspace's spacing environment, or specify another positive value to use
this environment with a different spacing.
lspace: vertical space adjustments (made through the LaTeX vspace
command) before (first element) and after (second element of lspace)
verbatims.
scroll: Set to TRUE to create an html scrollable box for the html
output.
vname: an argument of latex.describe.single. For html, vname is used to
pass the current variable name.
which: When which is "both", a list with two elements is created. Each
element is a ggplot2 or plotly object. If there are no variables of a
given type, a single ggplot2 or plotly object is returned, ready to
print.
sort: controls how variables are ordered when which="categorical".
Specify sort="none" to leave variables in the order they appear in the
original data.
n.unique: the number of distinct values a numeric variable must have
before plot.describe uses it in a continuous variable plot.
condense: specifies whether to condense the output for the lowest and
highest values ("extremes") and the frequency table.
describe returns a list containing elements descript, counts, values.
The list is of class describe. If the input object was a matrix or a
data frame, the list is a list of lists, one list for each variable
analyzed. latex returns a standard latex object. For numeric variables
having at least 20 distinct values, an additional component,
intervalFreq, is included. This component is a list with two elements,
range (containing two values) and count, a vector of 100 integer
frequency counts.
If options(na.detail.response=TRUE) has been set and na.action is
"na.delete" or "na.keep", summary statistics on the response variable
are printed separately for missing and non-missing values of each
predictor. The default summary function returns the number of
non-missing response values and the mean of the last column of the
response values, with a names attribute of c("N","Mean"). When the
response is a Surv object and the mean is used, this will result in the
crude proportion of events being used to summarize the response. The
actual summary function can be designated through
options(na.fun.response = "function name").
If you are modifying LaTeX parskip or certain other parameters, you may
need to shrink the area around tabular and verbatim environments
produced by latex.describe. You can do this using, for example,
\usepackage{etoolbox}\makeatletter\preto{\@verbatim}{\topsep=-1.4pt
\partopsep=0pt}\preto{\@tabular}{\parskip=2pt
\parsep=0pt}\makeatother
in the LaTeX preamble.
sas.get, quantile, GiniMd, table, summary, model.frame.default, naprint,
lapply, tapply, Surv, na.delete, na.keep, na.detail.response, latex
library(Hmisc)                   # describe() is provided by the Hmisc package
set.seed(1)
describe(runif(200), digits=2)   # single variable, continuous
                                 # get quantiles .05, .10, ...
dfr <- data.frame(x=rnorm(400), y=sample(c('male','female'), 400, TRUE))
describe(dfr)
## Not run:
# options(grType='plotly')
# d <- describe(mydata)
# p <- plot(d) # create plots for both types of variables
# p[[1]]; p[[2]] # or p$Categorical; p$Continuous
# plotly::subplot(p[[1]], p[[2]], nrows=2) # plot both in one
# plot(d, which='categorical') # categorical ones
#
# d <- sas.get(".","mydata",special.miss=TRUE,recode=TRUE)
# describe(d) #describe entire data frame
# attach(d, 1)
# describe(relig) #Has special missing values .D .F .M .R .T
# #attr(relig,"label") is "Religious preference"
#
# #relig : Religious preference Format:relig
# # n missing D F M R T distinct
# # 4038 263 45 33 7 2 1 8
# #
# #0:none (251, 6%), 1:Jewish (372, 9%), 2:Catholic (1230, 30%)
# #3:Jehovah's Witnes (25, 1%), 4:Christ Scientist (7, 0%)
# #5:Seventh Day Adv (17, 0%), 6:Protestant (2025, 50%), 7:other (111, 3%)
#
#
# # Method for describing part of a data frame:
# describe(death.time ~ age*sex + rcs(blood.pressure))
# describe(~ age+sex)
# describe(~ age+sex, weights=freqs) # weighted analysis
#
# fit <- lrm(y ~ age*sex + log(height))
# describe(formula(fit))
# describe(y ~ age*sex, na.action=na.delete)
# # report on number deleted for each variable
# options(na.detail.response=TRUE)
# # keep missings separately for each x, report on dist of y by x=NA
# describe(y ~ age*sex)
# options(na.fun.response="quantile")
# describe(y ~ age*sex) # same but use quantiles of y by x=NA
#
# d <- describe(my.data.frame)
# d$age # print description for just age
# d[c('age','sex')] # print description for two variables
# d[sort(names(d))] # print in alphabetic order by var. names
# d2 <- d[20:30] # keep variables 20-30
# page(d2) # pop-up window for these variables
#
# # Test date/time formats and suppression of times when they don't vary
# library(chron)
# d <- data.frame(a=chron((1:20)+.1),
# b=chron((1:20)+(1:20)/100),
# d=ISOdatetime(year=rep(2003,20),month=rep(4,20),day=1:20,
# hour=rep(11,20),min=rep(17,20),sec=rep(11,20)),
# f=ISOdatetime(year=rep(2003,20),month=rep(4,20),day=1:20,
# hour=1:20,min=1:20,sec=1:20),
# g=ISOdate(year=2001:2020,month=rep(3,20),day=1:20))
# describe(d)
#
# # Make a function to run describe, latex.describe, and use the kdvi
# # previewer in Linux to view the result and easily make a pdf file
#
# ldesc <- function(data) {
# options(xdvicmd='kdvi')
# d <- describe(data, desc=deparse(substitute(data)))
# dvi(latex(d, file='/tmp/z.tex'), nomargins=FALSE, width=8.5, height=11)
# }
#
# ldesc(d)
# ## End(Not run)