bystats: Statistics by Categories

Description

For any number of cross-classification variables, bystats returns a matrix with the sample size, the number of missing y values, and fun(non-missing y), with the cross-classifications designated by rows. It uses Harrell's modification of the interaction function to produce cross-classifications. The default fun is mean, and if y is binary, the mean is labeled as Fraction. There is a print method as well as a latex method for objects created by bystats. bystats2 handles the special case in which there are 2 classification variables, and places the first one in rows and the second in columns. The print method for bystats2 uses the print.char.matrix function to organize statistics for cells into boxes.
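As a rough illustration of the layout described above, the following base-R sketch builds a similar table by hand for one classification variable. It does not call bystats. The data and the names y, group, and by_hand are hypothetical, and treating N as the count of non-missing y is an assumption made for this sketch.

```r
# Hypothetical toy data; none of these names come from the Hmisc documentation.
y     <- c(1, 0, NA, 1, 1, 0, NA, 0)
group <- c("a", "a", "a", "b", "b", "b", "b", "a")

# Split y by the classification variable and compute one value per level.
cells   <- split(y, group)
n       <- sapply(cells, function(v) sum(!is.na(v)))       # non-missing count (assumed meaning of N)
missing <- sapply(cells, function(v) sum(is.na(v)))        # number of missing y
frac    <- sapply(cells, function(v) mean(v, na.rm = TRUE)) # mean of a binary y is a fraction

by_hand <- cbind(N = n, Missing = missing, Fraction = frac)

# Append a final row computed on all observations combined.
by_hand <- rbind(by_hand,
                 ALL = c(sum(!is.na(y)), sum(is.na(y)), mean(y, na.rm = TRUE)))
print(by_hand)
```

The real function handles several classification variables at once and labels the rows from their combined levels; this sketch only mirrors the column layout.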

Usage

bystats(y, ..., fun, nmiss, subset)
# S3 method for bystats
print(x, ...)
# S3 method for bystats
latex(object, title, caption, rowlabel, ...)
bystats2(y, v, h, fun, nmiss, subset)
# S3 method for bystats2
print(x, abbreviate.dimnames=FALSE,
   prefix.width=max(nchar(dimnames(x)[[1]])), ...)
# S3 method for bystats2
latex(object, title, caption, rowlabel, ...)

Value

For bystats, a matrix with row names equal to the classification labels and column names N, Missing, funlab, where funlab is determined from fun. A row is added to the end with the summary statistics computed on all observations combined. The class of this matrix is bystats. For bystats2, returns a 3-dimensional array with the last dimension corresponding to the statistics being computed. The class of the array is bystats2.
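To make the 3-dimensional layout concrete, here is a minimal base-R sketch that arranges per-cell statistics as rows by columns by statistic, in the spirit of what bystats2 is documented to return. It does not call bystats2 and omits any marginal totals. The variables y, v, h, and arr are hypothetical, and the statistic labels are chosen for illustration only.

```r
# Hypothetical toy data for a two-way classification.
y <- c(1, 0, 1, NA, 0, 1)
v <- factor(c("r1", "r1", "r2", "r2", "r1", "r2"))  # vertical (row) variable
h <- factor(c("c1", "c2", "c1", "c2", "c1", "c2"))  # horizontal (column) variable

# One v-by-h matrix per statistic.
n    <- tapply(y, list(v, h), function(z) sum(!is.na(z)))
miss <- tapply(y, list(v, h), function(z) sum(is.na(z)))
avg  <- tapply(y, list(v, h), function(z) mean(z, na.rm = TRUE))

# Stack the matrices so the last dimension indexes the statistic.
arr <- array(c(n, miss, avg),
             dim      = c(dim(n), 3),
             dimnames = c(dimnames(n), list(c("N", "Missing", "Mean"))))
print(arr["r2", "c2", ])
```

Indexing arr[i, j, ] then gives every statistic for one cell, which is the shape a boxed per-cell printout needs.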

Arguments

y: a binary, logical, or continuous variable or a matrix or data frame of such variables. If y is a data frame it is converted to a matrix. If y is a data frame or matrix, computations are done on subsets of the rows of y, and you should specify fun so as to be able to operate on the matrix. For matrix y, any column with a missing value causes the entire row to be considered missing, and the row is not passed to fun.
...: For bystats, one or more classification variables separated by commas. For print.bystats, options passed to print.default such as digits. For latex.bystats and latex.bystats2, options passed to latex.default such as digits. If you pass cdec to latex.default, keep in mind that the first one or two positions (depending on nmiss) should have zeros since these correspond to frequency counts.
v: vertical variable for bystats2. Will be converted to factor.
h: horizontal variable for bystats2. Will be converted to factor.
fun: a function to compute on the non-missing y for a given subset. You must specify fun= in front of the function name or definition. fun may return a single number or a vector or matrix of any length. Matrix results are rolled out into a vector, with names preserved. When y is a matrix, a common fun is function(y) apply(y, 2, ff) where ff is the name of a function which operates on one column of y.
nmiss: A column containing a count of missing values is included if nmiss=TRUE or if there is at least one missing value.
subset: a vector of subscripts or logical values indicating the subset of data to analyze.
abbreviate.dimnames: set to TRUE to abbreviate dimnames in output
prefix.width: see print.char.matrix
title: title to pass to latex.default. Default is the first word of the character string version of the first calling argument.
caption: caption to pass to latex.default. Default is the heading attribute from the object produced by bystats.
rowlabel: rowlabel to pass to latex.default. Default is the byvarnames attribute from the object produced by bystats. For bystats2 the default is "".
x: an object created by bystats or bystats2
object: an object created by bystats or bystats2
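The description of y and fun for matrix input can be sketched in base R as follows. This is an illustration of the stated behavior (rows with any missing value are dropped, then fun is applied to the remaining rows of each group), not the package's actual implementation. The data and the names ymat, grp, ff, fun, complete, and res are hypothetical.

```r
# Hypothetical two-column outcome and a single classification variable.
ymat <- cbind(pressure    = c(120, 130, NA, 110, 140, 125),
              cholesterol = c(200, NA, 180, 190, 210, 220))
grp  <- c("lo", "lo", "lo", "hi", "hi", "hi")

# ff operates on one column; fun applies it to every column of its matrix argument.
ff  <- median
fun <- function(y) apply(y, 2, ff)

# A row with a missing value in any column is treated as missing and excluded.
complete <- rowSums(is.na(ymat)) == 0

# Apply fun to the complete rows of each group; drop = FALSE keeps a matrix
# even when a group has a single remaining row.
rows_by_group <- split(which(complete), grp[complete])
res <- t(sapply(rows_by_group, function(i) fun(ymat[i, , drop = FALSE])))
print(res)
```

Because fun returns a named vector, the column names of y carry through to the result, which matches the note that names are preserved.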

Side Effects

latex produces a .tex file.

Author

Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com

Examples


if (FALSE) {
bystats(sex==2, county, city)
bystats(death, race)
bystats(death, cut2(age,g=5), race)
bystats(cholesterol, cut2(age,g=4), sex, fun=median)
bystats(cholesterol, sex, fun=quantile)
bystats(cholesterol, sex, fun=function(x)c(Mean=mean(x),Median=median(x)))
latex(bystats(death,race,nmiss=FALSE,subset=sex=="female"), digits=2)
f <- function(y) c(Hazard=sum(y[,2])/sum(y[,1]))
# f() gets the hazard estimate for right-censored data from exponential dist.
bystats(cbind(d.time, death), race, sex, fun=f)
bystats(cbind(pressure, cholesterol), age.decile, 
        fun=function(y) c(Median.pressure   =median(y[,1]),
                          Median.cholesterol=median(y[,2])))
y <- cbind(pressure, cholesterol)
bystats(y, age.decile, 
        fun=function(y) apply(y, 2, median))   # same result as last one
bystats(y, age.decile, fun=function(y) apply(y, 2, quantile, c(.25,.75)))
# The last one computes separately the 0.25 and 0.75 quantiles of 2 vars.
latex(bystats2(death, race, sex, fun=table))
}
