R
easier for the author of this package
and are submitted to the public in the hope that they will also be useful to others. The tools in this package can be grouped into five major categories:
memisc provides facilities to work with what users from other
packages like SPSS, SAS, or Stata know as `variable labels', `value labels'
and `user-defined missing values'. In the context of this package these
aspects of the data are represented by the "description", "labels", and
"missing.values" attributes of a data vector.
These facilities are useful, for example, if you work with
survey data that contain coded items like vote intention that
may have the following structure:
Question: ``If there was a parliamentary election next Tuesday, which party would you vote for?''
memisc provides similar facilities.
Labels can be attached to codes by calls like labels(x) <- something
and expanded by calls like labels(x) <- labels(x) + something;
codes can be marked as `missing' by calls like missing.values(x) <- something
and missing.values(x) <- missing.values(x) + something.
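As a minimal sketch of these calls, the following attaches and extends value labels and declares missing values for a small vote intention item. The codes and party labels are made up for illustration and are not taken from a real survey.

```r
library(memisc)

# Hypothetical vote intention codes; 8 and 9 are non-substantive answers.
vote <- as.item(c(1, 2, 3, 8, 9, 2, 1, 3, 9, 8))

# Attach value labels to the substantive codes ...
labels(vote) <- c(
  Conservatives       = 1,
  Labour              = 2,
  "Liberal Democrats" = 3
)
# ... and extend the labels afterwards.
labels(vote) <- labels(vote) + c("Don't know" = 8, "Answer refused" = 9)

# Mark code 9 as missing, then add code 8 to the missing values.
missing.values(vote) <- 9
missing.values(vote) <- missing.values(vote) + 8

# The "variable label" of other packages corresponds to the description.
description(vote) <- "Vote intention"

vote
```

Printing the item shows the labels in place of the codes, with the declared missing values marked as such.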
memisc defines a class called "data.set", which is similar to the class "data.frame".
The main difference is that it is especially geared toward containing survey item data.
Transformations of and within "data.set" objects retain the information about
value labels, missing values etc. Using as.data.frame sets the data up for
R's statistical functions, but doing this explicitly is seldom necessary.
See data.set.
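The following sketch builds a small "data.set" from made-up values, takes a subset of it, and converts it explicitly into a data frame. The variable names and codes are assumptions chosen for illustration.

```r
library(memisc)

# A small data set with one labelled survey item and one plain numeric variable.
survey <- data.set(
  vote = as.item(c(1, 2, 3, 8, 9, 2),
                 labels = c(Conservatives       = 1,
                            Labour              = 2,
                            "Liberal Democrats" = 3,
                            "Don't know"        = 8,
                            "Answer refused"    = 9),
                 missing.values = c(8, 9)),
  age = c(34, 51, 27, 62, 45, 38)
)

# Subsetting returns a "data.set" again, so labels and missing values survive.
older <- subset(survey, age > 30)

# Explicit conversion for R's statistical functions:
# codes declared as missing are expected to become NA.
survey_df <- as.data.frame(survey)
str(survey_df)
```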
More Convenient Import of External Data
Survey data sets are often relatively large and contain up to a few thousand variables.
For specific analyses, however, one needs only a relatively small subset of these variables.
Although modern computers have enough RAM to load such data sets completely into an R session,
doing so is inefficient if most of the variables have to be dropped after loading. Also, loading
such a large data set completely can be time-consuming, because R has to allocate space for
each of the many variables. Loading just the subset of variables really needed for an analysis
is more efficient and convenient, and it tends to be much quicker. Thus this package provides
facilities to load such subsets of variables, without the need to load a complete data set.
Further, the loading of data from SPSS files is organized in such a way that all information
about variable labels, value labels, and user-defined missing values is retained.
This is made possible by the definition of importer objects, for which
a subset method exists. importer objects contain only
the information about the variables in the external data set but not the data.
The data itself is loaded into memory when the functions subset
or as.data.set are used.
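A sketch of this workflow, assuming the 1948 American National Election Study file that is distributed with memisc (the same data used by the anes48 demo) is available under the path shown. The two selected variable names are taken from that file and serve only as an illustration.

```r
library(memisc)

# Unpack the SPSS portable file bundled with the package
# (assumed location inside the installed package: anes/NES1948.ZIP).
nes1948.por <- UnZip("anes/NES1948.ZIP", "NES1948.POR", package = "memisc")

# Create an importer object: only information about the variables is read.
nes1948 <- spss.portable.file(nes1948.por)
description(nes1948)

# The data are read only now, and only for the selected variables.
vote.48 <- subset(nes1948, select = c(v480018, v480029))
vote.48
```

The importer object itself stays small, so inspecting the available variables is cheap even for files with thousands of them.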
Recoding
memisc also contains facilities for recoding
survey items. Simple recodings, for example collapsing answer
categories, can be done using the function recode. More
complex recodings, for example the construction of indices from
multiple items, and complex case distinctions, can be done
using the function cases. This function may also
be useful for programming, insofar as it is a generalization of
ifelse.
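A brief sketch of both functions follows. The five-point agreement item and the numeric vector are invented for illustration; the recode call uses the form label = new code <- old codes.

```r
library(memisc)

# A made-up five-point agreement item.
agree <- as.item(c(1, 2, 3, 4, 5, 3, 1),
                 labels = c("Strongly agree"    = 1,
                            "Agree"             = 2,
                            "Neither"           = 3,
                            "Disagree"          = 4,
                            "Strongly disagree" = 5))

# Collapse the five answer categories into three.
agree3 <- recode(agree,
                 "Agree"    = 1 <- 1:2,
                 "Neither"  = 2 <- 3,
                 "Disagree" = 3 <- 4:5)

# cases() handles more than the two branches that ifelse() offers;
# with logical conditions it returns a factor named after the conditions.
x <- c(-2, -1, 0, 1, 2)
sign_of_x <- cases(
  negative = x < 0,
  zero     = x == 0,
  positive = x > 0
)
sign_of_x
```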
Code Books
There is a function codebook which produces a code book of an
external data set or an internal "data.set" object. A codebook contains in a
conveniently formatted way concise information about every variable in a data set,
such as which value labels and missing values are defined and some univariate statistics.
An extended example of all these facilities is contained in the vignette "anes48",
and in demo(anes48).
genTable is a generalization of xtabs:
Instead of counts, descriptive statistics like means or variances
can be reported conditional on levels of factors. Conditional
percentages of a factor can also be obtained using this function.
In addition, a formula method for the aggregate generic
function is provided. It has the same syntax as genTable, but
gives a data frame of descriptive statistics instead of a table
object.
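As an illustration, the sketch below applies genTable to the warpbreaks data that ships with R. The choice of data set and statistics is an assumption made for the example.

```r
library(memisc)

# Mean number of warp breaks for each combination of wool type and tension.
cell_means <- genTable(mean(breaks) ~ wool + tension, data = warpbreaks)
cell_means

# Several statistics at once: mean and standard deviation per cell.
cell_stats <- genTable(c(mean(breaks), sd(breaks)) ~ wool + tension,
                       data = warpbreaks)
cell_stats
```

The left-hand side of the formula is an expression evaluated within each cell defined by the factors on the right-hand side.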
Per-Subset Analysis
By is a variant of the standard function by: Conditioning factors
are specified by a formula and are
obtained from the data frame the subsets of which are to be analysed.
Therefore there is no need to attach the data frame
or to use the dollar operator.
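The difference to by can be sketched as follows, again with the warpbreaks data as an assumed example: a separate linear model is fitted for each tension level.

```r
library(memisc)

# With By: the conditioning factor is named in a formula and looked up in 'data'.
fits <- By(~ tension, lm(breaks ~ wool), data = warpbreaks)

# The same analysis with base R's by() needs the dollar operator
# and an explicit function of the subset.
fits_base <- by(warpbreaks, warpbreaks$tension,
                function(d) lm(breaks ~ wool, data = d))

# Compare the coefficients for the first tension level.
coef(fits[[1]])
coef(fits_base[[1]])
```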
Graphical Model Comparison
Termplot is a variant and an extension of termplot:
The plots are similar to those of termplot but use lattice
graphics. Also, Termplot can be used on more than one model and allows
comparing the fit of linear or non-linear effect specifications of different
models.
Use example(Termplot) or demo(Termplot) for an example.
Journals of the Political and Social Sciences usually require that estimates of regression models are presented in the following form:

==================================================
                  Model 1    Model 2    Model 3
--------------------------------------------------
Coefficients
(Intercept)       30.628***  6.360***   28.566***
                  (7.409)    (1.252)    (7.355)
pop15             -0.471**              -0.461**
                  (0.147)               (0.145)
pop75             -1.934                -1.691
                  (1.041)               (1.084)
dpi                          0.001      -0.000
                             (0.001)    (0.001)
ddpi                         0.529*     0.410*
                             (0.210)    (0.196)
--------------------------------------------------
Summaries
R-squared         0.262      0.162      0.338
adj. R-squared    0.230      0.126      0.280
N                 50         50         50
==================================================
Such tables of coefficient estimates can be produced
by mtable. To see some of the possibilities of
this function, use example(mtable).
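A sketch of how a table of this kind might be produced. It assumes that the three models are linear regressions of the savings rate on the variables of R's LifeCycleSavings data; the coefficients of such models agree with those shown in the table, but the exact calls used by the package author are not given here.

```r
library(memisc)

# Three nested and non-nested specifications (assumed model formulas).
lm1 <- lm(sr ~ pop15 + pop75,              data = LifeCycleSavings)
lm2 <- lm(sr ~                 dpi + ddpi, data = LifeCycleSavings)
lm3 <- lm(sr ~ pop15 + pop75 + dpi + ddpi, data = LifeCycleSavings)

# Combine the estimates into one table, with a chosen set of summary statistics.
coef_table <- mtable("Model 1" = lm1, "Model 2" = lm2, "Model 3" = lm3,
                     summary.stats = c("R-squared", "adj. R-squared", "N"))
coef_table
```

Coefficients that do not appear in a model are left blank in its column, which makes the specifications easy to compare side by side.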
LaTeX Representation of R Objects
Output produced by mtable can be transformed into
LaTeX tables by an appropriate method of the generic function
toLatex, which is defined in the package utils. In addition, memisc
defines toLatex methods for matrices and ftable objects. Note that
results produced by genTable can be coerced into
ftable objects. Also, a default method
for the toLatex function is defined which coerces its
argument to a matrix and applies the matrix method of toLatex.
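The following sketch applies toLatex to a model table, a matrix, and an ftable object. The model, the matrix values, and the use of the Titanic data are assumptions chosen for illustration.

```r
library(memisc)

# LaTeX code for a table of coefficient estimates.
lm1 <- lm(sr ~ pop15 + pop75, data = LifeCycleSavings)
tex_models <- toLatex(mtable("Model 1" = lm1))

# LaTeX code for a plain matrix.
m <- matrix(c(1.5, 2.25, 3, 4.75), nrow = 2,
            dimnames = list(c("a", "b"), c("x", "y")))
tex_matrix <- toLatex(m)

# LaTeX code for a flat contingency table.
tex_ftable <- toLatex(ftable(Titanic, row.vars = 1:2))

# The results are character vectors that can be written to a file
# or printed for inclusion in a document.
writeLines(tex_matrix)
```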
The memisc package defines a function Simulate,
which can be used to conduct simulation experiments: For a given
number of replications and given sets of parameters (which can
be varied across experimental conditions) data are generated and
can be summarized afterwards by other methods. Use example(Simulate),
demo(monte.carlo), demo(lm.monte.carlo), demo(random.walk),
or demo(schellings) for examples.
Sometimes users want to construct loops that run over variables rather than values,
for example, to set the missing values of a battery of items.
For this purpose, the package contains the function foreach.
To set 8 and 9 as missing values for the items knowledge1,
knowledge2, and knowledge3, one can use
foreach(x=c(knowledge1,knowledge2,knowledge3),
missing.values(x) <- 8:9)
Changing Names of Objects and Labels of Factors
R already makes it possible to change the names of an object.
Substituting the names or dimnames
can be done with some programming tricks. This package defines
the functions rename, dimrename, colrename, and rowrename
that implement these tricks in a convenient way, so that programmers
(like the author of this package) need not reinvent the wheel in
every instance of changing names of an object.
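A short sketch of three of these functions, using a made-up vector and matrix. Each call takes pairs of the form oldname = "newname".

```r
library(memisc)

# Rename the elements of a named vector.
x <- c(a = 1, b = 2)
x_renamed <- rename(x, a = "A", b = "B")

# Rename columns and rows of a matrix.
m <- matrix(1:4, nrow = 2,
            dimnames = list(c("r1", "r2"), c("c1", "c2")))
m_cols <- colrename(m, c1 = "first", c2 = "second")
m_rows <- rowrename(m, r1 = "top")

names(x_renamed)
colnames(m_cols)
rownames(m_rows)
```

Names that are not mentioned in the call are left unchanged, so only the names to be replaced need to be listed.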
Dimension-Preserving Versions of lapply and sapply
If a function that is involved in a call to
sapply returns an array or a matrix, the
dimensional information gets lost. Also, if a list object to which
lapply or sapply is applied
has a dimension attribute, the result loses this information.
The functions Lapply and
Sapply defined in this package preserve such
dimensional information.
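The loss of dimensional information can be seen in a small sketch. The helper function scaled_identity is hypothetical and only serves to return a 2 x 2 matrix for each input value.

```r
library(memisc)

# Hypothetical helper: returns a 2 x 2 diagonal matrix for each input value.
scaled_identity <- function(i) diag(i, nrow = 2)

# sapply() flattens each matrix into a column of length 4.
flat <- sapply(1:3, scaled_identity)
dim(flat)

# Sapply() is expected to keep the matrix shape of each result
# and to add a further dimension for the elements looped over.
kept <- Sapply(1:3, scaled_identity)
dim(kept)
```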
Combining Vectors and Arrays by Names
The generic function collect collects several objects of the
same mode into one object, using their names, rownames,
colnames and/or dimnames. There are methods for
atomic vectors, arrays (including matrices), and data frames.
For example
x <- c(a=1,b=2)
y <- c(a=10,c=30)
collect(x,y)
leads to
   x  y
a  1 10
b  2 NA
c NA 30
Reordering of Matrices and Arrays
The memisc package includes a reorder
method for arrays and matrices. For example, the matrix
method by default reorders the rows of a matrix according to the results
of a function.