This function uses basic R graphics to draw a two-dimensional scatterplot, with options to allow for plot enhancements that are often helpful with regression problems. Enhancements include adding marginal boxplots, estimated mean and variance functions using either parametric or nonparametric methods, point identification, jittering, setting characteristics of points and lines like color, size and symbol, marking points and fitting lines conditional on a grouping variable, and other enhancements.
sp is an abbreviation for scatterplot.
scatterplot(x, ...)

# S3 method for formula
scatterplot(formula, data, subset, xlab, ylab, id=FALSE,
    legend=TRUE, ...)

# S3 method for default
scatterplot(x, y, boxplots=if (by.groups) "" else "xy",
    regLine=TRUE, legend=TRUE, id=FALSE, ellipse=FALSE, grid=TRUE,
    smooth=TRUE,
    groups, by.groups=!missing(groups),
    xlab=deparse(substitute(x)), ylab=deparse(substitute(y)),
    log="", jitter=list(), cex=par("cex"),
    col=carPalette()[-1], pch=1:n.groups,
    reset.par=TRUE, ...)

sp(x, ...)
vector of horizontal coordinates (or first argument of generic function).
vector of vertical coordinates.
a model formula, of the form y ~ x or, if plotting by groups, y ~ x | z, where z evaluates to a factor or other variable dividing the data into groups. If x is a factor, then parallel boxplots are produced using the Boxplot function.
data frame within which to evaluate the formula.
expression defining a subset of observations.
if "x", a marginal boxplot for the horizontal x-axis is drawn below the plot; if "y", a marginal boxplot for the vertical y-axis is drawn to the left of the plot; if "xy", both marginal boxplots are drawn; set to "" or FALSE to suppress both boxplots.
controls adding a fitted regression line to the plot. If regLine=FALSE, no line is drawn. If TRUE, the default, an OLS line is fit. This argument can also be a list. The default of TRUE is equivalent to regLine=list(method=lm, lty=1, lwd=2, col=col), which specifies using the lm function to estimate the fitted line and drawing a solid line (lty=1) of width 2 times the nominal width (lwd=2) in the color given by the first element of the col argument described below.
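As a hedged illustration of the list form, the sketch below keeps lm as the fitting function but overrides the line type, width and color. The name reg_args is introduced here only for illustration, and the plotting call assumes the car and carData packages are installed.

```r
# Hypothetical name: reg_args exists only for this illustration.
# Same fitting function as the default, but a dashed, wider, blue line.
reg_args <- list(method = lm, lty = 2, lwd = 3, col = "blue")

if (requireNamespace("car", quietly = TRUE)) {
  # The smoother is turned off so that only the regression line is drawn.
  car::scatterplot(prestige ~ income, data = carData::Prestige,
                   regLine = reg_args, smooth = FALSE)
}
```

Any element left out of the list should fall back to the default described above.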
when the plot is drawn by groups, controls placement and properties of a legend; if FALSE, the legend is suppressed. Can be a list of named arguments, as follows: title for the legend; inset, giving space as a proportion of the axes to offset the legend from the axes; coords, specifying the position of the legend in any form acceptable to the legend function (if not given, the legend is placed above the plot in the upper margin); columns for the legend, determined automatically to prefer a horizontal layout if not given explicitly; cex, giving the relative size of the legend symbols and text. TRUE (the default) is equivalent to list(title=deparse(substitute(groups)), inset=0.02, cex=1).
controls point identification; if FALSE (the default), no points are identified; can be a list of named arguments to the showLabels function; TRUE is equivalent to list(method="mahal", n=2, cex=1, col=carPalette()[-1], location="lr"), which identifies the 2 points (in each group) with the largest Mahalanobis distances from the center of the data. See showLabels for a description of the other arguments. The default behavior of id is not the same in all graphics functions in car, as the method used depends on the type of plot.
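To make the "largest Mahalanobis distances" rule concrete, here is a minimal base-R sketch of the idea. It assumes the classical mean and covariance matrix of the plotted points; the exact computation inside showLabels may differ, and the simulated data and variable names are invented for illustration.

```r
set.seed(7)
# 30 ordinary points plus two planted outliers in rows 31 and 32
xy <- cbind(x = rnorm(30), y = rnorm(30))
xy <- rbind(xy, c(6, 6), c(-5, 7))

# Squared Mahalanobis distance of each point from the center of the data
d2 <- mahalanobis(xy, center = colMeans(xy), cov = cov(xy))

# Row indices of the n = 2 most extreme points, as id=TRUE would label
flagged <- order(d2, decreasing = TRUE)[1:2]
flagged
```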
controls plotting data-concentration ellipses. If FALSE (the default), no ellipses are plotted. Can be a list of named values giving levels, a vector of one or more bivariate-normal probability-contour levels at which to plot the ellipses; robust, a logical value determining whether to use the cov.trob function in the MASS package to calculate the center and covariance matrix for the data ellipses; and fill and fill.alpha, which control whether the ellipse is filled and the transparency of the fill. TRUE is equivalent to list(levels=c(.5, .95), robust=TRUE, fill=TRUE, fill.alpha=0.2).
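A short sketch of the list form follows, using only the sub-arguments named above. The name ell_args is hypothetical, and the plotting call assumes that car and carData are installed.

```r
# Hypothetical name: ell_args exists only for this illustration.
# Unfilled 50% and 90% ellipses based on the classical (non-robust) estimates.
ell_args <- list(levels = c(0.5, 0.9), robust = FALSE, fill = FALSE)

if (requireNamespace("car", quietly = TRUE)) {
  # Fitted line and smoother are suppressed to keep the ellipses visible.
  car::scatterplot(prestige ~ income, data = carData::Prestige,
                   ellipse = ell_args, regLine = FALSE, smooth = FALSE)
}
```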
If TRUE, the default, a light-gray background grid is put on the graph.
specifies a nonparametric estimate of the mean or median function of the vertical axis variable given the horizontal axis variable and optionally a nonparametric estimate of the conditional variance. If smooth=FALSE neither function is drawn. If smooth=TRUE, then both the mean function and variance functions are drawn for ungrouped data, and the mean function only is drawn for grouped data. The default smoother is loessLine, which uses the loess function from the stats package. This smoother is fast and reliable. See the details below for changing the smoother, the line type, width and color of the added lines, and for adding arguments for the smoother.
a factor or other variable dividing the data into groups; groups are plotted with different colors, plotting characters, fits, and smooths. Using this argument is equivalent to specifying the grouping variable in the formula.
if TRUE (the default when there are groups), regression lines are fit by groups.
label for horizontal axis.
label for vertical axis.
same as the log argument to plot, to produce log axes.
a list with elements x or y or both, specifying jitter factors for the horizontal and vertical coordinates of the points in the scatterplot. The jitter function is used to randomly perturb the points; specifying a factor of 1 produces the default jitter. Fitted lines are unaffected by the jitter.
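The effect of the jitter factor can be seen with the base R jitter function on its own. This sketch assumes that scatterplot passes each list element on as the factor argument of jitter; the data are made up for illustration.

```r
set.seed(42)
# Discrete values with ties, where overplotting would hide points
x <- c(1, 1, 2, 2, 3, 3, 5)

x_default <- jitter(x, factor = 1)  # the default amount of jitter
x_wide    <- jitter(x, factor = 2)  # twice the default perturbation range

# Largest perturbation under each factor; with a smallest gap of 1 between
# distinct values, jitter() perturbs by at most factor/5
max(abs(x_default - x))
max(abs(x_wide - x))
```

Only the plotted points are moved in this way; the fits are still computed from the original coordinates.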
with no grouping, this specifies a color for plotted points; with grouping, this argument should be a vector of colors of length at least equal to the number of groups. The default is the value returned by carPalette()[-1].
plotting characters for points; default is the plotting characters in order (see par).
sets the size of plotting characters, with cex=1 the standard size. You can also set the sizes of other elements with the arguments cex.axis, cex.lab, cex.main, and cex.sub. See par.
if TRUE (the default) then plotting parameters are reset to their previous values when scatterplot exits; if FALSE then the mar and mfcol parameters are altered for the current plotting device. Set to FALSE if you want to add graphical elements (such as lines) to the plot.
other arguments passed down to plot. For example, the argument las sets the style of the axis labels, and xlim and ylim set the limits on the horizontal and vertical axes, respectively; see par.
If points are identified, their labels are returned; otherwise NULL is returned invisibly.
Many arguments to scatterplot were changed in version 3 of car to simplify use of this function.
The smooth argument is usually set to either TRUE or FALSE to draw, or omit, the smoother. Alternatively smooth can be set to a list of arguments. The default behavior of smooth=TRUE is equivalent to smooth=list(smoother=loessLine, var=!by.groups, lty.smooth=2, lty.var=4), specifying the smoother to be used, whether to include the variance smooth, and the line types for the curves. You can also specify the colors you want to use for the mean and variance smooths with the arguments col.smooth and col.var. Alternative smoothers are gamLine, which uses the gam function from the mgcv package, and quantregLine, which uses quantile regression to estimate the median and quartile functions using rqss from the quantreg package. All of these smoothers have one or more arguments described on their help pages, and these arguments can be added to the smooth argument; for example, smooth = list(span=1/2) would use the default loessLine smoother, include the variance smooth, and change the value of the smoothing parameter to 1/2.

For loessLine and gamLine the variance smooth is estimated by separately smoothing the squared positive and negative residuals from the mean smooth, using the same type of smoother. The displayed curves are equal to the mean smooth plus the square root of the fit to the positive squared residuals, and the mean fit minus the square root of the smooth of the negative squared residuals. The lines therefore represent the conditional variability at each value on the horizontal axis. Because smoothing is done separately for positive and negative residuals, the variation shown will generally not be symmetric about the fitted mean function. For the quantregLine method, the center estimates the median for each value on the horizontal axis, and the variability estimates the lower and upper quartiles of the estimated conditional distribution for each value of the horizontal axis.
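The variance smooth described above can be sketched with base R alone. This is an illustration of the idea rather than the code used by loessLine: the simulated data, the span and degree values, the evaluation grid, and the pmax guard against negative fitted values are all assumptions made for this example.

```r
set.seed(1)
n <- 200
x <- sort(runif(n, 0, 10))
y <- 2 + 0.5 * x + rnorm(n, sd = 0.3 + 0.15 * x)  # spread grows with x
d <- data.frame(x = x, y = y)

# Step 1: mean smooth and its residuals
mean_fit <- loess(y ~ x, data = d, span = 2/3, degree = 1)
d$res  <- d$y - fitted(mean_fit)
d$res2 <- d$res^2

# Step 2: smooth the squared positive and negative residuals separately
pos <- d[d$res > 0, ]
neg <- d[d$res < 0, ]
up_fit <- loess(res2 ~ x, data = pos, span = 2/3, degree = 1)
lo_fit <- loess(res2 ~ x, data = neg, span = 2/3, degree = 1)

# Step 3: evaluate on a grid inside the range covered by both subsets
grid <- data.frame(x = seq(max(min(pos$x), min(neg$x)),
                           min(max(pos$x), max(neg$x)),
                           length.out = 50))
m <- predict(mean_fit, newdata = grid)

# Mean smooth plus/minus the square root of the smoothed squared residuals;
# pmax() guards against small negative fitted values (an added safeguard)
upper <- m + sqrt(pmax(predict(up_fit, newdata = grid), 0))
lower <- m - sqrt(pmax(predict(lo_fit, newdata = grid), 0))
```

Drawing lines(grid$x, upper) and lines(grid$x, lower) on a plot of y against x would show bands that widen with x and need not be symmetric about the mean smooth.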
The sub-arguments spread, lty.spread and col.spread of the smooth argument are equivalent to the newer var, lty.var and col.var, respectively, recognizing that the spread is a measure of conditional variability.
Fox, J. and Weisberg, S. (2019) An R Companion to Applied Regression, Third Edition, Sage.
boxplot, jitter, legend, scatterplotMatrix, dataEllipse, Boxplot, cov.trob, showLabels, ScatterplotSmoothers.
# NOT RUN {
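library(car)  # also attaches carData, which supplies Prestige, Vocab and UN
# default plot with data-concentration ellipses added
scatterplot(prestige ~ income, data=Prestige, ellipse=TRUE)
# use quantile regression for median and quartile fits (needs the quantreg package)
scatterplot(prestige ~ income, data=Prestige, smooth=list(smoother=quantregLine))
scatterplot(prestige ~ income | type, data=Prestige,
    smooth=list(smoother=quantregLine, var=TRUE, span=1, lwd=4, lwd.var=2))
# grouped plot with the legend placed inside the plotting region
scatterplot(prestige ~ income | type, data=Prestige, legend=list(coords="topleft"))
# jitter both coordinates of discrete data
scatterplot(vocabulary ~ education, jitter=list(x=1, y=1),
    data=Vocab, smooth=FALSE, lwd=3)
# log axes on both scales, labelling 5 points
scatterplot(infantMortality ~ ppgdp, log="xy", data=UN, id=list(n=5))
# a factor on the right-hand side produces parallel boxplots
scatterplot(income ~ type, data=Prestige)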
# }
# NOT RUN {
# remember to exit from point-identification mode
scatterplot(infantMortality ~ ppgdp, id=list(method="identify"), data=UN)
# }