Plot: Scatter Plot for One (Dot Plot) or Two Variables

Description

Abbreviation: plt

Plots individual points. For two variables a scatterplot is produced, and for a data frame of numeric variables, a scatterplot matrix and a correlation matrix are produced for all the variables in the data frame. If the values of the first specified variable are sorted, then a line plot is generated in place of the scatterplot.

The first variable can be numeric or a factor. The second variable must be numeric. For a single numeric variable, plots line segments for the plot of a function or a run chart, including an option for adding dates to the horizontal axis for a time series chart. One enhancement over the standard R plot function is the automatic inclusion of color. The colors of the line segments, the points, the background, the area under the plotted line segments, the grid lines, and the border can each be specified explicitly, with defaults provided, or set together through one of the pre-defined color themes defined by the set function. For Likert style response data of two variables, that is, when each variable has fewer than 10 unique integer values, the points in the plot are transformed into a bubble plot in which the size of each bubble, i.e., point, is determined by the corresponding joint frequency.

For one variable, plots a one dimensional scatter plot, that is, a dot chart, also called a strip chart. Also identifies outliers according to the criteria specified by a box plot.

If a scatterplot of two numeric variables is displayed, then the corresponding correlation coefficient, the hypothesis test of zero population correlation, and the 95% confidence interval are also displayed. The numeric values are the same as those generated by the standard R function cor.test, though presented in a more readable format. Also, as an option, the .95 data ellipse from John Fox's car package can enclose the points of the scatterplot.
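As a rough illustration of the numbers described above, the following base R sketch calls cor.test directly and prints the correlation, its 95% confidence interval, and the p-value. The simulated data and variable names are assumptions made for the example, and the printed layout is not the layout that Plot uses.

```r
# Simulated data; the names and values are illustrative only
set.seed(123)
x <- rnorm(30, mean = 50, sd = 10)
y <- 0.5 * x + rnorm(30, mean = 25, sd = 8)

# Base R test of zero population correlation (Pearson by default)
ct <- cor.test(x, y)

r  <- unname(ct$estimate)  # sample correlation coefficient
ci <- ct$conf.int          # confidence interval, 95% by default
p  <- ct$p.value           # p-value for H0: population correlation is 0

cat(sprintf("r = %.3f, 95%% CI [%.3f, %.3f], p-value = %.4f\n",
            r, ci[1], ci[2], p))
```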

Usage

Plot(x, y=NULL, dframe=mydata, type=NULL, ncut=4,
         col.line=NULL, col.area=NULL, col.box="black",
         col.pts=NULL, col.fill=NULL, trans.pts=NULL,
         pch=NULL, col.grid=NULL, col.bg=NULL,
         colors=c("blue", "gray", "rose", "green", "gold", "red"),
         cex.axis=.85, col.axis="gray30",
         col.ticks="gray30", xy.ticks=TRUE,
         xlab=NULL, ylab=NULL, main=NULL, cex=NULL,
         x.start=NULL, x.end=NULL, y.start=NULL, y.end=NULL,
         time.start=NULL, time.by=NULL, time.reverse=FALSE,
         kind=c("default", "regular", "bubble", "sunflower"),
         fit.line=c("none", "loess", "ls"), col.fit.line="grey55",
         center.line=NULL,
         col.bubble=NULL, bubble.size=.25, col.flower=NULL,
         ellipse=FALSE, col.ellipse="lightslategray", fill.ellipse=TRUE, 
         pt.reg=21, pt.out=19, 
         col.out30="firebrick2", col.out15="firebrick4", new=TRUE,
         text.out=TRUE, ...)
plt(...)

Arguments

x

If both x and y are specified, then the x values are plotted on the horizontal axis. If x is not sorted, a scatter plot is produced. If x is sorted, then a function is plotted with a smooth line. If only x is specified with no y, then

y

Coordinates of points in the plot on the vertical axis.

dframe

Optional data frame that contains one or both of the variables of interest, default is mydata.

type

Character string that indicates the type of plot, either "p" for points, "l" for line, or "b" for both. If x and y are provided and x is sorted so that a function is plotted, the default is "

ncut

When analyzing all the variables in a data frame, specifies the largest number of unique values for which a variable of a numeric data type is analyzed as a categorical variable. Set to 0 to turn off.

col.line

Color of any plotted line segments, with a default of "darkblue".

col.area

Color of area under the plotted line segments. To have a border at the bottom and right of a run chart but retain the property of no area color, specify a color of "transparent". If the values exhibit a trend and dates

col.box

Color of border around the plot background, the box, that encloses the plot, with a default of "black".

col.pts

Color of the border of the plotted points.

col.fill

For plotted points, the interior color of the point. For a scatterplot the default value is transparent. For a run chart the default value is the color of the point's border, col.pts.

trans.pts

Transparency of the plotted points, from opaque at 0 to perfectly transparent at 1. Default is 0.6.

pch

The standard plot character, with values defined in help(points). The default value is 21, a circle with both a border and filled area, specified here as col.pts and col.fill. For a scatterplot, col.fill

col.grid

Color of the grid lines, with a default of "grey90".

col.bg

Color of the plot background.

colors

Sets the color palette.

cex.axis

Scale magnification factor, which by default displays the axis values smaller than the axis labels.

col.axis

Color of the font used to label the axis values.

col.ticks

Color of the ticks used to label the axis values.

xy.ticks

Flag that indicates if tick marks and associated values on the axes are to be displayed.

xlab

Label for x-axis. For two variables specified, x and y, if xlab not specified, then the label becomes the name of the corresponding variable. If xy.ticks is FALSE, then no label is displayed. If no y v

ylab

Label for y-axis. If not specified, then the label becomes the name of the corresponding variable. If xy.ticks is FALSE, then no label is displayed.

main

Label for the title of the graph. If the corresponding variable labels exist in the data frame mylabels, then the title is set by default from the corresponding variable labels.

cex

Magnification factor for any displayed points, with default of cex=1.0.

center.line

Plots a dashed line through the middle of a run chart. The two possible values are "mean" and "median". By default, a center line at the median is provided when the values vary randomly about the mean.

x.start

For Likert style response data, the starting integer value of the x-axis. Useful if the actual data do not include all possible values.

x.end

For Likert style response data, the ending integer value of the x-axis. Useful if the actual data do not include all possible values.

y.start

For Likert style response data, the starting integer value of the y-axis. Useful if the actual data do not include all possible values.

y.end

For Likert style response data, the ending integer value of the y-axis. Useful if the actual data do not include all possible values.

time.start

Optional starting date for first data value. Format must be "%Y-%m-%d" or "%Y/%m/%d". If using with time.reverse, the first date is after the data are reverse sorted. Not needed if data are a time series with

time.by

Accompanies the time.start specification, the interval to increment the date for each sequential data value. A character string, containing one of "day", "week", "month" or "year"

time.reverse

When TRUE, reverse the ordering of the dates, particularly when the data are listed such that first row of data is the newest. Accompanies the time.start specification.

kind

Default is "default", which becomes a "regular" scatterplot for most data. If Likert style response data is plotted, that is, each variable has less than 10 integer values, then instead by default a bubble plot is

fit.line

The best fitting line. Default value is "none", with options for "loess" and "ls".

col.fit.line

Color of the best fitting line, if the fit.line option is invoked.

col.bubble

Color of the bubbles if a bubble plot of the frequencies is plotted.

bubble.size

Size of the bubbles in a bubble plot of Likert style data.

col.flower

Color of the flowers if a sunflower plot of the frequencies is plotted.

ellipse

If TRUE, enclose a scatterplot with the .95 data ellipse from the car package.

col.ellipse

Color of the ellipse.

fill.ellipse

If TRUE, fill the ellipse with a translucent shade of col.ellipse.

pt.reg

Type of regular (non-outlier) point. See help for points for more information. Default is 21, a circle with no fill.

pt.out

Type of point for outliers. Default is 19, a filled circle.

col.out30

Color of severe outliers.

col.out15

Color of potential outliers.

text.out

If TRUE, then display text output in console.

new

If TRUE, then add the dot plot to an existing graph.

...

Other parameter values for graphics as defined by and then processed by plot and par, including xlim, ylim, lwd,

Details

DATA FRAME ACCESS If the variable is in a data frame, the input data frame has the assumed name of mydata. If this data frame is named something different, then specify the name with the dframe option. Regardless of its name, the data frame need not be attached to reference the variable directly by its name, that is, no need to invoke the mydata$name notation. If two variables are specified, both variables should be in the data frame, or one of the variables is in the data frame and the other in the user's workspace, the global environment.
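One way to picture this kind of name lookup is R's non-standard evaluation, sketched below. The helper plot_from is hypothetical and is not the lessR implementation: it is an assumption about how a function can resolve a bare variable name first in a data frame and then in the caller's workspace.

```r
# Hypothetical helper, not part of lessR: look each variable up in the
# data frame first, then fall back to the calling environment
plot_from <- function(x, y, dframe) {
  xv <- eval(substitute(x), envir = dframe, enclos = parent.frame())
  yv <- eval(substitute(y), envir = dframe, enclos = parent.frame())
  plot(xv, yv,
       xlab = deparse(substitute(x)), ylab = deparse(substitute(y)))
  invisible(list(x = xv, y = yv))
}

d <- data.frame(X = c(1, 3, 2), Y = c(4, 6, 5))
W <- c(9, 8, 7)   # lives in the workspace, not in d

# X is found in d, W in the workspace; no d$X notation is needed
res <- plot_from(X, W, dframe = d)
```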

ADAPTIVE GRAPHICS Results are based on the standard plot and related graphic functions, with the additional provided color capabilities and other options including a center line. The plotting procedure utilizes "adaptive graphics", such that plt chooses different default values for different characteristics of the specified plot and data values. The goal is to produce a desired graph from simply relying upon the default values, both of the plt function itself, as well as the base R functions called by Plot, such as plot. Familiarity with the options permits complete control over the computed defaults, but this familiarity is intended to be optional for most situations.

TWO VARIABLE PLOT When two variables are specified to plot, by default if the values of the first variable, x, are unsorted, or if there are unequal intervals between adjacent values, or if there is missing data for either variable, a scatterplot is produced, that is, a call to the standard R plot function with type="p" for points. By default, sorted values with equal intervals between adjacent values of the first of the two specified variables yield a function plot if there is no missing data for either variable, that is, a call to the standard R plot function with type="l", which connects each adjacent pair of points with a line segment.
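The decision rule in this paragraph can be sketched in base R as follows. The helper choose_type is hypothetical and is not taken from lessR. It assumes numeric x, treats "sorted" as non-decreasing order, and uses an arbitrary tolerance when comparing the intervals.

```r
# Hypothetical helper, not part of lessR: returns the base R plot type
# implied by the rule described above
choose_type <- function(x, y, tol = 1e-8) {
  if (anyNA(x) || anyNA(y)) return("p")   # missing data: scatterplot
  if (is.unsorted(x)) return("p")         # x not sorted: scatterplot
  gaps <- diff(x)
  # unequal intervals between adjacent x values: scatterplot
  if (length(gaps) > 1 && diff(range(gaps)) > tol) return("p")
  "l"                                     # otherwise a function plot
}

x <- seq(10, 500, by = 1)
y <- 18 / sqrt(x)
type_sorted   <- choose_type(x, y)                   # equal steps of 1
type_unsorted <- choose_type(c(3, 1, 2), c(1, 2, 3)) # x not sorted
type_unequal  <- choose_type(c(1, 2, 4), c(1, 2, 3)) # gaps of 1 and 2

plot(x, y, type = type_sorted)
```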

SCATTERPLOT ELLIPSE For a scatterplot of two numeric variables, the ellipse=TRUE option draws the .95 data ellipse as computed by the dataEllipse function, written by Georges Monette and John Fox, from the car package. Usually the minimum and maximum values of the axes should be manually extended beyond their default to accommodate the entire ellipse. To accomplish this extension, use the xlim and ylim options, such as xlim=c(30,350). Obtaining the desired axes limits may involve multiple runs of the Plot function. To provide more control over the display of the data ellipse beyond the provided col.ellipse and fill.ellipse options, run the dataEllipse function directly with the plot.points=FALSE option following Plot with ellipse=FALSE, the default.

ONE VARIABLE PLOT Results are based on the standard stripchart function. Colors are provided by default and can also be specified.
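A minimal base R sketch of a dot plot with flagged outliers follows. It assumes that the box plot criteria are the usual fences at 1.5 and 3.0 times the spread of the hinges, which is inferred here from the argument names col.out15 and col.out30 and is not confirmed by this page. The data values are made up for the example.

```r
# Illustrative data: 20 regular values plus two planted extreme values
y <- c(41:60, 75, 100)

# Values beyond the box plot fences, computed by base R
beyond15  <- boxplot.stats(y, coef = 1.5)$out  # beyond 1.5 hinge spreads
severe    <- boxplot.stats(y, coef = 3)$out    # beyond 3.0 hinge spreads
potential <- setdiff(beyond15, severe)         # between the two fences

# One-dimensional scatter plot, then overplot the outliers with filled
# circles in the default outlier colors listed under Usage
stripchart(y, method = "stack", pch = 21, xlab = "y")
points(potential, rep(1, length(potential)), pch = 19, col = "firebrick4")
points(severe, rep(1, length(severe)), pch = 19, col = "firebrick2")
```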

MULTIPLE VARIABLE PLOT If the variable x is a data frame, then the data frame must contain only numeric variables. If not, the first non-numeric variable is noted and the procedure ends. Otherwise, the procedure generates the scatterplot matrix with the R pairs function as well as the correlation matrix of all the variables in the data frame with the R cor function.
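The steps in this paragraph correspond roughly to the following base R sketch. The data frame d and the wording of the error message are assumptions for illustration. Only the calls to pairs and cor are named in the text above.

```r
# Illustrative all-numeric data frame
set.seed(1)
d <- data.frame(x = rnorm(20), y = rnorm(20), z = rnorm(20))

# Stop at the first non-numeric variable, if there is one
is_num <- vapply(d, is.numeric, logical(1))
if (!all(is_num)) {
  stop("First non-numeric variable: ", names(d)[which(!is_num)[1]])
}

pairs(d)       # scatterplot matrix
r <- cor(d)    # correlation matrix of all the variables
print(round(r, 2))
```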

LIKERT DATA A scatterplot of Likert type data is problematic because there are so few possibilities for points in the scatterplot. For example, for a scatterplot of two five-point Likert items, there are only 25 possible paired values to plot, so most of the plotted points overlap with others. In this situation, that is, when there are fewer than 10 unique values for each of the two variables, a bubble plot is automatically provided, with the size of each point relative to the joint frequency of the paired data values. A sunflower plot can be requested in lieu of the bubble plot.
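To make the joint frequency idea concrete, the base R sketch below tabulates the paired responses and draws one circle per observed pair. Scaling the circle area to the frequency and the inches value of 0.25 are assumptions for the example. This page does not say how Plot scales its bubbles.

```r
# Simulated five-point Likert responses
set.seed(7)
x1 <- sample(1:5, size = 100, replace = TRUE)
x2 <- sample(1:5, size = 100, replace = TRUE)

# Joint frequency of each (x1, x2) pair, keeping only observed pairs
freq <- as.data.frame(table(x1 = x1, x2 = x2), stringsAsFactors = FALSE)
freq <- freq[freq$Freq > 0, ]
freq$x1 <- as.integer(freq$x1)
freq$x2 <- as.integer(freq$x2)

# Bubble plot: radius is sqrt(frequency), so area tracks the frequency
symbols(freq$x1, freq$x2, circles = sqrt(freq$Freq), inches = 0.25,
        bg = "lightsteelblue", xlab = "x1", ylab = "x2")

# Base R alternative: one petal per duplicated point
sunflowerplot(x1, x2)
```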

COLOR The default background color of col.bg="ghostwhite" provides a very mild cool tone with a slight emphasis on blue. The entire color theme can be specified at the system level with the lessR function set using the colors option. Or, use the same option for Plot to set the color theme just for one scatterplot. The default color theme is "blue", but a gray scale is available with "gray", and other themes are available as explained in the help for set.

Colors can also be changed for individual aspects of a scatterplot. To provide a warmer tone by slightly enhancing red, try col.bg="snow". Obtain a very light gray with col.bg="gray99". To darken the background gray, try col.bg="gray97" or lower numbers. See the lessR function showColors, which provides an example of all available named colors.
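The tones described above can be checked by looking at the RGB components of the named colors with the base R function col2rgb. The second half of the sketch draws a plain base R plot on a "ghostwhite" background with "grey90" grid lines. It only approximates the described look and is not how Plot itself draws the background.

```r
# RGB components (0 to 255) of the background colors mentioned above
bg <- c("ghostwhite", "snow", "gray99", "gray97")
rgb_values <- t(col2rgb(setNames(bg, bg)))
print(rgb_values)
# ghostwhite has its largest component in blue, snow in red,
# and the grays have three equal components

# Approximate the described look with base R graphics
plot(1:10, type = "n", xlab = "Index", ylab = "Value")
u <- par("usr")                       # plot region: x1, x2, y1, y2
rect(u[1], u[3], u[2], u[4], col = "ghostwhite", border = "black")
grid(col = "grey90", lty = "solid")
points(1:10, pch = 21, col = "darkblue")
```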

ADDITIONAL OPTIONS Commonly used graphical parameters that are available to the standard R function plot are also generally available to Plot, such as xlim, ylim, lty, lwd, cex.main, and col.main, each of which appears in the Examples.

References

Monette, G. and Fox, J., dataEllipse function from the car package.

Examples


# scatter plot
# create simulated data
# f is a two-level grouping variable; x, y and z are numeric
# put into a data frame named mydata, the default for dframe
n <- 12
f <- sample(c("Group1","Group2"), size=n, replace=TRUE)
x <- round(rnorm(n=n, mean=50, sd=10), 2)
y <- round(rnorm(n=n, mean=50, sd=10), 2)
z <- round(rnorm(n=n, mean=50, sd=10), 2)
mydata <- data.frame(f,x,y,z)
rm(f); rm(x); rm(y); rm(z)

# default scatterplot, x is not sorted so type is set to "p"
Plot(x, y)
# short name
plt(x,y)
# compare to standard R plot, which requires the mydata$ notation
plot(mydata$x, mydata$y)
# scatterplot, with ellipse and extended axes to accommodate the ellipse
Plot(x, y, ellipse=TRUE, xlim=c(20,80), ylim=c(20,80))
# scatterplot, with loess line 
Plot(x, y, fit.line="loess")
# increase span (smoothing) from default of .75
Plot(x, y, fit.line="loess", span=1)
# custom scatter plot
Plot(x, y, col.pts="darkred", col.fill="plum")
# scatter plot with a gray scale color theme 
Plot(x, y, colors="gray")

# scatterplot matrix and correlation matrix
# first remove the categorical variable f from mydata
mydata <- subset(mydata, select=c(x:z))
# now analyze remaining variables x, y and z
Plot(mydata)

# bubble plot of simulated Likert data, 1 to 7 scale
# size of each plotted point (bubble) depends on its joint frequency
# triggered by default when  < 10 unique values for each variable
x1 <- sample(1:7, size=100, replace=TRUE)
x2 <- sample(1:7, size=100, replace=TRUE)
Plot(x1,x2)
# compare to usual scatterplot of Likert data, transparency helps
Plot(x1,x2, kind="regular")
Plot(x1,x2, kind="regular", cex=3, trans.pts=.7)
# plot Likert data and get sunflower plot with loess line
Plot(x1,x2, kind="sunflower", fit.line="loess")

# scatterplot of continuous Y against categorical X, a factor
Pain <- sample(c("None", "Some", "Much", "Massive"), size=25, replace=TRUE)
Pain <- factor(Pain, levels=c("None", "Some", "Much", "Massive"), ordered=TRUE)
Cost <- round(rnorm(25,1000,100),2)
Plot(Pain, Cost)
# for this purpose, improved version of standard R stripchart
stripchart(Cost ~ Pain, vertical=TRUE)

# line chart, that is, function curve
x <- seq(10,500,by=1) 
y <- 18/sqrt(x)
# x is sorted with equal intervals so type set to "l" for line
Plot(x, y)
# custom function plot
Plot(x, y, ylab="My Y", xlab="My X", col.line="blue", 
  col.bg="snow", col.area="lightsteelblue", col.grid="lightsalmon")

# Default dot plot
Plot(y)
# Dot plot with custom colors for outliers
Plot(y, pt.reg=23, col.out15="blue", col.out30="red")

# modern art
n <- sample(2:30, size=1)
x <- rnorm(n)
y <- rnorm(n)
clr <- colors()
color1 <- clr[sample(1:length(clr), size=1)]
color2 <- clr[sample(1:length(clr), size=1)]
Plot(x, y, type="l", lty="dashed", lwd=3, col.area=color1, 
   col.line=color2, xy.ticks=FALSE, main="Modern Art", 
   cex.main=2, col.main="lightsteelblue", kind="regular",
   ncut=0)


# --------------------------------------------
# plots for data frames and multiple variables
# --------------------------------------------

# create data frame, mydata, to mimic reading data with rad function
# mydata contains both numeric and non-numeric data
mydata <- data.frame(rnorm(100), rnorm(100), rnorm(100), rep(c("A","B"),50))
names(mydata) <- c("X","Y","Z","C")

# although data not attached, access each variable directly by its name
Plot(X)
Plot(X,Y)

# variable of interest is in a data frame which is not the default mydata
# access the breaks and wool variables in the R provided warpbreaks data set
# wool is categorical with two levels, breaks is numeric
# although data not attached, access the variable directly by its name
data(warpbreaks)
Plot(wool, breaks, dframe=warpbreaks)
