Plot: Plot One or Two Continuous and/or Categorical Variables

Description

Abbreviation: sp, ScatterPlot

A scatterplot displays the values of a distribution, or the relationship between two distributions in terms of their joint values, as a set of points in an n-dimensional coordinate system, in which the coordinates of each point are the values of n variables for a single observation (row of data). From the identical syntax, for any combination of continuous or categorical variables x and y, Plot(x) or Plot(x,y), where x or y can be a vector, by default generates a family of related 1- or 2-variable scatterplots, possibly enhanced, as well as related statistical analyses. A categorical variable is either non-numeric, such as an R factor, or may be defined to consist of a small number of equally spaced integer values. The maximum number of such values to define such an integer variable as categorical is set by the n.cat parameter, with a default value of 0. That is, by default, all variables with numerical values are defined as continuous variables.

Plot is a general function, which produces a wide variety of scatterplots, outlined in the following list. The parameter definitions that follow this list are grouped, with parameters that relate to the same type of plot defined in the same group.

Plot(x,y): x and y continuous yields the traditional scatterplot of two continuous variables
Plot(x,y): x and y categorical, to solve the over-plot problem, yields a bubble (balloon) scatterplot, the size of each bubble based on the corresponding joint frequency, as a replacement for the two-dimensional bar chart
Plot(x,y): x (or y) categorical and the other variable continuous yields a scatterplot with means at each level of the categorical variable
Plot(x,y): x (or y) categorical with unique (ID) values and the other variable continuous yields a Cleveland dot plot
Plot(X,y) or Plot(x,Y): one vector variable defined by several continuous variables, paired with another single continuous variable, yields multiple scatterplots on the same graph
Plot(x): one continuous variable generates either a violin/box/scatterplot (VBS plot), introduced here, or a run chart with run=TRUE, or x can be an R time series variable for a time series chart
Plot(x): one categorical variable yields a 1-dimensional bubble plot to solve the over-plot problem, a more compact replacement of the traditional bar chart
Plot(X): one vector of continuous variables, with no y-variable, results in a scatterplot matrix
Plot(X): one vector of categorical x-variables, with no y-variable, generalizes to a matrix of 1-dimensional bubble plots, here called the bubble plot frequency matrix, to replace a series of bar charts

Represent the influence of additional categorical variables with by1 or by2 to generate Trellis plots conditioned on one or two variables from implicit calls to functions from Deepayan Sarkar's (2009) lattice package. Use by to group multiple variables on the same plot, or on multiple panels if Trellis graphics are activated. For a third variable, which is continuous, specify size for a bubble plot. By default, the values that generate the plotted points are the data, or choose other values to plot, which are statistics computed from the data, such as the mean.

Usage

Plot(x, y=NULL, data=mydata,
         values=c("data", "count", "prop", "sum", "mean", "sd",
                  "min", "median", "max"),
         n.cat=getOption("n.cat"),
         by=NULL, by1=NULL, by2=NULL,
         n.row=NULL, n.col=NULL, aspect="fill",
         size=NULL, size.cut=NULL, shape="circle", means=TRUE,
         sort.yx=FALSE, segments.y=FALSE, segments.x=FALSE,
         jitter.x=0, jitter.y=0,
         ID="row.name", ID.size=0.75,
         MD.cut=0, out.cut=0, out.shape="circle", out.size=1,
         vbs.plot="vbs", vbs.pt.fill=c("black", "default"), bw=NULL,
         vbs.size=0.9, vbs.mean=FALSE, fences=FALSE,
         k=1.5, box.adj=FALSE, a=-4, b=3,
         radius=0.25, power=0.6,
         low.fill=NULL, hi.fill=NULL, proportion=FALSE,
         smooth=FALSE, smooth.points=100, smooth.trans=0.25,
         smooth.bins=128,
         fit=FALSE, fit.se=0, ellipse=FALSE, 
         bin=FALSE, bin.start=NULL, bin.width=NULL, bin.end=NULL,
         breaks="Sturges", cumul=FALSE,
         run=FALSE, lwd=2, area=FALSE, area.origin=0, 
         center.line=c("default", "mean", "median", "zero", "off"),
         show.runs=FALSE, stack=FALSE,
         add=NULL, x1=NULL, y1=NULL, x2=NULL, y2=NULL,
         xlab=NULL, ylab=NULL, main=NULL, sub=NULL,
         xy.ticks=TRUE, value.labels=NULL, label.max=20, origin.x=NULL,
         auto=FALSE, digits.d=NULL, quiet=getOption("quiet"),
         do.plot=TRUE, width=NULL, height=NULL, pdf.file=NULL, 
         fun.call=NULL, ...)
ScatterPlot(...)
sp(...)
BoxPlot(...)
bx(...)

Arguments

x

By itself, or with y, a primary variable, plotted by default with its values mapped to coordinates. The data values can be continuous or categorical, cross-sectional or a time series. If x is sorted, with equal intervals separating the values, or is a time series, then by default the points are plotted sequentially, joined by line segments. Can specify multiple x-variables or multiple y-variables as vectors, but not both. Can be in a data frame or defined in the global environment.

y

An optional second primary variable. Variable with values to be mapped to coordinates of points in the plot on the vertical axis. Can be continuous or categorical. Can be in a data frame or defined in the global environment.

data

Optional data frame that contains one or both of x and y. Default data frame is mydata.

values

The values that are the coordinates from which to plot the points, the data values by default. If y is continuous, then for either a categorical x variable or a continuous x variable with values binned into categories, a statistic such as "mean" can be plotted instead.

n.cat

Number of categories, specifies the largest number of unique, equally spaced integer values of a variable for which the variable will be analyzed as categorical instead of continuous. Default is 0. Use to specify that such variables are to be analyzed as categorical, a kind of informal R factor.

by

A categorical variable, a grouping variable, to provide a scatterplot of the numeric primary variables x and y for each of its levels on the same plot. For two-variable plots, applies to the panels of a Trellis graphic if by1 is specified.

by1

A categorical variable called a conditioning variable that activates Trellis graphics, provided by Deepayan Sarkar's (2009) lattice package, to provide a separate scatterplot (panel) of numeric primary variables x and y for each level of the variable.

by2

A second conditioning variable to generate Trellis plots jointly conditioned on both the by1 and by2 variables, with by2 as the row variable, which yields a scatterplot (panel) of the numeric x and y variables for each cross-classification of the levels of by1 and by2.

n.row

Optional specification for the number of rows in the layout of a multi-panel display with Trellis graphics. Specify n.col or n.row, but not both.

n.col

Optional specification for the number of columns in the layout of a multi-panel display with Trellis graphics. Specify n.col or n.row, but not both. If set to 1, then the strip that labels each group locates to the left of each plot instead of the top.

aspect

Lattice parameter for the aspect ratio of the panels, defined as height divided by width. The default value is "fill" to have the panels expand to occupy as much space as possible. Set to 1 for square panels. Set to "xy" to specify a ratio calculated to "bank" to 45 degrees, that is, with the line slope approximately 45 degrees.

size

When set to a constant, the scaling factor for standard points (not bubbles) or a line, with default of 1.0 for points and 2.0 for a line. Set to 0 to not plot the points or lines. When set to a variable, activates a bubble plot with the size of each bubble further determined by the value of radius. Applies to the standard two-variable scatterplot as well as to the scatterplot component of the integrated Violin-Box-Scatterplot (VBS) of a single continuous variable.

size.cut

If TRUE (or 1), then for a bubble plot of two variables in which the bubble sizes are defined by a size variable, show the value of the sizing variable for selected bubbles in the center of the bubbles, unless the bubble is too small. If FALSE, no value is displayed. If a number greater than 1, then display the value only for the corresponding quantiles, such as just the max and min for a setting of 2, the default value when bubbles represent a size variable. Color of the displayed text set by bubble.text.

shape

The plot character(s). The default value is a circle with both a border color and a filled area, specified with color and fill. Possible values are circle, square, diamond, triup (triangle up), tridown (triangle down), all uppercase and lowercase letters, all digits, and most punctuation characters. The numbers 21 through 25 as defined by the R points function also apply. If plotting levels according to by, then list one shape for each level to be plotted.

means

If one variable is categorical and the other variable continuous, then if TRUE, the default, plot means with the scatterplot. Also applies to a 1-D scatterplot.

sort.yx

Sort the values of y by the values of x, such as for a Cleveland dot plot, that is, a numeric x-variable paired with a categorical y-variable with unique values. If x is a vector of two variables, sort by their difference.

segments.y

For one x-variable, draw a line segment from the y-axis to each plotted point, such as for the Cleveland dot plot. For two x-variables, the line segments connect the two points.

segments.x

Draw a line segment from the x-axis for each plotted point.

jitter.x

Randomly perturbs the plotted points of a scatterplot horizontally according to an internally computed formula, or can be explicitly specified.

jitter.y

Randomly perturbs the plotted points of a scatterplot vertically according to an internally computed formula, or can be explicitly specified.

ID

Name of the variable that provides the labels for the plotted points, the row names of the data table (frame) by default.

ID.size

Size of the plotted labels, with a default of 0.75 according to the R parameter cex. Modify text color of the labels with the style function parameter ID.color.

MD.cut

Mahalanobis distance cutoff to define an outlier in a 2-variable scatterplot.

out.cut

Count or proportion of plotted points to label, in order of their distance from the scatterplot center (means), counting down from the most extreme point. For two-variable plots, assess distance from the center with Mahalanobis distance. For VBS plots of a single continuous variable, refers to outliers on each side of the plot.

out.shape

Shape of outlier points in a 2-variable scatterplot or a VBS plot. Modify fill color from the current theme with the style function parameters out.fill and out2.fill.

out.size

Size of outlier points in a 2-variable scatterplot or VBS plot.

vbs.plot

A character string that specifies the components of the integrated Violin-Box-Scatterplot (VBS) of a continuous variable. A "v" in the string indicates a violin plot, a "b" indicates a box plot with flagged outliers, and a "s" indicates a 1-variable scatterplot. Default value is "vbs". The characters can be in any order and upper- or lower-case. Generalize to Trellis plots with the by1 and by2 parameters, but currently only applies to horizontal displays. Modify fill and border colors from the current theme with the style function parameters violin.fill, violin.color, box.fill and box.color.

vbs.pt.fill

Points in a VBS scatterplot are black by default because the background is the violin, which is based on the current theme color. To use the values for pt.fill and pt.color specified by the style function, set to "default".

bw

Bandwidth for the smoothness of the violin plot. Higher values yield smoother plots. Default is to calculate a bandwidth that provides a relatively smooth density plot.

vbs.size

Width of the violin plot relative to the plot area. Make the violin (and also the accompanying box plot) larger or smaller by making the plot area and/or this value larger or smaller.

vbs.mean

Show the mean on the box plot with a strip the color of out.fill, which can be changed with the style function.

fences

If TRUE, draw the inner upper and lower fences as dotted line segments.

k

IQR multiplier for the basis of calculating the distance of the whiskers of the box plot from the box. Default is Tukey's setting of 1.5.

box.adj

Adjust the box and whiskers, and thus outlier detection, for skewness using the medcouple statistic as the robust measure of skewness according to Hubert and Vandervieren (2008).

a, b

Scaling factors for the adjusted box plot to set the length of the whiskers. If explicitly set, activates box.adj.

radius

Scaling factor of the bubbles in a bubble plot, which sets the radius of the largest displayed bubble in inches, with default of 0.25 inches. To activate, set the value of size to a third variable, which sets the size of a bubble according to the size of the third variable. Or activate when the values of the variables are categorical, either a factor or an integer variable with the number of unique values no larger than n.cat, in which case the size of the bubbles represents frequency.

power

Relative size of the scaling of the bubbles to each other. Value of 0.5 scales the bubbles so that the area of each bubble is the value of the corresponding sizing variable. Value of 1 scales so the radius of the bubble is the value of the sizing variable, increasing the discrepancy of size between the variables. The default value is 0.6.

low.fill

For a categorical variable and the resulting bubble plot, or a matrix of these plots, sets a color gradient of the fill color beginning with this color.

hi.fill

For a categorical variable and the resulting bubble plot, or a matrix of these plots, sets a color gradient of the fill color ending with this color.

proportion

Specify proportions, relative frequencies, instead of counts. For a two variable bubble chart, if TRUE then to facilitate group comparisons, displays the proportion of data values by fill variable within each group.

smooth

Smoothed density plot for two numerical variables. By default, set to TRUE for 2500 or more rows of data.

smooth.points

Number of points superimposed on the density plot in the areas of the lowest density to help identify outliers, which controls how dark are the smoothed points.

smooth.trans

Exponent of the function that maps the density scale to the color scale.

smooth.bins

Number of bins in both directions for the density estimation.

fit

The best fit line. Default value is FALSE, with options for "loess" and for least squares, indicated by "ls". Or, if set to TRUE, then a loess line. If potential outliers are identified according to out.cut, a second (dashed) fit line is displayed calculated without the outliers. Modify the color from the current theme with the style function parameter fit.color.

fit.se

Confidence level for the error band displayed around the line of best fit. The default value of 0 turns off the standard error plot. Can be a vector to display multiple ranges.

ellipse

If TRUE, enclose a scatterplot of only a single x-variable and a single y-variable with the default .95 data ellipse, the contours of the corresponding bivariate normal density function. Or, can specify a single or vector of numeric values greater than 0 and less than 1, to plot one or more specified ellipses. For Trellis graphics, only the maximum level applies with only one ellipse per panel. Modify fill and border colors from the current theme with the style function parameters ellipse.fill and ellipse.color.

bin

If TRUE, display the default frequency distribution for the text output of the Violin-Box-Scatter (VBS) Plot, or, if values is set to "count", a frequency polygon.

bin.start

Optional specified starting value of the bins for a frequency polygon or for the text output of a Violin-Box-Scatter (VBS) Plot. Also, sets bin to TRUE.

bin.width

Optional specified bin width value. Also, sets bin to TRUE.

bin.end

Optional specified value that is within the last bin, so the actual endpoint of the last bin may be larger than the specified value.

breaks

The method for calculating the bins, or an explicit specification of the bins, such as with the standard R seq function or other options provided by the hist function. Also, sets bin to TRUE.

cumul

Specify a cumulative frequency polygon.

run

If set to TRUE, generate a run chart, i.e., line chart, in which points are plotted in the sequential order of occurrence in the data table. By default, the points are connected by line segments to form a run chart. Set by default when the x-values are sorted with equal intervals or a single variable is a time series. Customize the color of the line segments with segments.color with function style.

lwd

Width of the line segments. Set to zero to remove the line segments.

area

If FALSE, no color is filled for the area under a plotted line from a run chart or time series. Usual default is FALSE, but default is TRUE if multiple time series are plotted on the same panel. Default fill color is bar.fill. Select a custom color with area.fill with function style.

area.origin

Origin for the filled area under the time series line. Values less than this value are below the corresponding reference line, values larger are above the line.

center.line

Plots a dashed line through the middle of a run chart. The two possible values for the line are "mean" and "median". Provides a center line for the "median" by default, when the values randomly vary about the mean. A value of "zero" specifies the center line should go through zero. Currently does not apply to Trellis plots.

show.runs

If TRUE, display the individual runs in the run analysis. Also, sets run to TRUE.

stack

If TRUE, multiple time plots are stacked on each other, with area set to TRUE by default.

add

Draw one or more objects, text or a geometric figure, on the plot. Possible values are any text to be written, specified as the first argument, or, to indicate a figure, "rect" (rectangle), "line", "arrow", "v.line" (vertical line), and "h.line" (horizontal line). The value "means" is short-hand for vertical and horizontal lines at the respective means. Does not apply to Trellis graphics. Customize with parameters such as add.fill and add.color from the style function.

x1

First x coordinate to be considered for each object, can be "mean.x". Not used for "h.line".

y1

First y coordinate to be considered for each object, can be "mean.y". Not used for "v.line".

x2

Second x coordinate to be considered for each object, can be "mean.x". Only used for "rect", "line" and "arrow".

y2

Second y coordinate to be considered for each object, can be "mean.y". Only used for "rect", "line" and "arrow".

xlab, ylab

Axis label for x-axis or y-axis. If not specified, then the label becomes the name of the corresponding variable label if it exists, or, if not, the variable name. If xy.ticks is FALSE, no ylab is displayed. Customize these and related parameters with parameters such as lab.color from the style function.

main

Label for the title of the graph. If the corresponding variable labels exist, then the title is set by default from the corresponding variable labels.

sub

Sub-title of graph, below xlab.

xy.ticks

Flag that indicates if tick marks and associated axis values on the axes are to be displayed. To rotate the axis values, use rotate.x, rotate.y, and offset from the style function.

value.labels

Labels for the x-axis on the graph to override existing data values, including factor levels. If the variable is a factor and value.labels is not specified (is NULL), then the value.labels are set to the factor levels with each space replaced by a new line character. If x and y-axes have the same scale, they also apply to the y-axis.

label.max

Maximum size of labels for the values of a categorical variable. Not a literal maximum as preserving unique values may require a larger number of characters than specified.

origin.x

Origin of x-axis. Particularly useful for plots of count, etc, where the origin will be zero by default, but can be modified. Otherwise the origin of the plot is based on the minimum value of x.

auto

For a two-variable scatterplot, if TRUE, automatically add the 0.95 data ellipse, labeling of outliers beyond a Mahalanobis distance of 6 from the ellipse center, the best-fitting least squares line of all the data, the best-fitting least squares line of the regular data without the outliers, and a horizontal and vertical line to represent the mean of each of the two variables.

digits.d

Number of significant digits for each of the displayed summary statistics.

quiet

If set to TRUE, no text output. Can change system default with style function.

do.plot

If TRUE, the default, then generate the plot.

width

Width of the plot window in inches, defaults to 5 except in RStudio to maintain an approximate square plotting area.

height

Height of the plot window in inches, defaults to 4.5 except for 1-D scatterplots and when in RStudio.

pdf.file

Name of the pdf file to which the graphics are directed.

fun.call

Function call. Used with knitr to pass the function call when obtained from the abbreviated function call sp.

...

Other parameter values for non-Trellis graphics as defined by and processed by standard R functions plot and par, including
xlim and ylim for setting the range of the x and y-axes
cex.main for the size of the title
col.main for the color of the title
cex for the size of the axis value labels
sub and col.sub for a subtitle and its color

Value

The output can optionally be saved into an R object, otherwise it simply appears in the console. The output here is just for the outlier analysis of the two-variable scatterplot with continuous variables. The outlier identification must be activated for the analysis, such as from parameter MD.cut.

READABLE OUTPUT
out_outlier: Mahalanobis Distance of each outlier.

STATISTICS
outliers_indices: Location of the outliers in the x and y vectors.

Details

VARIABLES and TRELLIS PLOTS There is at least one primary variable, x, which defines the coordinate system for plotting in terms of the x-axis, the horizontal axis. Plots may also specify a second primary variable, y, which defines the y-axis of the coordinate system. One of these primary variables may be a vector. The simplest plot is from the specification of only one or two primary variables, each as a single variable, which generates a single scatterplot of either one or two variables, necessarily on a single plot, called a panel, defined by a single x-axis and usually a single y-axis.

For numeric primary variables, a single panel may also contain multiple scatterplots, of two types. Form the first type from subsets of observations (rows of data) based on values of a categorical variable. Specify this plot with the by parameter, which identifies the grouping variable to generate a scatterplot of the primary variables for each of its levels. The points for each group are plotted with a different shape and/or color. By default, the colors vary, though to maintain the color scheme, if there are only two levels of the grouping variable, the points for one level are filled with the current theme color and the points for the second level are plotted with transparent interiors.

Or, obtain multiple scatterplots on the same panel with multiple numeric x-variables, or multiple y-variables. To obtain this graph, specify one of the primary variables as a vector of multiple variables.

Trellis graphics, from Deepayan Sarkar's (2009) lattice package, may be implemented in which multiple panels for one numeric x-variable and one numeric y-variable are displayed according to the levels of one or two categorical variables, called conditioning variables. A variable specified with by1 is a conditioning variable that results in a Trellis plot, the scatterplot of x and y produced at each level of the by1 variable. The inclusion of a second conditioning variable, by2, results in a separate scatterplot panel for each combination of cross-classified values of both by1 and by2. A grouping variable according to by may also be specified, which is then applied to each panel.

Control the panel dimensions and the overall size of the Trellis plot with the following parameters: width and height for the physical dimensions of the plot window, n.row and n.col for the number of rows and columns of panels, and aspect for the ratio of the height to the width of each panel. The plot window is the standard graphics window that displays on the screen, or it can be specified as a pdf file with the pdf.file parameter.

CATEGORICAL VARIABLES Conceptually, there are continuous variables and categorical variables. Categorical variables have relatively few unique data values. However, categorical variables can be defined with non-numeric values, but also with numeric values, such as responses to a five-point Likert scale from Strongly Disagree to Strongly Agree, with responses coded 1 to 5. The three by-variables -- by1, by2 and by -- only apply to graphs created with numeric x and y variables, continuous or categorical.

The standard and most general way to define a categorical variable is as an R factor, illustrated in the examples for the Transform function. lessR provides the option to define an integer variable with equally spaced values as categorical based on the value of n.cat, which can be set locally or globally with the style function. For example, for a variable with data values from a 5-point Likert scale, a value of n.cat of 5 will define the variable as categorical. The default value is 0. To explicitly analyze the values as categorical, set n.cat to a value larger than 0, at least as large as the number of unique integer values. Can also annotate a graph of the values of an integer categorical variable with the value.labels option.
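The n.cat rule can be sketched in base R. This is only an illustration of the rule as described here, not the lessR implementation; the helper name is_categorical is hypothetical.

```r
# Hypothetical helper, not part of lessR: TRUE when x would be treated as
# categorical under the n.cat rule described in the text.
is_categorical <- function(x, n.cat = 0) {
  if (is.factor(x) || is.character(x)) return(TRUE)   # non-numeric values
  if (!is.numeric(x)) return(FALSE)
  u <- sort(unique(x[!is.na(x)]))
  if (length(u) == 0 || length(u) > n.cat) return(FALSE)
  if (any(u != round(u))) return(FALSE)                # integer values only
  gaps <- diff(u)
  length(gaps) == 0 || all(gaps == gaps[1])            # equally spaced
}

likert <- c(1, 2, 3, 4, 5, 3, 2, 4)        # made-up Likert responses
is_categorical(likert)                     # FALSE under the default n.cat = 0
is_categorical(likert, n.cat = 5)          # TRUE: five equally spaced integers
is_categorical(c(1, 2, 4), n.cat = 5)      # FALSE: unequal spacing
```

With the default n.cat = 0 no numeric variable passes the check, which matches the statement that all numeric variables are continuous by default.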

A scatterplot of Likert type data is problematic because there are so few possibilities for points in the scatterplot. For example, for a scatterplot of two variables with five-point Likert responses, there are only 25 possible paired values to plot, so most of the plotted points overlap with others. In this situation, that is, when a single variable or two variables with Likert response scales are specified, a bubble plot is automatically provided, with the size of each point relative to the joint frequency of the paired data values. To request a sunflower plot in lieu of the bubble plot, set the shape to "sunflower".
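The over-plotting problem and the joint frequencies that a bubble plot encodes can be seen with base R alone. The data below are simulated for illustration, and the sketch does not reproduce how lessR computes or draws the bubbles.

```r
set.seed(1)
# Simulated (made-up) responses to two five-point Likert items
item1 <- factor(sample(1:5, 200, replace = TRUE), levels = 1:5)
item2 <- factor(sample(1:5, 200, replace = TRUE), levels = 1:5)

joint <- table(item1, item2)   # 5 x 5 table of joint frequencies
length(joint)                  # number of possible paired values
sum(joint)                     # all observations share those positions

# One row per cell: the count is what the size of a bubble would encode
bubbles <- as.data.frame(joint, responseName = "count")
head(bubbles)
```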

DATA The default input data frame is mydata. Specify another name with the data option. Regardless of its name, the data frame need not be attached to reference the variables directly by its name, that is, no need to invoke the mydata$name notation. The referenced variables can be in the data frame and/or the user's workspace, the global environment.

The data values themselves can be plotted, or for a single variable, counts or proportions can be plotted on the y-axis. For a categorical x-variable paired with a continuous variable, means and other statistics can be plotted at each level of the x-variable. If x is continuous, it is binned first, with the standard Histogram binning parameters available, such as bin.width, to override default values. The values parameter sets the values to plot, with data the default. By default, the connecting line segments are provided, so a frequency polygon results. Turn off the lines by setting lwd=0.
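As a rough base R sketch of what plotting a statistic instead of the data means, the following computes one value per level of a categorical x, and one value per bin of a continuous x. The variable names and data are made up, and the binning lessR performs internally may differ from the cut call used here.

```r
set.seed(2)
# Made-up data: a categorical x, a continuous x, and a continuous y
dept   <- factor(rep(c("A", "B", "C"), each = 10))
years  <- runif(30, 0, 20)
salary <- 40 + 2 * years + rnorm(30, sd = 5)

# Idea behind values="mean": one plotted value per level of the categorical x
tapply(salary, dept, mean)

# Continuous x: bin first (here with a bin width of 5), then summarize y per bin
bins <- cut(years, breaks = seq(0, 20, by = 5), include.lowest = TRUE)
tapply(salary, bins, mean)

# Idea behind values="count": frequencies per bin, the heights of a
# frequency polygon
table(bins)
```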

VALUE LABELS The value labels for each axis can be over-ridden from their values in the data to user supplied values with the value.labels option. This option is particularly useful for Likert-style data coded as integers. Then, for example, a 0 in the data can be mapped into a "Strongly Disagree" on the plot. These value labels apply to integer categorical variables, and also to factor variables. To enhance the readability of the labels on the graph, any blanks in a value label translate into a new line in the resulting plot. Blanks are also transformed as such for the labels of factor variables.

VARIABLE LABELS Although standard R does not provide for variable labels, lessR can store the labels in the data frame with the data, obtained from the Read function or VariableLabels. If variable labels exist, then the corresponding variable label is by default listed as the label for the corresponding axis and on the text output.

ONE VARIABLE PLOT The one variable plot of one continuous variable generates either a violin/box/scatterplot (VBS plot), or a run chart with run=TRUE, or x can be an R time series variable for a time series chart. For the box plot, for gray scale output potential outliers are plotted with squares and outliers are plotted with diamonds, otherwise shades of red are used to highlight outliers. The default definition of outliers is based on the standard boxplot rule of values more than 1.5 IQR's from the box. The definition of outliers may be adjusted (Hubert and Vandervieren, 2008), such that the whiskers are computed from the medcouple index of skewness (Brys, Hubert, & Struyf, 2004).
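The standard 1.5 IQR rule can be worked through by hand in base R. The data are made up, and the quantile-based fences below are a simplification: base R's boxplot.stats uses hinges rather than quartiles, so the fences can differ slightly, and the medcouple adjustment is not shown.

```r
x <- c(2, 4, 4, 5, 6, 7, 8, 9, 10, 30)    # made-up values, 30 is extreme

k   <- 1.5                                 # Tukey's multiplier, as for parameter k
q   <- quantile(x, c(0.25, 0.75))
iqr <- unname(q[2] - q[1])
lower <- unname(q[1]) - k * iqr            # lower fence
upper <- unname(q[2]) + k * iqr            # upper fence

flagged <- x[x < lower | x > upper]        # values beyond the fences
flagged

# The hinge-based rule in base R flags the same value for these data
boxplot.stats(x, coef = k)$out
```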

The plot can also be obtained as a bubble plot of frequencies for a categorical variable.

TWO VARIABLE PLOT When two variables are specified to plot, by default if the values of the first variable, x, are unsorted, or if there are unequal intervals between adjacent values, or if there is missing data for either variable, a scatterplot is produced from a call to the standard R plot function. By default, sorted values with equal intervals between adjacent values of the first of the two specified variables yields a function plot if there is no missing data for either variable, that is, a call to the standard R plot function with type="l", which connects each adjacent pair of points with a line segment.

Specifying multiple, continuous x-variables against a single y variable, or vice versa, results in multiple plots on the same graph. The color of the points of the second variable is the same as that of the first variable, but with a transparent fill. For more than two x-variables, multiple colors are displayed, one for each x-variable.

BUBBLE PLOT FREQUENCY MATRIX (BPFM) Multiple categorical variables for x may be specified in the absence of a y variable. A bubble plot results that illustrates the frequency of each response for each of the variables in a common figure in which the x-axis contains all of the unique labels for all of the variables plotted. Each line of information, the bubbles and counts for a single variable, replaces the standard bar chart in a more compact display. Usually the most meaningful when each variable in the matrix has the same response categories, that is, levels, such as for a set of shared Likert scales. The BPFM is a considerably more condensed presentation of frequencies for a set of variables than the corresponding bar charts.

SCATTERPLOT MATRIX A single vector of continuous variables specified as x, with no y-variable, generates a scatterplot matrix of the specified variables. A continuous variable is defined as a numeric variable with more than n.cat unique responses. To force an item with a small number of unique responses, such as from a 5-pt Likert scale, to be treated as continuous, set n.cat to a number lower than 5, such as n.cat=0 in the function call.

The scatterplot matrix is displayed according to the current color theme. Specific colors such as fill, color, etc. can also be provided. The upper triangle shows the correlation coefficient, and the lower triangle each corresponding scatterplot, with, by default, the non-linear loess best fit line. The fit option can be used to provide the linear least squares line instead, along with the corresponding fit.color for the color of the fit line.

SIZE VARIABLE The size parameter activates a bubble plot in which the size of each bubble is determined by the corresponding value of size, which can be a numerical variable or a constant.

To explicitly vary the shapes, use shape and a list of shape values in the standard R form with the c function to combine a list of values, one specified shape for each group, as shown in the examples. To explicitly vary the colors, use fill, such as with R standard color names. If fill is specified without shape, then colors are varied, but not shapes. To vary both shapes and colors, specify values for both options, always with one shape or color specified for each level of the by variable.

Shapes beyond the standard list of named shapes, such as "circle", are also available as single characters. Any single letter, uppercase or lowercase, any single digit, and the characters "+", "*" and "#" are available, as illustrated in the examples. In the use of shape, either use standard named shapes, or individual characters, but not both in a single specification.

SCATTERPLOT ELLIPSE For a scatterplot of two numeric variables, the ellipse=TRUE option draws the .95 data ellipse as computed by the ellipse function, written by Duncan Murdoch and E. D. Chow, from the ellipse package. The axes are automatically lengthened to provide space for the entire ellipse that extends beyond the maximum and minimum data values. The specific level of the ellipse can be specified with a numerical value in the form of a proportion. Multiple numerical values of ellipse may also be specified to obtain multiple ellipses.
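
For readers who want to see what a data ellipse at a given level is, the following base R sketch computes its boundary directly. It assumes the usual chi-square radius with 2 degrees of freedom and does not call the ellipse package. The function name data_ellipse is hypothetical.

```r
# Boundary of the data ellipse at a given level
data_ellipse <- function(x, y, level = 0.95, npoints = 100) {
  centre <- c(mean(x), mean(y))
  S <- cov(cbind(x, y))
  radius <- sqrt(qchisq(level, df = 2))
  angle <- seq(0, 2 * pi, length.out = npoints)
  circle <- cbind(cos(angle), sin(angle))
  # chol(S) is the upper triangular R with t(R) %*% R equal to S
  pts <- radius * circle %*% chol(S)
  sweep(pts, 2, centre, "+")
}

e95 <- data_ellipse(faithful$eruptions, faithful$waiting, level = 0.95)

# Lengthen the axes so that the whole ellipse fits
plot(faithful$eruptions, faithful$waiting,
     xlim = range(e95[, 1], faithful$eruptions),
     ylim = range(e95[, 2], faithful$waiting))
lines(e95)
```

Every point on the computed boundary has the same squared Mahalanobis distance from the means, equal to qchisq(level, 2).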

TIME CHARTS Specifying one or more x-variables with no y-variables, and run=TRUE plots the x-variables in a run chart. The values of the specified x-variable are plotted on the y-axis, with Index on the x-axis. Index is the ordinal position of each data value, from 1 to the number of values.
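
The Index axis of a run chart is simple to reproduce in base R, as in this sketch. The data values are made up for illustration.

```r
salary <- c(52, 61, 48, 75, 66, 58, 70)   # made-up values
index <- seq_along(salary)                # ordinal position, 1 to n
plot(index, salary, type = "b", xlab = "Index", ylab = "salary")
```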

If the specified x-variable is of type Date, or is a time series, a time series plot is generated for each specified variable. To plot a formal R time series, univariate or multivariate, specify it as the x-variable. Alternatively, specify an x-variable of type Date, and then specify the y-variable as one or more time series to plot. The y-variable can be formatted as tidy data with all the values in a single column, or as wide-formatted data with the time-series variables in separate columns.
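
The difference between the two layouts is sketched below in base R. The data are simulated, and the column names series and value are chosen for this illustration only.

```r
date <- seq(as.Date("2013/1/1"), as.Date("2016/1/1"), by = "quarter")
set.seed(2)

# Wide layout: one column per time series
wide <- data.frame(date, x1 = rnorm(13, 100, 15), x2 = rnorm(13, 100, 15))

# Tidy layout: one row per date and series, all values in a single column
tidy <- data.frame(
  date   = rep(wide$date, times = 2),
  series = rep(c("x1", "x2"), each = nrow(wide)),
  value  = c(wide$x1, wide$x2)
)
head(tidy)
```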

2-D KERNEL DENSITY With smooth=TRUE, the R function smoothScatter is invoked according to the current color theme. This display is useful for very large data sets. The smooth.points parameter specifies the number of points to plot from the regions of the lowest density. The smooth.bins parameter specifies the number of bins in both directions for the density estimation. The smooth.trans parameter specifies the exponent in the function that maps the density scale to the color scale to allow customization of the intensity of the plotted gradient colors. Higher values result in less color saturation, de-emphasizing points from regions of lesser density. These parameters are passed directly to the smoothScatter parameters nrpoints, nbin and transformation, respectively. Grid lines are turned off, but can be displayed by setting the grid.color parameter.
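
The underlying base R call can be sketched as follows. The data are simulated, the variable names n_points, n_bins and expo are hypothetical stand-ins for smooth.points, smooth.bins and smooth.trans, and the values shown are arbitrary rather than the Plot defaults. The smoothScatter function relies on the KernSmooth package, which is normally installed with R.

```r
set.seed(3)
x <- rnorm(5000)
y <- x + rnorm(5000)

n_points <- 50     # plays the role of smooth.points
n_bins   <- 128    # plays the role of smooth.bins
expo     <- 0.25   # plays the role of smooth.trans

smoothScatter(x, y,
              nrpoints = n_points,
              nbin = n_bins,
              transformation = function(d) d^expo)
```

Because the estimated densities are small positive numbers, a larger exponent pushes the low-density regions further toward the light end of the color ramp.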

COLORS A color theme for all the colors can be chosen for a specific plot with the colors option with the lessR function style. The default color theme is "lightbronze". A gray scale is available with "gray", and other themes are available as explained in style, such as "sienna" and "darkred". Use the option style(sub.theme="black") for a black background and partial transparency of plotted colors.

Colors can also be changed for individual aspects of a scatterplot with the style function. To provide a warmer tone by slightly enhancing red, try a background color such as panel.fill="snow". Obtain a very light gray with panel.fill="gray99". To darken the background gray, try panel.fill="gray97" or lower numbers. See the lessR function showColors, which provides an example of all available named R colors with their RGB values.

For the color options, such as violin.color, the value of "off" is the same as "transparent".

The default qualitative color chart is a re-arrangement of the colors of Set3 from Neuwirth's RColorBrewer package (2014).

ANNOTATIONS Use the add and related parameters to annotate the plot with text and/or geometric figures. Each object is placed according to from one to four corresponding coordinates, the required coordinates to plot that object, as shown in the following table. x-coordinates may have the value of "mean.x" and y-coordinates may have the value of "mean.y".

Value	Object	Required Coordinates
-----------	-------------------	----------------
text	text	x1, y1
`"rect"`	rectangle	x1, y1, x2, y2
`"line"`	line segment	x1, y1, x2, y2
`"arrow"`	arrow	x1, y1, x2, y2
`"v.line"`	vertical line	x1
`"h.line"`	horizontal line	y1
`"means"`	horiz, vert lines
-----------	-------------------	----------------

The value of add specifies the object. For a single object, enter a single value. Then specify the value of the needed corresponding coordinates, as specified in the above table. For multiple placements of that object, specify vectors of corresponding coordinates. To annotate multiple objects, specify multiple values for add as a vector. Then list the corresponding coordinates, for up to each of four coordinates, in the order of the objects listed in add. See the examples for more explanation.
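
The pairing of named objects with their required coordinates can be written down as a small lookup in base R. This is only a sketch: the list required copies the named objects from the table above, and the helper missing_coords is hypothetical, not part of lessR.

```r
# Required coordinates for the named objects, copied from the table above
required <- list(
  rect   = c("x1", "y1", "x2", "y2"),
  line   = c("x1", "y1", "x2", "y2"),
  arrow  = c("x1", "y1", "x2", "y2"),
  v.line = "x1",
  h.line = "y1",
  means  = character(0)
)

# Hypothetical helper: which required coordinates were not supplied?
missing_coords <- function(object, ...) {
  supplied <- names(list(...))
  setdiff(required[[object]], supplied)
}

missing_coords("rect", x1 = 12, y1 = 80000)   # "x2" "y2"
missing_coords("v.line", x1 = 2)              # character(0)
```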

Vectors of different properties, such as add.color, can also be specified. That is, different objects can have different colors, different transparency levels, etc.

PDF OUTPUT To obtain pdf output, use the pdf.file option, perhaps with the optional width and height options. These files are written to the default working directory, which can be explicitly specified with the R setwd function.
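
As a base R analogue, and not necessarily what Plot does internally, the following sketch writes a plot to a pdf file with an explicit width and height. The file name scatter.pdf is made up, and the file is written to a temporary directory rather than to the working directory.

```r
out_file <- file.path(tempdir(), "scatter.pdf")   # hypothetical file name
pdf(out_file, width = 5, height = 4)               # size in inches
plot(faithful$eruptions, faithful$waiting)
dev.off()                                          # close the file
file.exists(out_file)
```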

ADDITIONAL OPTIONS Commonly used graphical parameters that are available to the standard R function plot are also generally available to Plot, such as:

cex.main, col.lab, font.sub, etc.: Settings for main- and sub-title and axis annotation, see title and par.
main: Title of the graph, see title.
xlim: The limits of the plot on the x-axis, expressed as c(x1,x2), where x1 and x2 are the limits. Note that x1 > x2 is allowed and leads to a reversed axis.
ylim: The limits of the plot on the y-axis.
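
The reversed-axis behavior of xlim can be checked with the standard R plot function, as in this short sketch with made-up data.

```r
x <- 1:10
y <- x^2
plot(x, y, xlim = c(10, 1), main = "Reversed x-axis")

# Limits of the plotting region: x-left, x-right, y-bottom, y-top
usr <- par("usr")
usr[1] > usr[2]   # the left edge has the larger x-value
```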

ONLY VARIABLES ARE REFERENCED A referenced variable in a lessR function can only be a variable name. This referenced variable must exist in either the referenced data frame, such as the default mydata, or in the user's workspace, more formally called the global environment. That is, expressions cannot be directly evaluated. For example:

    > Plot(rnorm(50), rnorm(50))   # does NOT work

Instead, do the following:

    > X <- rnorm(50)   # create vector X in user workspace
    > Y <- rnorm(50)   # create vector Y in user workspace
    > Plot(X,Y)     # directly reference X and Y

References

Brys, G., Hubert, M., & Struyf, A. (2004). A robust measure of skewness. Journal of Computational and Graphical Statistics, 13(4), 996-1017.

Neuwirth, E. (2014). RColorBrewer: ColorBrewer Palettes. R package version 1.1-2. https://CRAN.R-project.org/package=RColorBrewer

Murdoch, D., and Chow, E. D. (2013). ellipse function from the ellipse package.

Gerbing, D. W. (2014). R Data Analysis without Programming, Chapter 8, NY: Routledge.

Hubert, M., and Vandervieren, E. (2008). An adjusted boxplot for skewed distributions. Computational Statistics and Data Analysis, 52, 5186-5201.

Sarkar, Deepayan (2008) Lattice: Multivariate Data Visualization with R, Springer. http://lmdvr.r-forge.r-project.org/

Examples


# NOT RUN {
# read the data
mydata <- rd("Employee", format="lessR", quiet=TRUE)
mydata <- Subset(random=.6, quiet=TRUE)  # less computationally intensive

#---------------------------------------------------
# traditional scatterplot with two numeric variables
#---------------------------------------------------

# scatterplot with all defaults
Plot(Years, Salary)
# or use abbreviation sp in place of Plot
# or use full expression ScatterPlot in place of Plot

# maximum information, minimum input: scatterplot +
#  outliers, ellipse, least-squares line with and w/o outliers, means
Plot(Years, Salary, auto=TRUE)

# plot 0.95 data ellipse with the points identified that represent
#   outliers defined by a Mahalanobis Distance larger than 6 
# save outliers into R object out, then remove from mydata
mydata[1, "Salary"] <- 200000
out <- Plot(Years, Salary, ellipse=0.95, MD.cut=6)
mydata <- mydata[-out$outlier_indices,]

# new shape and point size, no grid or background color
# then put style back to default
#style(panel.fill="powderblue", grid.color="off")
#Plot(Years, Salary, size=2, shape="diamond")
#style()

# translucent data ellipses without points or edges
#  show the idealized joint distribution for bivariate normality
#style(ellipse.color="off")
#Plot(Years, Salary, size=0, ellipse=seq(.1,.9,.10))

# bubble plot with size determined by the value of Pre
# display the value for the bubbles with values of  min, median and max
#Plot(Years, Salary, size=Pre, size.cut=3)

# variables of interest are in a data frame not the default mydata
# plot 0.6 and 0.9 data ellipses
# change color theme to gold with black background
#style("gold", sub.theme="black")
#Plot(eruptions, waiting, ellipse=c(.6,.9), data=faithful)

# scatterplot with two x-variables, plotted against Salary
# define a new style, then back to default
#style(window.fill=rgb(247,242,230, maxColorValue=255),
# panel.fill="off", panel.color="off", pt.fill="black", trans=0,
# lab.color="black", axis.text.color="black",
# axis.y.color="off", grid.x.color="off", grid.y.color="black",
# grid.lty="dotted", grid.lwd=1)
#Plot(c(Pre, Post), Salary)
#style()

# increase span (smoothing) from default of .7 to 1.25
# span is a loess parameter, which generates a caution that can be
#   ignored that it is not a graphical parameter -- we know that
# display confidence intervals about best-fit line at
#   0.95 confidence level
#Plot(Years, Salary, fit="loess", span=1.25, fit.se=0.95)

# 2-D kernel density (more useful for larger sample sizes) 
#Plot(Years, Salary, smooth=TRUE)


#------------------------------------------------------
# scatterplot matrix from a vector of numeric variables
#------------------------------------------------------

# with least squares fit line
#Plot(c("Salary", "Years", "Pre"), fit="ls")


#--------------------------------------------------------------
# Trellis graphics and by for groups with two numeric variables
#--------------------------------------------------------------

# Trellis plot with condition on 1-variable
Plot(Years, Salary, by1=Dept)

# all three by variables
#Plot(Years, Salary, by1=Dept, by2=Gender, by=HealthPlan)

# vary both shape and color with a least-squares fit line for each group
#style(color=c("darkgreen", "brown"))
#Plot(Years, Salary, by1=Gender, fit="ls", shape=c("F","M"), size=.8)
#style("gray")

# compare the men and women Salary according to Years worked
#   with an ellipse for each group
#Plot(Years, Salary, by=Gender, ellipse=.50)


#--------------------------------------------------
# analysis of a single numeric variable (or vector)
#--------------------------------------------------

# One continuous variable
# -----------------------
# Integrated Violin/Box/Scatterplot, a VBS plot
#Plot(Salary)

# by variable, different colors for different values of the variable
# all on one panel
#Plot(Salary, by=Dept)

# large sample size
#x <- rnorm(10000)
#Plot(x)

# custom colors for outliers, which might not appear in this subset data
#style(out.fill="hotpink", out2.fill="purple")
#Plot(Salary)
#style()

# no violin plot, boxplot and scatterplot only
#Plot(x, vbs.plot="bs")

# binned values to plot counts
# ----------------------------
# bin the values of Salary to plot counts as a frequency polygon
# the counts are plotted as points instead of the data
#Plot(Salary, values="count")  # bin the values

# time charts
#------------
# run chart, with fill area
#Plot(Salary, run=TRUE, area=TRUE)

# two run charts in same plot
# or could do a multivariate time series
#Plot(c(Pre, Post), run=TRUE)

# Trellis graphics run chart with custom line width, no points
#Plot(Salary, run=TRUE, by1=Gender, lwd=3, size=0)

# daily time series plot
# create the daily time series from R built-in data set airquality
#oz.ts <- ts(airquality$Ozone, start=c(1973, 121), frequency=365)
#Plot(oz.ts)

# multiple time series plotted from dates and stacked
# black background with translucent areas, then reset theme to default
#style(sub.theme="black", color="steelblue2", trans=.55, 
#   window.fill="gray10", grid.color="gray25")
#date <- seq(as.Date("2013/1/1"), as.Date("2016/1/1"), by="quarter")
#x1 <- rnorm(13, 100, 15)
#x2 <- rnorm(13, 100, 15)
#x3 <- rnorm(13, 100, 15)
#df <- data.frame(date, x1, x2, x3)
#Plot(date, x1:x3, data=df)
#style()


#------------------------------------------
# analysis of a single categorical variable
#------------------------------------------

# default 1-D bubble plot
# frequency plot, replaces bar chart 
Plot(Dept)

# abbreviated category labels
#Plot(Dept, label.max=2)

# plot of frequencies for each category (level), replaces bar chart 
#Plot(Dept, values="count")


#----------------------------------------------------
# scatterplot of numeric against categorical variable 
#----------------------------------------------------

# generate a chart with the plotted mean of each level
# rotate x-axis labels and then offset to fit
#style(rotate.x=45, offset=1)
#Plot(Dept, Salary)
#style()


#-------------------
# Cleveland dot plot 
#-------------------

# row.names on the y-axis
Plot(Salary, row.names)

# standard scatterplot
#Plot(Salary, row.names, sort.yx=FALSE, segments.y=FALSE)

# Cleveland dot plot with two x-variables
#Plot(c(Pre, Post), row.names)


#------------
# annotations
#------------

# add text at the one location specified by x1 and y1
#Plot(Years, Salary, add="Hi There", x1=12, y1=80000)

# add text at three different specified locations 
#Plot(Years, Salary, add="Hi", x1=c(12, 16, 18), y1=c(80000, 100000, 60000))

# add three different text blocks at three different specified locations
#Plot(Years, Salary, add=c("Hi", "Bye", "Wow"), x1=c(12, 16, 18),
#  y1=c(80000, 100000, 60000))

# add an 0.95 data ellipse and horizontal and vertical lines through the
#  respective means
#Plot(Years, Salary, ellipse=TRUE, add=c("v.line", "h.line"),
#  x1="mean.x", y1="mean.y")
# can be done also with the following short-hand
#Plot(Years, Salary, ellipse=TRUE, add=c("means"))
 
# a rectangle requires two points, <x1,y1> and <x2,y2>
#style(add.trans=.8, add.fill="gold", add.color="gold4", add.lwd=0.5)
#Plot(Years, Salary, add="rect", x1=12, y1=80000, x2=16, y2=115000)

# the first object, a rectangle, requires all four coordinates
# the vertical line at x=2 requires only an x1 coordinate, listed 2nd 
#Plot(Years, Salary, add=c("rect", "v.line"), x1=c(10, 2),
#  y1=80000, x2=12, y2=115000)

# two different rectangles with different locations, fill colors and translucence
#style(add.fill=c("gold3", "green"), add.trans=c(.8,.4))
#Plot(Years, Salary, add=c("rect", "rect"), 
#  x1=c(10, 2), y1=c(60000, 45000), x2=c(12, 75000), y2=c(80000, 55000))


#----------------------------------------------------
# analysis of two categorical variables (Likert data)
#----------------------------------------------------

mydata <- rd("Mach4", format="lessR", quiet=TRUE)  # Likert data, 0 to 5
mydata <- Subset(random=.5, quiet=TRUE)  # less computationally intensive

# size of each plotted point (bubble) depends on its joint frequency
# triggered by default when replication of joint values and
#   fewer than 9 unique data values for each
Plot(m06, m07)

# use value labels for the integer values, modify color options
#LikertCats <- c("Strongly Disagree", "Disagree", "Slightly Disagree",
#   "Slightly Agree", "Agree", "Strongly Agree")
#style(fill="powderblue", color="blue", bubble.text="darkred")
#Plot(m06,  m07, value.labels=LikertCats)
#style("darkred")  # reset theme

# get correlation analysis instead of cross-tab analysis:
#   maximum number of categories of equally spaced integer values
#   to define a variable as categorical here specified as 0
#Plot(m06, m07, n.cat=0)

# proportions within each level of the other variable
#Plot(m06, m07, proportion=TRUE)


#-----------------------------
# Bubble Plot Frequency Matrix
#-----------------------------

#Plot(c(m06,m07,m09,m10), value.labels=LikertCats)


#---------------
# function curve
#---------------

#x <- seq(10,50,by=2) 
#y1 <- sqrt(x)
#y2 <- x**.33
# x is sorted with equal intervals so function plot by default
#Plot(x, y1)
# custom function plot
#style(panel.fill="snow", area.fill="lightsteelblue")
#Plot(x, y1, ylab="My Y", xlab="My X", main="My Curve")
#style()

# multiple plots, need data frame
#mydata <- data.frame(x, y1, y2)
#Plot(x, c(y1, y2))



#-----------
# modern art
#-----------

#clr <- colors()
#color0 <- clr[sample(1:length(clr), size=1)]
#clr <- clr[-(153:353)]  # get rid of most of the grays
#n <- sample(4:30, size=1)
#x <- rnorm(n)
#y <- rnorm(n)
#color1 <- clr[sample(1:length(clr), size=1)]
#color2 <- clr[sample(1:length(clr), size=1)]
#style(window.fill=color0, area.fill=color1, color=color2)
#Plot(x, y, run=TRUE, 
# xy.ticks=FALSE, main="Modern Art", xlab="", ylab="",
# cex.main=2, col.main="lightsteelblue", n.cat=0, center.line="off")
#style() # reset style to default
# }
