A tableplot is a visualisation of (large) multivariate datasets. Each column represents a variable and each row bin is an aggregate of a certain number of records. For numeric variables, a bar chart of the mean values is depicted. For categorical variables, a stacked bar chart of the category proportions is depicted. Missing values are taken into account. Large ffdf datasets from the ff package are also supported. For a quick intro, see vignette("tabplot-vignette").
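The row-bin aggregation described above can be sketched in base R. This is an illustration only, not the package's implementation: bin_summary() is a hypothetical helper, and the equal-sized binning and the treatment of missing values are assumptions made for the sketch.

```r
# Base-R sketch of the aggregation behind a tableplot.
# bin_summary() is a hypothetical helper, not part of tabplot.
bin_summary <- function(dat, sort_col, n_bins = 100, decreasing = TRUE) {
  # Sort the records on the chosen column (missing values last).
  ord <- order(dat[[sort_col]], decreasing = decreasing, na.last = TRUE)
  dat <- dat[ord, , drop = FALSE]

  # Assign each sorted record to one of n_bins roughly equal-sized row bins.
  bin <- ceiling(seq_len(nrow(dat)) * n_bins / nrow(dat))

  lapply(dat, function(x) {
    if (is.numeric(x)) {
      # Numeric variable: mean per bin, ignoring missing values.
      tapply(x, bin, mean, na.rm = TRUE)
    } else {
      # Categorical variable: proportion of each category per bin,
      # with missing values (if any) counted as their own category.
      prop.table(table(bin, x, useNA = "ifany"), margin = 1)
    }
  })
}

agg <- bin_summary(iris, "Sepal.Length", n_bins = 10)
agg$Sepal.Length  # one mean per row bin, largest values first
agg$Species       # one row of category proportions per row bin
```

Each element of the result corresponds to one column of the tableplot: the numeric vectors would be drawn as bars, and the proportion tables as stacked bars.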
tableplot(dat, select, subset = NULL, sortCol = 1, decreasing = TRUE,
nBins = 100, from = 0, to = 100, nCols = ncol(dat),
sample = FALSE, sampleBinSize = 1000, scales = "auto",
numMode = "mb-sdb-ml", max_levels = 50, pals = list("Set1", "Set2",
"Set3", "Set4"), change_palette_type_at = 20, rev_legend = FALSE,
colorNA = "#FF1414", colorNA_num = "gray75", numPals = "OrBu",
limitsX = NULL, bias_brokenX = 0.8, IQR_bias = 5,
select_string = NULL, subset_string = NULL, colNames = NULL,
filter = NULL, plot = TRUE, ...)
dat
a data.frame, an ffdf object, or an object created by tablePrepare (see details below). Required.
select
expression indicating the columns of dat that are visualized in the tableplot. Column indices are also supported. By default, all columns are visualized. Use select_string for character strings instead of expressions.
subset
logical expression indicating which rows to select in dat (as in subset). It is also possible to provide the name of a categorical variable: a tableplot is then generated for each category. Use subset_string for character strings instead of expressions.
sortCol
column on which the dataset is sorted. It can be an index, an expression name, or a character string. Note: in case of ambiguity, the character string is used, as in this example: Sepal.Width <- "Petal.Width"; tableplot(iris, sortCol=Sepal.Width), which sorts on Petal.Width.
decreasing
boolean that determines whether the dataset is sorted decreasingly (TRUE) or increasingly (FALSE).
nBins
number of row bins
from
percentage from which the sorted data is shown
to
percentage to which the sorted data is shown
nCols
the maximum number of columns per tableplot. If this number is smaller than the number of selected columns, multiple tableplots are generated, each of which contains the sorted column(s).
sample
boolean that determines whether to sample or use the whole data. Only useful when tablePrepare is used.
sampleBinSize
the number of sampled objects per bin, if sample is TRUE.
scales
determines the horizontal axes of the numeric variables in select. Options: "lin", "log", and "auto" for automatic detection. Either scales is a named vector, where the names correspond to numerical variable names, or scales is unnamed, in which case the values are applied to all numeric variables (recycled if necessary).
numMode
character value that determines how numeric values are plotted. The value consists of the following building blocks, which are concatenated with the "-" symbol. The default value is "mb-sdb-ml". Prior to version 1.2, "MB-ML" was the default value.
  sdb: sd bars from mean-sd to mean+sd are shown
  sdl: sd lines at mean-sd and mean+sd are shown
  mb: mean bars are shown
  MB: mean bars are shown, where the color of the bar indicates completeness, with positive mean values in blue and negative ones in orange
  ml: mean lines are shown
  ML: mean lines are shown, with positive mean values in blue and negative ones in orange
  mean2: mean values are shown
max_levels
maximum number of levels for categorical variables. Categorical variables with more levels will be rebinned into max_levels levels. Either a positive number or -1, which means that categorical variables are never rebinned.
pals
list of color palettes. Each list item is one of the following:
- a palette name of tablePalettes, optionally with the starting color between brackets
- a color vector
If the list items are unnamed, they are applied to all selected categorical variables (recycled if necessary). The list items can be assigned to specific categorical variables by naming them accordingly.
change_palette_type_at
number at which the type of categorical palettes is changed. For categorical variables with fewer than change_palette_type_at levels, the palette is recycled if necessary. For categorical variables with change_palette_type_at levels or more, a new palette of interpolated colors is derived (like a rainbow palette).
rev_legend
logical value or vector that determines which legends are reversed. If a vector is provided, the names of the items should be the names of (a selection of) the categorical variables.
colorNA
color for missing values for categorical variables.
colorNA_num
color for missing values for numeric variables. It is used when all values in a bin are missing. If only part of the values are missing, a brighter color is used (see argument numPals).
numPals
vector of palette names that are used for numeric variables. These names are chosen from the diverging palette names in tablePalettes. Either numPals is a named vector, where the names correspond to the numerical variable names, or an unnamed vector (recycled if necessary). A "-" prefix in the name reverses the palette. When sd bars are shown (see the argument numMode of plot), only the right-hand side of the palette is used, where brightness differentiates between the mean bar and the sd bar. When sd bars are not shown (the default in versions before 1.2), the right-hand side of the palette is used for positive mean values, and the left-hand side for negative mean values. The brightness of the color is determined by the fraction of missing values.
limitsX
a list of vectors of length two, where each vector contains a lower and an upper limit value. Either the names of limitsX correspond to numerical variable names, or limitsX is an unnamed list (recycled if necessary).
bias_brokenX
parameter between 0 and 1 that determines when the x-axis of a numeric variable is broken. If the minimum value is at least bias_brokenX times the maximum value, then the x-axis is broken. To turn off broken x-axes, set bias_brokenX=1.
IQR_bias
parameter that determines when a logarithmic scale is used when scales is set to "auto". The argument IQR_bias is multiplied by the interquartile range as a test.
select_string
character equivalent of the select argument (particularly useful for programming purposes)
subset_string
character equivalent of the subset argument (particularly useful for programming purposes)
colNames
deprecated; used in older versions of tabplot (prior to 0.12): use select_string instead
filter
deprecated; used in older versions of tabplot (prior to 0.12): use subset_string instead
plot
boolean that determines whether the tableplot is plotted
...
layout arguments, such as fontsize and title, that are passed on to plot
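Two of the argument rules described above, for bias_brokenX and change_palette_type_at, can be sketched in base R. Both helpers below are hypothetical and only restate the documented rules; they are not tabplot internals, and the example colors are arbitrary. The broken-axis check assumes positive values.

```r
# Hypothetical helpers that restate two documented rules; not tabplot code.

# bias_brokenX: the x-axis is broken when the minimum value is at least
# bias_brokenX times the maximum value (positive values assumed).
is_broken_axis <- function(x, bias_brokenX = 0.8) {
  rng <- range(x, na.rm = TRUE)
  rng[1] >= bias_brokenX * rng[2]
}

# change_palette_type_at: below the threshold the palette is recycled,
# at or above it a palette of interpolated colors is derived.
pick_colors <- function(palette, n_levels, change_palette_type_at = 20) {
  if (n_levels < change_palette_type_at) {
    rep_len(palette, n_levels)
  } else {
    grDevices::colorRampPalette(palette)(n_levels)
  }
}

is_broken_axis(c(90, 95, 100))  # TRUE: 90 >= 0.8 * 100
is_broken_axis(c(10, 100))      # FALSE: 10 < 0.8 * 100

base_pal <- c("#E41A1C", "#377EB8", "#4DAF4A")  # arbitrary example colors
pick_colors(base_pal, 5)        # the three colors, recycled to length 5
pick_colors(base_pal, 25)       # 25 interpolated colors
```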
Returns a tabplot-object (silent output). If multiple tableplots are generated (which can be done by either setting subset to a categorical column name, or by restricting the number of columns with nCols), then a list of tabplot-objects is silently returned.
For large datasets, we recommend using tablePrepare, which does all the preprocessing needed to make any tableplot of the particular dataset. The resulting object is passed on to tableplot (argument dat). Tableplotting is then very fast, and even faster with sampling enabled (sample=TRUE).
# NOT RUN {
# load tabplot and the diamonds dataset from ggplot2
library(tabplot)
library(ggplot2)
data(diamonds)

# default tableplot
tableplot(diamonds)

# prior to version 1.2, the mean values of numeric variables were displayed
# without standard deviation (see ?plot.tabplot):
tableplot(diamonds, numMode = "MB-ML")

# most expensive diamonds (the top 5% when sorted on price)
tableplot(diamonds,
          select = c(carat, cut, color, clarity, price),
          sortCol = price,
          from = 0,
          to = 5)

# for large datasets, we recommend preprocessing the data with tablePrepare:
p <- tablePrepare(diamonds)

# specific subsetting
tableplot(p, subset = price < 5000 & cut == "Ideal")

# change palettes
tableplot(p,
          pals = list(cut = "Set4", color = "Paired",
                      clarity = grey(seq(0, 1, length.out = 7))),
          numPals = c(carat = "PRGn", price = "BrBG"))

# create a tableplot per cut category, and fix the scale limits of
# carat, table, and price
tabs <- tableplot(p, subset = cut,
                  limitsX = list(carat = c(0, 4), table = c(55, 65),
                                 price = c(0, 20000)),
                  plot = FALSE)
plot(tabs[[3]], title = "Very good cut diamonds")
# }