tableplot: Create a tableplot

Description

A tableplot is a visualisation of (large) multivariate datasets. Each column represents a variable and each row bin is an aggregate of a certain number of records. For numeric variables, a bar chart of the mean values is depicted. For categorical variables, a stacked bar chart of the proportions of categories is depicted. Missing values are taken into account. Large ffdf datasets from the ff package are also supported. For a quick intro, see vignette("tabplot-vignette").

Usage

tableplot(dat, select, subset = NULL, sortCol = 1, decreasing = TRUE,
  nBins = 100, from = 0, to = 100, nCols = ncol(dat),
  sample = FALSE, sampleBinSize = 1000, scales = "auto",
  numMode = "mb-sdb-ml", max_levels = 50, pals = list("Set1", "Set2",
  "Set3", "Set4"), change_palette_type_at = 20, rev_legend = FALSE,
  colorNA = "#FF1414", colorNA_num = "gray75", numPals = "OrBu",
  limitsX = NULL, bias_brokenX = 0.8, IQR_bias = 5,
  select_string = NULL, subset_string = NULL, colNames = NULL,
  filter = NULL, plot = TRUE, ...)

Arguments

dat

a data.frame, an ffdf object, or an object created by tablePrepare (see details below). Required.

select

expression indicating the columns of dat that are visualized in the tableplot. Column indices are also supported. By default, all columns are visualized. Use select_string for character strings instead of expressions.

subset

logical expression indicating which rows to select in dat (as in subset). It is also possible to provide the name of a categorical variable, in which case a tableplot is generated for each category. Use subset_string for character strings instead of expressions.

sortCol

column on which the dataset is sorted. It can be an index, an expression name, or a character string. Note: in case of ambiguity, the character string is used, as in this example, which sorts on Petal.Width: Sepal.Width <- "Petal.Width"; tableplot(iris, sortCol=Sepal.Width).

decreasing

boolean that determines whether the dataset is sorted decreasingly (TRUE) or increasingly (FALSE).

nBins

number of row bins

from

percentage from which the sorted data is shown

to

percentage to which the sorted data is shown

nCols

the maximum number of columns per tableplot. If this number is smaller than the number of columns selected in select, multiple tableplots are generated, each of which contains the sorted column(s).

sample

boolean that determines whether to sample or use the whole data. Only useful when tablePrepare is used.

sampleBinSize

the number of sampled objects per bin, if sample is TRUE.

scales

determines the horizontal axes of the numeric variables in select. Options: "lin", "log", and "auto" for automatic detection. Either scales is a named vector, where the names correspond to numerical variable names, or scales is unnamed, where the values are applied to all numeric variables (recycled if necessary).
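
As a minimal sketch of the named form, the call below assigns a scale to each numeric variable by name. It assumes that the tabplot package is installed and uses the built-in iris data; the choice of variables and the name var_scales are illustrative only.

```r
# One scale per numeric variable, matched by variable name
var_scales <- c(Sepal.Length = "lin", Petal.Length = "log")

# Run the call only if tabplot is available (assumption: it may not be installed)
if (requireNamespace("tabplot", quietly = TRUE)) {
  tab <- tabplot::tableplot(iris,
                            select = c(Sepal.Length, Petal.Length, Species),
                            scales = var_scales,
                            plot = FALSE)
}
```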

numMode

character value that determines how numeric values are plotted. The value consists of the following building blocks, which are concatenated with the "-" symbol. The default value is "mb-sdb-ml" (see Usage). Prior to version 1.2, "MB-ML" was the default value.

sdb: sd bars between mean-sd to mean+sd are shown
sdl: sd lines at mean-sd and mean+sd are shown
mb: mean bars are shown
MB: mean bars are shown, where the color of the bar indicates completeness; positive mean values are blue and negative ones orange
ml: mean lines are shown
ML: mean lines are shown, where positive mean values are blue and negative orange
mean2: mean values are shown
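
The concatenation of building blocks can be sketched as follows. This is an illustration under the assumption that tabplot is installed; the variable names blocks and num_mode are hypothetical and not part of the package.

```r
# Building blocks taken from the list above
blocks <- c("mb", "sdb", "ml")

# Concatenate the blocks with the "-" symbol
num_mode <- paste(blocks, collapse = "-")

# Run the call only if tabplot is available (assumption: it may not be installed)
if (requireNamespace("tabplot", quietly = TRUE)) {
  tab <- tabplot::tableplot(iris, numMode = num_mode, plot = FALSE)
}
```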

max_levels

maximum number of levels for categorical variables. Categorical variables with more levels will be rebinned into max_levels levels. Either a positive number or -1, which means that categorical variables are never rebinned.

pals

list of color palettes. Each list item is one of the following:

a palette name of tablePalettes, optionally with the starting color between brackets.
a color vector

If the list items are unnamed, they are applied to all selected categorical variables (recycled if necessary). The list items can be assigned to specific categorical variables, by naming them accordingly.

change_palette_type_at

number at which the type of categorical palettes is changed. For categorical variables with fewer than change_palette_type_at levels, the palette is recycled if necessary. For categorical variables with change_palette_type_at levels or more, a new palette of interpolated colors is derived (like a rainbow palette).

rev_legend

logical value or vector that determines which legends are reversed. If a vector is provided, the names of the items should be the names of (a selection of) the categorical variables.

colorNA

color for missing values for categorical variables.

colorNA_num

color for missing values for numeric variables. It is used when all values in a bin are missing. If a part of the values are missing, a brighter color is used (see argument numPals).

numPals

vector of palette names that are used for numeric variables. These names are chosen from the diverging palette names in tablePalettes. Either numPals is a named vector, where the names correspond to the numerical variable names, or an unnamed vector (recycled if necessary). A "-" prefix in the name reverses the palette. When sd bars are shown (see the argument numMode), only the right-hand side of the palette is used, where brightness is used to differentiate between mean bar and sd bar. When sd bars are not shown (the default in versions before 1.2), the right-hand side of the palette is used for positive mean values, and the left-hand side for negative mean values. The brightness of the color is determined by the fraction of missing values.

limitsX

a list of vectors of length two, where each vector contains a lower and an upper limit value. Either the names of limitsX correspond to numerical variable names, or limitsX is an unnamed list (recycled if necessary).

bias_brokenX

parameter between 0 and 1 that determines when the x-axis of a numeric variable is broken. If the minimum value is at least bias_brokenX times the maximum value, then the x-axis is broken. To turn off broken x-axes, set bias_brokenX=1.
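
The rule above can be illustrated in base R. This is only a sketch of the rule as stated here, not the package's implementation; the helper axis_is_broken is hypothetical, and the sketch assumes positive values.

```r
# Hypothetical helper (not part of tabplot): applies the stated rule literally,
# i.e. the axis is broken when min(x) >= bias_brokenX * max(x)
axis_is_broken <- function(x, bias_brokenX = 0.8) {
  rng <- range(x, na.rm = TRUE)   # rng[1] is the minimum, rng[2] the maximum
  rng[1] >= bias_brokenX * rng[2]
}

narrow <- c(90, 95, 100)   # minimum 90 is at least 0.8 * 100
wide   <- c(10, 50, 100)   # minimum 10 is below 0.8 * 100

axis_is_broken(narrow)                     # TRUE
axis_is_broken(wide)                       # FALSE
axis_is_broken(narrow, bias_brokenX = 1)   # FALSE
```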

IQR_bias

parameter that determines when a logarithmic scale is used when scales is set to "auto". The argument IQR_bias is multiplied by the interquartile range as a test.

select_string

character equivalent of the select argument (particularly useful for programming purposes)

subset_string

character equivalent of the subset argument (particularly useful for programming purposes)

colNames

deprecated; used in older versions of tabplot (prior to 0.12): use select_string instead

filter

deprecated; used in older versions of tabplot (prior to 0.12): use subset_string instead

plot

boolean, to plot or not to plot a tableplot

...

layout arguments, such as fontsize and title, are passed on to plot

Value

tabplot-object (silent output). If multiple tableplots are generated (which can be done by either setting subset to a categorical column name, or by restricting the number of columns with nCols), then a list of tabplot-objects is silently returned.

Details

For large datasets, we recommend using tablePrepare, which does all the preprocessing needed to make any tableplot of the particular dataset. The object returned by this function is passed on to tableplot (argument dat). Tableplotting is then very fast, and even faster with sampling enabled (sample=TRUE).

References

Tennekes, M., Jonge, E. de, Daas, P.J.H. (2013) Visualizing and Inspecting Large Datasets with Tableplots, Journal of Data Science 11 (1), 43-58

Examples

# NOT RUN {
# load diamonds dataset from ggplot2
require(ggplot2)
data(diamonds)

# default tableplot
tableplot(diamonds)

# prior to version 1.2, the mean values of numeric variables were displayed
# without standard deviation (see ?plot.tabplot):
tableplot(diamonds, numMode = "MB-ML")

# most expensive diamonds
tableplot(diamonds, 
		  select=c(carat, cut, color, clarity, price), 
		  sortCol=price, 
		  from=0, 
		  to=5)

# for large datasets, we recommend preprocessing the data with tablePrepare:
p <- tablePrepare(diamonds)

# specific subsetting
tableplot(p, subset=price < 5000 & cut=='Ideal')

# change palettes
tableplot(p, 
		  pals=list(cut="Set4", color="Paired", clarity=grey(seq(0, 1,length.out=7))),
		  numPals=c(carat="PRGn", price="BrBG"))

# create a tableplot per cut category, and fix scale limits of carat, table, and price
tabs <- tableplot(p, subset=cut,
	limitsX=list(carat=c(0,4), table=c(55, 65), price=c(0, 20000)), plot=FALSE)
plot(tabs[[3]], title="Very good cut diamonds")

# }
