skimr

The goal of skimr is to provide a frictionless approach to working with summary statistics iteratively and interactively as part of a pipeline, one that conforms to the principle of least surprise.

Skimr provides summary statistics that you can skim quickly to understand your data and see what may be missing. It handles different data types (numerics, factors, etc.) and returns a skimr object that can be piped or displayed nicely for the human reader.

Installation

# install.packages("devtools")
devtools::install_github("ropenscilabs/skimr")

To install the version with the most recent changes, which have not yet been incorporated into the master branch (and may never be):

devtools::install_github("ropenscilabs/skimr", ref = "develop")

Skim statistics in the console

  • added missing, complete, n, sd
  • reports numeric/int/double separately from factor/chr
  • handles dates, logicals
  • supports spark-bar and spark-line based on Hadley Wickham's pillar package.

Nicely separates variables by class:

skim(chickwts)
## Skim summary statistics
##  n obs: 71 
##  n variables: 2 
## 
## Variable type: factor 
##   variable missing complete  n n_unique                         top_counts ordered
## 1     feed       0       71 71        6 soy: 14, cas: 12, lin: 12, sun: 12   FALSE
## 
## Variable type: numeric 
##   variable missing complete  n   mean    sd min   p25 median   p75 max     hist
## 1   weight       0       71 71 261.31 78.07 108 204.5    258 323.5 423 ▃▅▅▇▃▇▂▂

Presentation is in a compact horizontal format:

skim(iris)
## Skim summary statistics
##  n obs: 150 
##  n variables: 5 
## 
## Variable type: factor 
##   variable missing complete   n n_unique                       top_counts ordered
## 1  Species       0      150 150        3 set: 50, ver: 50, vir: 50, NA: 0   FALSE
## 
## Variable type: numeric 
##       variable missing complete   n mean   sd min p25 median p75 max     hist
## 1 Petal.Length       0      150 150 3.76 1.77 1   1.6   4.35 5.1 6.9 ▇▁▁▂▅▅▃▁
## 2  Petal.Width       0      150 150 1.2  0.76 0.1 0.3   1.3  1.8 2.5 ▇▁▁▅▃▃▂▂
## 3 Sepal.Length       0      150 150 5.84 0.83 4.3 5.1   5.8  6.4 7.9 ▂▇▅▇▆▅▂▂
## 4  Sepal.Width       0      150 150 3.06 0.44 2   2.8   3    3.3 4.4 ▁▂▅▇▃▂▁▁

Individual columns of a data frame can be selected using tidyverse-style selectors.

skim(iris, Sepal.Length, Petal.Length)
## Skim summary statistics
##  n obs: 150 
##  n variables: 5 
## 
## Variable type: numeric 
##       variable missing complete   n mean   sd min p25 median p75 max     hist
## 1 Petal.Length       0      150 150 3.76 1.77 1   1.6   4.35 5.1 6.9 ▇▁▁▂▅▅▃▁
## 2 Sepal.Length       0      150 150 5.84 0.83 4.3 5.1   5.8  6.4 7.9 ▂▇▅▇▆▅▂▂

Handles grouped data

skim() can handle data that has been grouped using dplyr::group_by.

iris %>% dplyr::group_by(Species) %>% skim()
## Skim summary statistics
##  n obs: 150 
##  n variables: 5 
##  group variables: Species 
## 
## Variable type: numeric 
##       Species     variable missing complete  n mean   sd min  p25 median  p75 max     hist
## 1      setosa Petal.Length       0       50 50 1.46 0.17 1   1.4    1.5  1.58 1.9 ▁▁▅▇▇▅▂▁
## 2      setosa  Petal.Width       0       50 50 0.25 0.11 0.1 0.2    0.2  0.3  0.6 ▂▇▁▂▂▁▁▁
## 3      setosa Sepal.Length       0       50 50 5.01 0.35 4.3 4.8    5    5.2  5.8 ▂▃▅▇▇▃▁▂
## 4      setosa  Sepal.Width       0       50 50 3.43 0.38 2.3 3.2    3.4  3.68 4.4 ▁▁▃▅▇▃▂▁
## 5  versicolor Petal.Length       0       50 50 4.26 0.47 3   4      4.35 4.6  5.1 ▁▃▂▆▆▇▇▃
## 6  versicolor  Petal.Width       0       50 50 1.33 0.2  1   1.2    1.3  1.5  1.8 ▆▃▇▅▆▂▁▁
## 7  versicolor Sepal.Length       0       50 50 5.94 0.52 4.9 5.6    5.9  6.3  7   ▃▂▇▇▇▃▅▂
## 8  versicolor  Sepal.Width       0       50 50 2.77 0.31 2   2.52   2.8  3    3.4 ▁▂▃▅▃▇▃▁
## 9   virginica Petal.Length       0       50 50 5.55 0.55 4.5 5.1    5.55 5.88 6.9 ▂▇▃▇▅▂▁▂
## 10  virginica  Petal.Width       0       50 50 2.03 0.27 1.4 1.8    2    2.3  2.5 ▂▁▇▃▃▆▅▃
## 11  virginica Sepal.Length       0       50 50 6.59 0.64 4.9 6.23   6.5  6.9  7.9 ▁▁▃▇▅▃▂▃
## 12  virginica  Sepal.Width       0       50 50 2.97 0.32 2.2 2.8    3    3.18 3.8 ▁▃▇▇▅▃▁▂

Options for kable and pander

Enhanced print options are available by piping to kable() or pander().
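
As a rough sketch of what that piping could look like, assuming skimr 1.0 and the pander package are installed (the iris data set is only an illustrative choice, and the rendered result will depend on the output format):

```r
# A minimal sketch, assuming skimr 1.0 and the pander package are installed.
library(skimr)
library(magrittr)  # provides %>%

skimmed <- skim(iris)

# kable() is listed among the functions of skimr 1.0 and has a skim_df method.
skimmed %>% skimr::kable()

# pander() is the generic from the pander package; skimr 1.0 lists a
# pander.skim_df method for skimmed data frames.
skimmed %>% pander::pander()
```

Both calls are intended for knitted documents; in the console the default print method is usually easier to read.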

skim_df object (long format)

By default skim prints beautifully in the console, but it also produces a long, tidy-format skim_df object that can be computed on.

a <-  skim(chickwts)
dim(a)
## [1] 23  6
print.data.frame(skim(chickwts))
##    variable    type       stat     level    value formatted
## 1    weight numeric    missing      .all   0.0000         0
## 2    weight numeric   complete      .all  71.0000        71
## 3    weight numeric          n      .all  71.0000        71
## 4    weight numeric       mean      .all 261.3099    261.31
## 5    weight numeric         sd      .all  78.0737     78.07
## 6    weight numeric        min      .all 108.0000       108
## 7    weight numeric        p25      .all 204.5000     204.5
## 8    weight numeric     median      .all 258.0000       258
## 9    weight numeric        p75      .all 323.5000     323.5
## 10   weight numeric        max      .all 423.0000       423
## 11   weight numeric       hist      .all       NA  ▃▅▅▇▃▇▂▂
## 12     feed  factor    missing      .all   0.0000         0
## 13     feed  factor   complete      .all  71.0000        71
## 14     feed  factor          n      .all  71.0000        71
## 15     feed  factor   n_unique      .all   6.0000         6
## 16     feed  factor top_counts   soybean  14.0000   soy: 14
## 17     feed  factor top_counts    casein  12.0000   cas: 12
## 18     feed  factor top_counts   linseed  12.0000   lin: 12
## 19     feed  factor top_counts sunflower  12.0000   sun: 12
## 20     feed  factor top_counts  meatmeal  11.0000   mea: 11
## 21     feed  factor top_counts horsebean  10.0000   hor: 10
## 22     feed  factor top_counts      <NA>   0.0000     NA: 0
## 23     feed  factor    ordered      .all   0.0000     FALSE

Compute on the full skim_df object

skim(mtcars) %>% dplyr::filter(stat=="hist")
## # A tibble: 11 x 6
##    variable    type  stat level value formatted
##       <chr>   <chr> <chr> <chr> <dbl>     <chr>
##  1      mpg numeric  hist  .all    NA  ▃▇▇▇▃▂▂▂
##  2      cyl numeric  hist  .all    NA  ▆▁▁▃▁▁▁▇
##  3     disp numeric  hist  .all    NA  ▇▆▁▂▅▃▁▂
##  4       hp numeric  hist  .all    NA  ▃▇▃▅▂▃▁▁
##  5     drat numeric  hist  .all    NA  ▃▇▁▅▇▂▁▁
##  6       wt numeric  hist  .all    NA  ▃▃▃▇▆▁▁▂
##  7     qsec numeric  hist  .all    NA  ▃▂▇▆▃▃▁▁
##  8       vs numeric  hist  .all    NA  ▇▁▁▁▁▁▁▆
##  9       am numeric  hist  .all    NA  ▇▁▁▁▁▁▁▆
## 10     gear numeric  hist  .all    NA  ▇▁▁▆▁▁▁▂
## 11     carb numeric  hist  .all    NA  ▆▇▂▇▁▁▁▁
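
The same pattern works for any statistic stored in the long format. A minimal sketch, assuming skimr 1.0 (where the result has the `variable`, `stat`, and `value` columns shown above) and dplyr; the name `means` is only illustrative:

```r
# A minimal sketch, assuming skimr 1.0 (long-format skim_df) and dplyr.
library(skimr)
library(dplyr)

# Keep only the rows that hold the mean of each variable, then keep
# the two columns needed to identify and read the statistic.
means <- skim(mtcars) %>%
  dplyr::filter(stat == "mean") %>%
  dplyr::select(variable, value)

means
```

Filtering on `stat` and reading `value` keeps the numeric result, whereas the `formatted` column holds the rounded character version used for printing.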

Works with strings, lists and other column classes.

skim(dplyr::starwars)
## Skim summary statistics
##  n obs: 87 
##  n variables: 13 
## 
## Variable type: character 
##     variable missing complete  n min max empty n_unique
## 1  eye_color       0       87 87   3  13     0       15
## 2     gender       3       84 87   4  13     0        4
## 3 hair_color       5       82 87   4  13     0       12
## 4  homeworld      10       77 87   4  14     0       48
## 5       name       0       87 87   3  21     0       87
## 6 skin_color       0       87 87   3  19     0       31
## 7    species       5       82 87   3  14     0       37
## 
## Variable type: integer 
##   variable missing complete  n   mean    sd min p25 median p75 max     hist
## 1   height       6       81 87 174.36 34.77  66 167    180 191 264 ▁▁▁▂▇▃▁▁
## 
## Variable type: list 
##    variable missing complete  n n_unique min_length median_length max_length
## 1     films       0       87 87       24          1             1          7
## 2 starships       0       87 87       17          0             0          5
## 3  vehicles       0       87 87       11          0             0          2
## 
## Variable type: numeric 
##     variable missing complete  n  mean     sd min  p25 median  p75  max     hist
## 1 birth_year      44       43 87 87.57 154.69   8 35       52 72    896 ▇▁▁▁▁▁▁▁
## 2       mass      28       59 87 97.31 169.46  15 55.6     79 84.5 1358 ▇▁▁▁▁▁▁▁

Users can also add support for new column classes.

Specify your own statistics

# Replace the default numeric statistics with the IQR and the 99th percentile.
funs <- list(
  iqr = IQR,
  quantile = purrr::partial(quantile, probs = .99)
)
skim_with(numeric = funs, append = FALSE)
skim(iris, Sepal.Length)
## Skim summary statistics
##  n obs: 150 
##  n variables: 5 
## 
## Variable type: numeric 
##       variable iqr quantile
## 1 Sepal.Length 1.3      7.7
# Restore defaults
skim_with_defaults()

Limitations of current version

We are aware that there are issues with rendering the inline histograms and line charts in various contexts, some of which are described below.

Windows support for spark histograms

Windows cannot print the spark-histogram characters when printing a data frame. For example, "▂▅▇" is printed as "<U+2582><U+2585><U+2587>". This longstanding problem originates in the low-level code for printing data frames. One workaround for showing these characters on Windows is to set the CTYPE part of your locale to Chinese, Japanese, or Korean, for example with Sys.setlocale("LC_CTYPE", "Chinese"). The characters do display correctly by default when a data frame created by skim() is printed as a list (as.list()) or as a matrix (as.matrix()).
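
The locale workaround can be wrapped so that it only takes effect on Windows and the previous setting can be restored afterwards. This is a minimal sketch using base R only; `use_cjk_ctype_on_windows` is a hypothetical helper name and is not part of skimr:

```r
# A minimal sketch of the locale workaround described above.
# `use_cjk_ctype_on_windows` is a hypothetical helper, not part of skimr.
use_cjk_ctype_on_windows <- function(locale = "Chinese") {
  previous <- Sys.getlocale("LC_CTYPE")
  if (.Platform$OS.type == "windows") {
    # Sys.setlocale() returns "" (with a warning) if the request
    # cannot be honoured on this system.
    Sys.setlocale("LC_CTYPE", locale)
  }
  # Return the earlier setting so the caller can restore it.
  invisible(previous)
}

previous <- use_cjk_ctype_on_windows()
# ... print skim() results here ...

# Restore the original setting afterwards (only changed on Windows).
if (.Platform$OS.type == "windows") Sys.setlocale("LC_CTYPE", previous)
```

On other platforms the helper leaves the locale untouched, so the same script can be shared across operating systems.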

Printing spark histograms and line graphs in knitted documents

Spark-bar and spark-line work in the console but may not render correctly when you knit to a particular document format. The same session that produces a correctly rendered HTML document may produce an incorrectly rendered PDF, for example. This issue can generally be addressed by switching to a font with good support for block elements (used by the histograms) and Braille characters (used by the line graphs). For example, the open font "DejaVu Sans" from the extrafont package supports these. You may also want to try wrapping your results in knitr::kable(). Please see the vignette on using fonts for details.

Display will vary between document types. For example, one user found that the font "Yu Gothic UI Semilight" produced consistent results in Microsoft Word and LibreOffice Writer.

Contributing

We welcome issue reports and pull requests, including ones that add support for different variable classes. Please see the contributing.md document.

Version: 1.0
License: GPL-3
Last published: December 21st, 2017
Monthly downloads: 47,104
Install from CRAN: install.packages('skimr')

Functions in skimr (1.0)

  • inline_hist: Generate inline histogram for numeric variables
  • inline_linegraph: Generate inline line graph for time series variables
  • n_missing: Calculate missing values
  • n_unique: Calculate the number of unique elements but remove NA
  • list_max_length: Get the length of the longest list in a vector of lists
  • list_min_length: Get the length of the shortest list in a vector of lists
  • n_complete: Calculate complete values
  • n_empty: Calculate the number of blank values in a character vector
  • skim_to_wide: Print skim result and return a single wide data frame of summary statistics
  • skim_with: Set or add the summary functions for a particular type of data
  • spark_bar: Draw a sparkline bar graph with Unicode block characters
  • list_lengths_median: Get the median length of the lists
  • list_lengths_min: Get the length of the shortest list in a vector of lists
  • pander.summary_skim_df: Pander method for a summary_skim_df object
  • kable: Create kable object
  • kable.skim_df: Produce kable output of a skimmed data frame
  • ts_start: Get the start for a time series without the frequency
  • kable.summary_skim_df: Kable method for a summary_skim_df object
  • list_lengths_max: Get the maximum length of the lists
  • print.skim_df: Print skimmed data frame
  • print.skim_vector: Manages print for skim_vector objects
  • pander.skim_df: Produce pander output of a skimmed data frame
  • spark_line: Draw a sparkline line graph with Braille characters
  • show_formats: Show formatting options currently used, by data type
  • show_skimmers: Working with summary functions currently used, by data type
  • summary.skim_df: Summary function for skim_df. This is a method of the generic function summary
  • ts_end: Get the finish for a time series without the frequency
  • skim_tee: Print useful summary statistic from a data frame returning the data frame without modification
  • skim_to_list: Print skim result and return a list of tibbles
  • max_char: Calculate the maximum number of characters within a character vector
  • min_char: Calculate the minimum number of characters within a character vector
  • skim: Get useful summary statistic from a data frame
  • skim_format: Change the formatting options for printed skim objects
  • print.summary_skim_df: Print method for a summary_skim_df object. This is a method for the generic function print
  • reexports: Objects exported from other packages
  • skimr-package: Skim a data frame
  • sorted_count: Create a contingency table and arrange its levels in descending order