unpivotr
unpivotr deals with non-tabular data, especially from spreadsheets. Use unpivotr when your source data has any of these ‘features’:
- Multi-headered hydra
- Meaningful formatting
- Headers anywhere but at the top of each column
- Non-text headers e.g. dates
- Other stuff around the table
- Several similar tables in one sheet
- Sentinel values
- Superscript symbols
- Meaningful comments
- Nested HTML tables
If that list makes your blood boil, you’ll enjoy the function names.
behead()
deals with multi-headered hydra tables one layer of headers at a time, working from the edge of the table inwards. It’s a bit like usingheader = TRUE
inread.csv()
, but because it’s a function, you can apply it to as many layers of headers as you need. You end up with all the headers in columns.spatter()
is liketidyr::spread()
but preserves mixed data types. You get into a mixed-data-type situation by delaying type coercion until after the table is tidy (rather than before, likeread.csv()
et al). And yes, it usually followsbehead()
.
More positive, corrective functions:
justify()
aligns column headers beforebehead()
ing, and has deliberate moral overtones.enhead()
attaches a header to the body of the data, a la Frankenstein. The effect is the same asbehead()
, but is more powerful because you can choose exactly which header cells you want, paying attention to formatting (whichbehead()
doesn’t understand).isolate_sentinels()
separates meaningful symbols like"N/A"
or"confidential"
from the rest of the data, giving them some time alone think about what they’ve done.partition()
takes a sheet with several tables on it, and slashes into pieces that each contain one table. You can then unpivot each table in turn withpurrr::map()
or similar.
Make cells tidy
Unpivotr uses data where each cells is represented by one row in a dataframe. Like this.
What can you do with tidy cells? The best places to start are:
- Spreadsheet Munging Strategies, a free, online cookbook using tidyxl and unpivotr
- Screencasts on YouTube.
- Worked examples on GitHub.
Otherwise the basic idea is:
- Read the data with a specialist tool.
- For spreadsheets, use tidyxl.
- For plain text files, you might soon be able to use
readr, but for now you’ll have to
install a pull-request on that package with
devtools::install_github("tidyverse/readr#760")
. - For tables in html pages, use
unpivotr::tidy_html()
- For data frames, use
unpivotr::as_cells()
– this should be a last resort, because by the time the data is in a conventional data frame, it is often too late – formatting has been lost, and most data types have been coerced to strings.
- Either
behead()
straight away, elsedplyr::filter()
separately for the header cells and the data cells, and then recombine withenhead()
. spatter()
so that each column has one data type.
library(unpivotr)
library(tidyverse)
#> ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
#> ✔ ggplot2 3.3.5 ✔ purrr 0.3.4
#> ✔ tibble 3.1.2 ✔ dplyr 1.0.7
#> ✔ tidyr 1.1.3 ✔ stringr 1.4.0
#> ✔ readr 1.4.0 ✔ forcats 0.5.1
#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag() masks stats::lag()
#> ✖ tidyr::pack() masks unpivotr::pack()
#> ✖ tidyr::unpack() masks unpivotr::unpack()
x <- purpose$`up-left left-up`
x # A pivot table in a conventional data frame. Four levels of headers, in two
#> X2 X3 X4 X5 X6 X7
#> 1 <NA> <NA> Female <NA> Male <NA>
#> 2 <NA> <NA> 0 - 6 7 - 10 0 - 6 7 - 10
#> 3 Bachelor's degree 15 - 24 7000 27000 <NA> 13000
#> 4 <NA> 25 - 44 12000 137000 9000 81000
#> 5 <NA> 45 - 64 10000 64000 7000 66000
#> 6 <NA> 65+ <NA> 18000 7000 17000
#> 7 Certificate 15 - 24 29000 161000 30000 190000
#> 8 <NA> 25 - 44 34000 179000 31000 219000
#> 9 <NA> 45 - 64 30000 210000 23000 199000
#> 10 <NA> 65+ 12000 77000 8000 107000
#> 11 Diploma 15 - 24 <NA> 14000 9000 11000
#> 12 <NA> 25 - 44 10000 66000 8000 47000
#> 13 <NA> 45 - 64 6000 68000 5000 58000
#> 14 <NA> 65+ 5000 41000 1000 34000
#> 15 No Qualification 15 - 24 10000 43000 12000 37000
#> 16 <NA> 25 - 44 11000 36000 21000 50000
#> 17 <NA> 45 - 64 19000 91000 17000 75000
#> 18 <NA> 65+ 16000 118000 9000 66000
#> 19 Postgraduate qualification 15 - 24 <NA> 6000 <NA> <NA>
#> 20 <NA> 25 - 44 5000 86000 7000 60000
#> 21 <NA> 45 - 64 6000 55000 6000 68000
#> 22 <NA> 65+ <NA> 13000 <NA> 18000
# rows and two columns.
y <- as_cells(x) # 'Tokenize' or 'melt' the data frame into one row per cell
y
#> # A tibble: 132 x 4
#> row col data_type chr
#> <int> <int> <chr> <chr>
#> 1 1 1 chr <NA>
#> 2 2 1 chr <NA>
#> 3 3 1 chr Bachelor's degree
#> 4 4 1 chr <NA>
#> 5 5 1 chr <NA>
#> 6 6 1 chr <NA>
#> 7 7 1 chr Certificate
#> 8 8 1 chr <NA>
#> 9 9 1 chr <NA>
#> 10 10 1 chr <NA>
#> # … with 122 more rows
rectify(y) # useful for reviewing the melted form as though in a spreadsheet
#> # A tibble: 22 x 7
#> `row/col` `1(A)` `2(B)` `3(C)` `4(D)` `5(E)` `6(F)`
#> <int> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 1 <NA> <NA> Female <NA> Male <NA>
#> 2 2 <NA> <NA> 0 - 6 7 - 10 0 - 6 7 - 10
#> 3 3 Bachelor's degree 15 - 24 7000 27000 <NA> 13000
#> 4 4 <NA> 25 - 44 12000 137000 9000 81000
#> 5 5 <NA> 45 - 64 10000 64000 7000 66000
#> 6 6 <NA> 65+ <NA> 18000 7000 17000
#> 7 7 Certificate 15 - 24 29000 161000 30000 190000
#> 8 8 <NA> 25 - 44 34000 179000 31000 219000
#> 9 9 <NA> 45 - 64 30000 210000 23000 199000
#> 10 10 <NA> 65+ 12000 77000 8000 107000
#> # … with 12 more rows
y %>%
behead("up-left", "sex") %>% # Strip headers
behead("up", "life-satisfication") %>% # one
behead("left-up", "qualification") %>% # by
behead("left", "age-band") %>% # one.
select(-row, -col, -data_type, count = chr) %>% # cleanup
mutate(count = as.integer(count))
#> # A tibble: 80 x 5
#> count sex `life-satisfication` qualification `age-band`
#> <int> <chr> <chr> <chr> <chr>
#> 1 7000 Female 0 - 6 Bachelor's degree 15 - 24
#> 2 12000 Female 0 - 6 Bachelor's degree 25 - 44
#> 3 10000 Female 0 - 6 Bachelor's degree 45 - 64
#> 4 NA Female 0 - 6 Bachelor's degree 65+
#> 5 27000 Female 7 - 10 Bachelor's degree 15 - 24
#> 6 137000 Female 7 - 10 Bachelor's degree 25 - 44
#> 7 64000 Female 7 - 10 Bachelor's degree 45 - 64
#> 8 18000 Female 7 - 10 Bachelor's degree 65+
#> 9 NA Male 0 - 6 Bachelor's degree 15 - 24
#> 10 9000 Male 0 - 6 Bachelor's degree 25 - 44
#> # … with 70 more rows
Note the compass directions in the code above, which hint to behead()
where to find the header cell for each data cell.
"up-left"
means the header (Female
,Male
) is positioned up and to the left of the columns of data cells it describes."up"
means the header (0 - 6
,7 - 10
) is positioned directly above the columns of data cells it describes."left-up"
means the header (Bachelor's degree
,Certificate
, etc.) is positioned to the left and upwards of the rows of data cells it describes."left"
means the header (15 - 24
,25 - 44
, etc.) is positioned directly to the left of the rows of data cells it describes.
Installation
# install.packages("devtools") # If you don't already have devtools
devtools::install_github("nacnudus/unpivotr", build_vignettes = TRUE)
The version 0.4.0 release had somee breaking changes. See NEWS.md
for
details. The previous version can be installed as follow:
devtools::install_version("unpivotr", version = "0.3.1", repos = "http://cran.us.r-project.org")
Similar projects
unpivotr is inspired by Databaker, a collaboration between the United Kingdom Office of National Statistics and The Sensible Code Company. unpivotr.
jailbreaker attempts to extract non-tabular data from spreadsheets into tabular structures automatically via some clever algorithms. unpivotr differs by being less magic, and equipping you to express what you want to do.