naniar

naniar aims to make it easy to summarise, visualise, and manipulate missing data. Currently it provides:

Data structures for missing data
- as_shadow
- bind_shadow
- gather_shadow
- is_na
Visualisation methods:
- geom_missing_point
- gg_missing_var
Numerical summaries:
- n_miss
- percent_missing_*
- summary_missing_*
- table_missing_*

For a more formal description, you can read the vignette "building on ggplot2 for exploration of missing values".

Why naniar?

naniar was previously named ggmissing and initially provided a ggplot geom and some visual summaries. It was changed to naniar to reflect the fact that this package is going to be bigger in scope, and is not just related to ggplot2. Specifically, the package is designed to provide a suite of tools for generating visualisations of missing values and imputations, manipulate, and summarise missing data.

But why naniar?

Well, I think it is useful to think of missing values in data being like this other dimension, perhaps like Narnia - a different world, hidden away. Close, but very different. So the name, "naniar", is a play on the "Narnia" books. e.g., naniar: The Last Battle (...with missing data). Also, NAniar, naniar = na in r, and if you so desire, naniar may sound like "noneoya" in an nz/aussie accent.

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.

Data structures for missing data

Representing missing data structure is achieved using the shadow matrix, introduced in Swayne and Buja. The shadow matrix is the same dimension as the data, and consists of binary indicators of missingness of data values, where missing is represented as "NA", and not missing is represented as "!NA". Although these may be represented as 1 and 0, respectively. This representation can be seen in the figure below, adding the suffix "_NA" to the variables. This structure can also be extended to allow for additional factor levels to be created. For example 0 indicates data presence, 1 indicates missing values, 2 indicates imputed value, and 3 might indicate a particular type or class of missingness, where reasons for missingness might be known or inferred. The data matrix can also be augmented to include the shadow matrix, which facilitates visualisation of univariate and bivariate missing data visualisations. Another format is to display it in long form, which facilitates heatmap style visualisations. This approach can be very helpful for giving an overview of which variables contain the most missingness. Methods can also be applied to rearrange rows and columns to find clusters, and identify other interesting features of the data that may have previously been hidden or unclear.

Illustration of data structures for facilitating visualisation of missings and not missings

Visualising missing data

Visualising missing data might sound a little strange - how do you visualise something that is not there? One approach to visualising missing data comes from ggobi and manet, where we replace "NA" values with values 10% lower than the minimum value in that variable. This is provided with the geom_missing_point() ggplot2 geom, which we can illustrate by exploring the relationship between Ozone and Solar radiation from the airquality dataset.


library(ggplot2)

ggplot(data = airquality,
       aes(x = Ozone,
           y = Solar.R)) +
  geom_point()
#> Warning: Removed 42 rows containing missing values (geom_point).

ggplot2 does not handle these missing values, and we get a warning message about the missing values.

We can instead use the geom_missing_point() to display the missing data


library(naniar)

ggplot(data = airquality,
       aes(x = Ozone,
           y = Solar.R)) +
  geom_missing_point()

geom_missing_point() has shifted the missing values to now be 10% below the minimum value. The missing values are a different colour so that missingness becomes pre-attentive.

This plays nicely with other parts of ggplot, like adding transparency


ggplot(data = airquality,
       aes(x = Ozone,
           y = Solar.R)) +
  geom_missing_point(alpha = 0.5)

Thanks to Luke Smith for making this pull request.

like faceting - we can split the facet by month:

p1 <-
ggplot(data = airquality,
       aes(x = Ozone,
           y = Solar.R)) + 
  geom_missing_point() + 
  facet_wrap(~Month, ncol = 2) + 
  theme(legend.position = "bottom")

p1

And then change the theme, just like you do with any other ggplot graphic


p1 + theme_bw()

You can also look at the proportion of missings in each variable with gg_missing_var:


gg_missing_var(airquality)

You can also explore the whole dataset of missings using the vis_miss function from the visdat package.


# devtools::install_github("njtierney/visdat")
library(visdat)
vis_miss(airquality)

Another approach can be to use Univariate plots split by missingness. We can do this using the bind_shadow argument to place the data and shadow side by side. This allows for us to examine univariate distributions according to the presence or absence of another variable.

library(naniar)

aq_shadow <- bind_shadow(airquality)

aq_shadow
#> # A tibble: 153 × 12
#>    Ozone Solar.R  Wind  Temp Month   Day Ozone_NA Solar.R_NA Wind_NA
#>    <int>   <int> <dbl> <int> <int> <int>   <fctr>     <fctr>  <fctr>
#> 1     41     190   7.4    67     5     1      !NA        !NA     !NA
#> 2     36     118   8.0    72     5     2      !NA        !NA     !NA
#> 3     12     149  12.6    74     5     3      !NA        !NA     !NA
#> 4     18     313  11.5    62     5     4      !NA        !NA     !NA
#> 5     NA      NA  14.3    56     5     5       NA         NA     !NA
#> 6     28      NA  14.9    66     5     6      !NA         NA     !NA
#> 7     23     299   8.6    65     5     7      !NA        !NA     !NA
#> 8     19      99  13.8    59     5     8      !NA        !NA     !NA
#> 9      8      19  20.1    61     5     9      !NA        !NA     !NA
#> 10    NA     194   8.6    69     5    10       NA        !NA     !NA
#> # ... with 143 more rows, and 3 more variables: Temp_NA <fctr>,
#> #   Month_NA <fctr>, Day_NA <fctr>

The plot below shows the values of temperature when ozone is present and missing, on the left is a faceted histogram, and on the right is an overlaid density.


library(ggplot2)

p1 <- ggplot(data = aq_shadow,
       aes(x = Temp)) + 
  geom_histogram() + 
  facet_wrap(~Ozone_NA,
             ncol = 1)

p2 <- ggplot(data = aq_shadow,
       aes(x = Temp,
           colour = Ozone_NA)) + 
  geom_density() 

gridExtra::grid.arrange(p1, p2, ncol = 2)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Numerical summaries for missing data

naniar provides numerical summaries of missing data. The percent_missing_* functions help find the proportion of missing values in the data overall, in cases, or in variables.


# Proportion elements in dataset that contains missing values
percent_missing_df(airquality)
#> [1] 4.793028
# Proportion of variables that contain any missing values
percent_missing_var(airquality)
#> [1] 33.33333
 # Proportion of cases that contain any missing values
percent_missing_case(airquality)
#> [1] 27.45098

We can also look at the number and percent of missings in each case and variable with summary_missing_case, and summary_missing_var.


summary_missing_case(airquality)
#> # A tibble: 153 × 3
#>     case n_missing  percent
#>    <int>     <int>    <dbl>
#> 1      1         0  0.00000
#> 2      2         0  0.00000
#> 3      3         0  0.00000
#> 4      4         0  0.00000
#> 5      5         2 33.33333
#> 6      6         1 16.66667
#> 7      7         0  0.00000
#> 8      8         0  0.00000
#> 9      9         0  0.00000
#> 10    10         1 16.66667
#> # ... with 143 more rows
summary_missing_var(airquality)
#> # A tibble: 6 × 3
#>   variable n_missing   percent
#>      <chr>     <int>     <dbl>
#> 1    Ozone        37 24.183007
#> 2  Solar.R         7  4.575163
#> 3     Wind         0  0.000000
#> 4     Temp         0  0.000000
#> 5    Month         0  0.000000
#> 6      Day         0  0.000000

Tabulations of the number of missings in each case or variable can be calculated with table_missing_case and table_missing_var.


table_missing_case(airquality)
#> # A tibble: 3 × 3
#>   n_missing_in_case n_cases  percent
#>               <int>   <int>    <dbl>
#> 1                 0     111 72.54902
#> 2                 1      40 26.14379
#> 3                 2       2  1.30719
table_missing_var(airquality)
#> # A tibble: 3 × 3
#>   n_missing_in_var n_vars  percent
#>              <int>  <int>    <dbl>
#> 1                0      4 66.66667
#> 2                7      1 16.66667
#> 3               37      1 16.66667

All functions can be called at once using summarise_missingness, which takes a data.frame and then returns a nested dataframe containing the percentages of missing data, and lists of dataframes containing tally and summary information for the variables and cases.


s_miss <- summarise_missingness(airquality)

s_miss
#> # A tibble: 1 × 7
#>   percent_missing_df percent_missing_var percent_missing_case
#>                <dbl>               <dbl>                <dbl>
#> 1           4.793028            33.33333             27.45098
#> # ... with 4 more variables: table_missing_case <list>,
#> #   table_missing_var <list>, summary_missing_var <list>,
#> #   summary_missing_case <list>

# overall % missing data
s_miss$percent_missing_df
#> [1] 4.793028

# % of variables that contain missing data
s_miss$percent_missing_var
#> [1] 33.33333

# % of cases that contain missing data
s_miss$percent_missing_case
#> [1] 27.45098

# tabulations of missing data across cases
s_miss$table_missing_case
#> [[1]]
#> # A tibble: 3 × 3
#>   n_missing_in_case n_cases  percent
#>               <int>   <int>    <dbl>
#> 1                 0     111 72.54902
#> 2                 1      40 26.14379
#> 3                 2       2  1.30719

# tabulations of missing data across variables
s_miss$table_missing_var
#> [[1]]
#> # A tibble: 3 × 3
#>   n_missing_in_var n_vars  percent
#>              <int>  <int>    <dbl>
#> 1                0      4 66.66667
#> 2                7      1 16.66667
#> 3               37      1 16.66667

# summary information (counts, percentrages) of missing data for variables and cases
s_miss$summary_missing_var
#> [[1]]
#> # A tibble: 6 × 3
#>   variable n_missing   percent
#>      <chr>     <int>     <dbl>
#> 1    Ozone        37 24.183007
#> 2  Solar.R         7  4.575163
#> 3     Wind         0  0.000000
#> 4     Temp         0  0.000000
#> 5    Month         0  0.000000
#> 6      Day         0  0.000000
s_miss$summary_missing_case
#> [[1]]
#> # A tibble: 153 × 3
#>     case n_missing  percent
#>    <int>     <int>    <dbl>
#> 1      1         0  0.00000
#> 2      2         0  0.00000
#> 3      3         0  0.00000
#> 4      4         0  0.00000
#> 5      5         2 33.33333
#> 6      6         1 16.66667
#> 7      7         0  0.00000
#> 8      8         0  0.00000
#> 9      9         0  0.00000
#> 10    10         1 16.66667
#> # ... with 143 more rows

Other plotting functions

These dataframes from the tidying functions are then used in these plots

gg_missing_var


gg_missing_var(airquality)

gg_missing_case


gg_missing_case(airquality)

gg_missing_which

This shows whether a given variable contains a missing variable. In this case grey = missing. Think of it as if you are shading the cell in, if it contains data.


gg_missing_which(airquality)

Future Work

naniar will be undergoing more changes over the next 6 months, with plans to have the package in CRAN around the end of 2016.

Other plans to extend the geom_missing family to include:

1D, univariate distribution plots
Categorical variables
Bivariate plots: Scatterplots, Density overlays.
Provide

Acknowledgements

Naming credit (once again!) goes to @MilesMcBain, and to @hadley for the rearranged spelling. Also thank you to @dicook for her and @hadley for putting up with my various questions and concerns, mainly around the name.

naniar

Data structures for missing data

Visualising missing data

Numerical summaries for missing data

Other plotting functions

gg_missing_var

gg_missing_case

gg_missing_which

Future Work

Acknowledgements

Copy Link

Version

Install

Monthly Downloads

Version

License

Maintainer

Last Published

Functions in naniar (0.0.4.9000)