modellingTools: Common Tools for Data Preparation and Modelling

Programming in R is delightful. Data analysis in R can be a bit challenging at times. modellingTools was created to provide a formal outlet for useful personal tools I have developed to make data preparation and analysis simpler in R. I found that too often, when attempting to get to know my dataset using R, I fell into the following pattern:

  • Try to use the basic functions available, like table for frequency distributions
  • Be unsatisfied with the usage/output, and spend 10 minutes attempting to modify it
  • Give up and lose interest

After a year or so of this, I started getting smart about it: every time I modified a base function in some useful way, I would save it in a function. But soon, I found myself following a new pattern:

  • Create a useful new function, e.g. for getting a list of input variables and their correlations with a response
  • Create a file called "great_new_functions.R" or "useful_helpers.R" or something
  • Come back in a week to a similar problem, not remember where I saved the file, and start over, wasting time recreating the function, usually with slightly different features
  • Come back in another week, and forget which version had which features, so I would create a third...

Finally I bought Hadley Wickham's book, and figured now was as good a time as any to learn how to build a package. This solves the problems above because:

  • Unit Tests! My favourite thing. This package is tested extensively, so I am confident the code will work. If it does not, please let me know and I will create more tests
  • Documentation: now I have written down exactly what every function does and what parameters they take
  • Version Control: I now only have one version of everything

A fourth benefit is: you get to use the package too! Thank you for doing so, and please let me know via email (alex@alexstringer.ca) if you have any bugs for me to fix, or suggestions for new features.

Example: Frequency distribution of a variable

Getting the frequency distribution of a variable in base R is actually surprisingly unpleasant. The table function requires vectors as input:

data(CO2)
table(CO2$conc)
#> 
#>   95  175  250  350  500  675 1000 
#>   12   12   12   12   12   12   12

As you can see, the output also isn't that pretty. You can clean up the code using with,

with(CO2,table(conc))
#> conc
#>   95  175  250  350  500  675 1000 
#>   12   12   12   12   12   12   12

or if you're really cutting-edge, with the %$% operator from the magrittr package:

# install.packages("magrittr")
library(magrittr)
CO2 %$% table(conc)
#> conc
#>   95  175  250  350  500  675 1000 
#>   12   12   12   12   12   12   12

All this for a basic frequency distribution. And don't even think about doing it for a continuous variable:

CO2 %$% table(uptake)
#> uptake
#>  7.7  9.3 10.5 10.6 11.3 11.4   12 12.3 12.5   13 13.6 13.7 14.2 14.4 14.9 
#>    1    1    1    2    1    1    1    1    1    1    1    1    1    1    1 
#> 15.1   16 16.2 17.9   18 18.1 18.9 19.2 19.4 19.5 19.9   21 21.9   22 22.2 
#>    1    1    1    3    1    1    2    1    1    1    1    1    1    1    1 
#> 24.1 25.8 26.2 27.3 27.8 27.9 28.1 28.5   30 30.3 30.4 30.6 30.9 31.1 31.5 
#>    1    1    1    2    1    1    1    1    1    1    1    1    1    1    1 
#> 31.8 32.4 32.5   34 34.6 34.8   35 35.3 35.4 35.5 37.1 37.2 37.5 38.1 38.6 
#>    1    3    1    1    1    1    1    1    1    1    1    1    1    1    1 
#> 38.7 38.8 38.9 39.2 39.6 39.7 40.3 40.6 41.4 41.8 42.1 42.4 42.9 43.9 44.3 
#>    1    1    1    1    1    1    1    1    2    1    1    1    1    1    1 
#> 45.5 
#>    1

Talk about hard to read, and that's only 84 observations!

Try proc_freq, from the modellingTools package. Advantages:

  • Simple to use: only 3 arguments
  • All data comes from a single data frame
  • Output is a tbl_df, which prints cleanly and can be passed to the View() function to browse it as a neat spreadsheet right in RStudio
  • Automatic Discretization of Continuous Variables: this is amazing for dealing with datasets with a large number of observations
  • Missing values are always included in the output, because it is always important to know about missing values
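The last point is worth a quick illustration in base R terms. By default, table() drops missing values, and you have to ask for them with the useNA argument. The snippet below is a minimal sketch on a made-up toy vector (x is hypothetical, not from the package) showing the difference; it does not use proc_freq itself.

```r
# Hypothetical toy vector with two missing values (illustration only)
x <- c("a", "b", "a", NA, "b", "a", NA)

# Default behaviour: the two NA entries are silently dropped
default_tab <- table(x)

# useNA = "always" gives missing values their own cell
full_tab <- table(x, useNA = "always")

default_tab
full_tab
```

The two tables differ only in the extra NA cell, which is easy to overlook when the default is used on a large dataset.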

For a discrete variable, we can do

proc_freq(CO2,"conc")
#> Source: local data frame [7 x 3]
#> 
#>   level count percent
#>   (dbl) (int)   (chr)
#> 1    95    12   14.3%
#> 2   175    12   14.3%
#> 3   250    12   14.3%
#> 4   350    12   14.3%
#> 5   500    12   14.3%
#> 6   675    12   14.3%
#> 7  1000    12   14.3%

as well as, for a continuous variable (without bins, every distinct value still gets its own row),

proc_freq(CO2,"uptake")
#> Source: local data frame [76 x 3]
#> 
#>    level count percent
#>    (dbl) (int)   (chr)
#> 1    7.7     1   1.19%
#> 2    9.3     1   1.19%
#> 3   10.5     1   1.19%
#> 4   10.6     2   2.38%
#> 5   11.3     1   1.19%
#> 6   11.4     1   1.19%
#> 7   12.0     1   1.19%
#> 8   12.3     1   1.19%
#> 9   12.5     1   1.19%
#> 10  13.0     1   1.19%
#> ..   ...   ...     ...

The real value comes from the bins argument, which discretizes a continuous variable before counting:

proc_freq(CO2,"uptake",bins = 4)
#> Source: local data frame [4 x 3]
#> 
#>         level count percent
#>        (fctr) (int)   (chr)
#> 1  [7.7,17.1]    19  22.62%
#> 2 (17.1,26.6]    18  21.43%
#> 3   (26.6,36]    25  29.76%
#> 4   (36,45.5]    22  26.19%
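For comparison, here is roughly what it takes to get a binned frequency table by hand in base R. This is a sketch only, assuming equal-width bins built from the observed minimum and maximum; it is not the package's implementation, and the variable names (n_bins, breaks, freq) are made up for the example.

```r
# Base R sketch (not the package's code): equal-width bins for CO2$uptake
data(CO2)
x <- CO2$uptake
n_bins <- 4

# Build n_bins + 1 equally spaced breaks, reusing the exact min and max as endpoints
lo <- min(x)
hi <- max(x)
width <- (hi - lo) / n_bins
breaks <- c(lo, lo + width * seq_len(n_bins - 1), hi)

# include.lowest = TRUE closes the first interval on the left so the minimum is kept
binned <- cut(x, breaks = breaks, include.lowest = TRUE)

# Count observations per bin and add a formatted percentage column
counts <- table(binned)
freq <- data.frame(
  level = names(counts),
  count = as.integer(counts),
  percent = sprintf("%.2f%%", 100 * as.integer(counts) / length(x))
)
freq
```

That is a fair amount of bookkeeping for something proc_freq does with a single extra argument.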

Installation Instructions

You can get the package, once it is on CRAN, by typing

install.packages("modellingTools")

Since I'm actively developing the package, it may just be better to use the development version:

install.packages("devtools")
devtools::install_github("awstringer/modellingTools")

After that, attach the package

library(modellingTools)

and you're good to go!

Overview

For a detailed overview of what the package does and an introduction to using it, see the vignette. Check out the GitHub page for all the code as well.


  • Version: 0.1.0
  • License: GPL-3
  • Maintainer: Alex Stringer
  • Last published: May 2nd, 2016

Functions in modellingTools (0.1.0)

  • binned_data_cutpoints: Get the unique cutpoints of each appropriate column in a tbl
  • vector_bin: Bin a vector into equal height, equal width, or custom bins
  • simple_bin: Discretize variables in your training and test datasets
  • proc_freq: View and return the frequency distribution of a variable
  • get_top_corrs: Get the correlation of variables in a dataset with a given response, sorted highest to lowest
  • create_model_matrix: Create a usable model matrix from a data frame containing a mix of continuous and categorical variables
  • get_vector_cutpoints: Get the cutpoints from a single factor vector
  • column_vector: Get the contents of a single column of a tbl
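
To give a feel for the kind of helper listed above, here is a base R sketch of "correlations with a response, sorted highest to lowest". It uses the built-in mtcars dataset with mpg as an arbitrary response, and it sorts by absolute value, which is an assumption on my part. It does not call get_top_corrs and may not match that function's arguments or output.

```r
# Base R sketch (illustration only, not the package's get_top_corrs)
data(mtcars)
response <- "mpg"
inputs <- setdiff(names(mtcars), response)

# cor() of a data frame against a vector returns a one-column matrix;
# taking column 1 gives a named numeric vector
corrs <- cor(mtcars[, inputs], mtcars[[response]])[, 1]

# Order by absolute correlation, strongest first (sign is preserved in the values)
top_corrs <- corrs[order(abs(corrs), decreasing = TRUE)]
top_corrs
```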