parDepPlot: Model interpretation functions: Partial Dependence Plots

Description

parDepPlot creates partial dependence plots for binary (cross-validated) classification models and regression models. For binary classification, only models estimated with the packages randomForest and ada are currently supported. For regression, only randomForest models are supported.

Usage

parDepPlot(
  x.name,
  object,
  data,
  rm.outliers = TRUE,
  fact = 1.5,
  n.pt = 50,
  robust = FALSE,
  ci = FALSE,
  u.quant = 0.75,
  l.quant = 0.25,
  xlab = substr(x.name, 1, 50),
  ylab = NULL,
  main = if (any(class(object) %in% c("randomForest", "ada")))
    paste("Partial Dependence on", substr(x.name, 1, 20)) else
    paste("Cross-Validated Partial Dependence on", substr(x.name, 1, 10)),
  logit = TRUE,
  ylim = NULL,
  ...
)

Arguments

x.name: the name of the predictor as a character string for which a partial dependence plot has to be created.
object: a model or a list of cross-validated models. Currently only binary classification models built using the packages randomForest and ada, and randomForest regression models, are supported.
data: a data frame containing the predictors for the model or a list of data frames for cross-validation with length equal to the number of models.
rm.outliers: boolean, remove the outliers in x.name. Outliers are values that are smaller than max(Q1-fact*IQR,min) or greater than min(Q3+fact*IQR,max). Overridden if xlim is used.
fact: factor to use in rm.outliers. The default is 1.5.
n.pt: if x.name is a continuous predictor, the number of points that will be used to plot the curve.
robust: if TRUE then the median is used to plot the central tendency (recommended when logit=FALSE). If FALSE the mean is used.
ci: boolean. Should a confidence interval based on quantiles be plotted? This only works if robust=TRUE.
u.quant: Upper quantile for ci. This only works if ci=TRUE and robust=TRUE.
l.quant: Lower quantile for ci. This only works if ci=TRUE and robust=TRUE.
xlab: label for the x-axis. Is determined automatically if NULL.
ylab: label for the y-axis.
main: main title for the plot.
logit: boolean. Should the y-axis be on a logit scale or not? If FALSE, it is recommended to set robust=TRUE. Only applicable for classification.
ylim: the y limits of the plot.
...: other graphical parameters for plot.
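
The outlier rule described under rm.outliers and fact can be illustrated with a small base R sketch. The helper name outlier_bounds is hypothetical and is not part of the package; the sketch only restates the documented bounds, max(Q1-fact*IQR,min) and min(Q3+fact*IQR,max), and assumes R's default quantile type.

```r
# Hypothetical helper (not part of the package): computes the bounds
# described for rm.outliers, assuming the default quantile type.
outlier_bounds <- function(x, fact = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  c(lower = max(q[1] - fact * iqr, min(x, na.rm = TRUE)),
    upper = min(q[2] + fact * iqr, max(x, na.rm = TRUE)))
}

x <- c(1, 2, 3, 4, 5, 100)
b <- outlier_bounds(x)

# Values outside [lower, upper] are treated as outliers and dropped
kept <- x[x >= b[["lower"]] & x <= b[["upper"]]]
```

A larger fact widens the bounds, so fewer values of x.name are dropped before the curve is plotted.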

Author

Authors: Michel Ballings and Dirk Van den Poel. Maintainer: Michel.Ballings@GMail.com

Details

For classification, the response variable in the model is always assumed to take on the values {0,1}. Resulting partial dependence plots always refer to class 1. Whenever strange results are obtained, the user has three options. First, make sure rm.outliers=TRUE (the default). Second, if that doesn't help, set robust=TRUE. Finally, if that doesn't help, the user can also try setting ci=TRUE (which requires robust=TRUE). Areas with wider confidence intervals typically indicate problem areas. These options help the user track down the cause of strange results and converge to better parameter values.
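
To make the roles of n.pt, robust, ci, l.quant and u.quant more concrete, the following base R sketch computes a partial dependence curve by hand. It illustrates the general technique only and is not the package's implementation: the helper manual_pd and its predict_fun argument are hypothetical, a glm fitted on mtcars stands in for a randomForest or ada model, and the results are on the probability scale rather than the logit scale.

```r
# Hypothetical helper (not the package's code): for each grid value of
# x.name, overwrite that predictor in every row, predict, and summarise.
manual_pd <- function(predict_fun, data, x.name, n.pt = 50,
                      l.quant = 0.25, u.quant = 0.75) {
  grid <- seq(min(data[[x.name]]), max(data[[x.name]]), length.out = n.pt)
  rows <- lapply(grid, function(v) {
    d <- data
    d[[x.name]] <- v            # hold x.name fixed for all observations
    p <- predict_fun(d)         # predicted probability of class 1
    data.frame(x = v,
               mean = mean(p),      # central tendency when robust = FALSE
               median = median(p),  # central tendency when robust = TRUE
               lower = quantile(p, l.quant, names = FALSE),
               upper = quantile(p, u.quant, names = FALSE))
  })
  do.call(rbind, rows)
}

# Stand-in binary classifier: logistic regression on a built-in data set
fit <- glm(am ~ wt + hp, data = mtcars, family = binomial)
predict_fun <- function(d) predict(fit, newdata = d, type = "response")

pd <- manual_pd(predict_fun, mtcars, "wt", n.pt = 10)
```

Each row of pd holds one point of the curve. A wide gap between lower and upper at a given x marks a region where predictions vary strongly across observations, which is the kind of problem area that ci=TRUE is meant to reveal.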

References

The code in this function uses part of the code from the partialPlot function in randomForest. It is expanded and generalized to support cross-validation and other packages.

Examples

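library(randomForest)
library(interpretR)  # provides parDepPlot

# Prepare data: keep two species and recode the response to {0,1}
data(iris)
iris <- iris[1:100,]
iris$Species <- as.factor(ifelse(factor(iris$Species)=="setosa",0,1))

# Make the random train/test splits reproducible
set.seed(1)

# Cross-validated models
# Estimate 10 models and create 10 test sets
data <- list()
rf <- list()
for (i in 1:10) {
  ind <- sample(nrow(iris),50)                     # training rows
  rf[[i]] <- randomForest(Species~., iris[ind,])
  data[[i]] <- iris[-ind,]                         # matching test set
}

# One model per test set: object and data are lists of equal length
parDepPlot(x.name="Petal.Width", object=rf, data=data)

# Single model
# Estimate a single model and plot on its held-out data
ind <- sample(nrow(iris),50)
rf <- randomForest(Species~., iris[ind,])
parDepPlot(x.name="Petal.Width", object=rf, data=iris[-ind,])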
