niv: Adjusted Net Information Value

Description

This function computes an adjusted net information value for each variable specified on the right-hand side of the formula. It can serve as an exploratory tool for a preliminary assessment of each variable's predictive power for uplift.

Usage

niv(formula, data, subset, na.action = na.pass, B = 10, direction = 1, 
nbins = 10, continuous = 4, plotit = TRUE, ...)

Arguments

formula

a formula expression of the form response ~ predictors. A special term of the form trt() must be used in the model equation to identify the binary treatment variable. For example, if the treatment is represented by a variable named trea

data

a data.frame in which to interpret the variables named in the formula.

subset

expression indicating which subset of the rows of data should be included. All observations are included by default.

na.action

a missing-data filter function. This is applied to the model.frame after any subset argument has been used. Default is na.action = na.pass.

B

the number of bootstrap samples used to compute the adjusted net information value.

direction

if set to 1 (default), the net weight of evidence is computed as the difference between the weight of evidence of the treatment and control groups, or if 2, it is computed as the difference between the weight of evidence of the c

nbins

the number of bins created from numeric predictors. The bins are created based on quantiles, with a default value of 10 (deciles).

continuous

specifies the threshold at which a variable is considered continuous: a numeric variable is treated as continuous when its number of unique values is at least the value of continuous. The default is 4. Factor variables are always considered categorical, no matter how many levels they have.

plotit

logical; if TRUE (default), plot the adjusted net information value for each variable.

...

additional arguments passed to barplot.

Value

A list with two components:

niv_val

a matrix with the following columns: niv (the average net information value for each variable over all bootstrap samples), penalty (the penalty term calculated as described in Details), and the adjusted net information value (the difference between the previous two columns).

nwoe

a list of matrices, one for each variable. The columns represent: the distribution of the responses (y=1) over the treated group (ct1.y1), the distribution of the non-responses (y=0) over the treated group (ct1.y0), the distribution of the responses (y=1) over the control group (ct0.y1), the distribution of the non-responses (y=0) over the control group (ct0.y0), the weight of evidence over the treated group (ct1.woe), the weight of evidence over the control group (ct0.woe), and the net weight of evidence (nwoe).

Details

The ordinary information value (commonly used in credit scoring applications) is given by

$$IV = \sum_{i=1}^{G} \left (P(x=i|y=1) - P(x=i|y=0) \right) \times WOE_i$$

where $G$ is the number of groups created from a numeric predictor or categories from a categorical predictor, and $WOE_i = \ln \left( \frac{P(x=i|y=1)}{P(x=i|y=0)} \right)$.
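As an illustration of the IV formula, the following base R sketch computes it for a predictor that is already categorical or binned. The helper iv_sketch and the toy data are hypothetical and are not part of the uplift package; the sketch also assumes every group contains both responses and non-responses, since an empty cell would give an infinite weight of evidence.

```r
# Hypothetical helper (not part of the uplift package): ordinary information
# value of a categorical or pre-binned predictor x for a binary response y.
iv_sketch <- function(x, y) {
  tab <- table(x, y)                    # counts per group and response
  p1 <- tab[, "1"] / sum(tab[, "1"])    # P(x = i | y = 1)
  p0 <- tab[, "0"] / sum(tab[, "0"])    # P(x = i | y = 0)
  woe <- log(p1 / p0)                   # WOE_i for each group
  sum((p1 - p0) * woe)                  # IV
}

# Toy data, invented for illustration only
x <- c("a", "a", "a", "a", "b", "b", "b", "b")
y <- c(1, 1, 1, 0, 1, 0, 0, 0)
iv_toy <- iv_sketch(x, y)
iv_toy
```

Here group a holds 3 of the 4 responses and group b holds 3 of the 4 non-responses, so both groups contribute 0.5 * log(3) and the IV equals log(3).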

The net information value is the natural extension of the IV for the case of uplift. It is computed as

$$NIV = 100 \times \sum_{i=1}^{G}(P(x=i|y=1)^{T} \times P(x=i|y=0)^{C} - P(x=i|y=0)^{T} \times P(x=i|y=1)^{C}) \times NWOE_i$$

where $NWOE_i = WOE_i^{T} - WOE_i^{C}$.
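The NIV formula can be sketched in base R in the same way. The helper niv_sketch is hypothetical and is not the package's implementation; it assumes a pre-binned predictor, a 0/1 response, a 0/1 treatment indicator, and no empty cells. It uses the treatment-minus-control form of the net weight of evidence given above.

```r
# Hypothetical helper (not part of the uplift package): net weight of
# evidence and net information value for one pre-binned predictor.
niv_sketch <- function(x, y, treat) {
  x <- factor(x)  # keep every group in each sub-table
  # Distribution of x among observations with the given arm and response
  dist <- function(arm, resp) {
    keep <- treat == arm & y == resp
    as.vector(table(x[keep])) / sum(keep)
  }
  t1 <- dist(1, 1); t0 <- dist(1, 0)    # treated: P(x|y=1)^T, P(x|y=0)^T
  c1 <- dist(0, 1); c0 <- dist(0, 0)    # control: P(x|y=1)^C, P(x|y=0)^C
  nwoe <- log(t1 / t0) - log(c1 / c0)   # NWOE_i = WOE_i^T - WOE_i^C
  niv <- 100 * sum((t1 * c0 - t0 * c1) * nwoe)
  list(nwoe = setNames(nwoe, levels(x)), niv = niv)
}

# Toy data, invented for illustration only: x matters in the treated
# group (first 8 rows) and is uninformative in the control group.
x <- rep(c("a", "b"), each = 4, times = 2)
treat <- rep(c(1, 0), each = 8)
y <- c(1, 1, 1, 0,  1, 0, 0, 0,  1, 1, 0, 0,  1, 1, 0, 0)
res_niv <- niv_sketch(x, y, treat)
res_niv$nwoe
res_niv$niv
```

In this toy example the control group has a weight of evidence of zero in both groups, so the net weight of evidence equals the treated-group weight of evidence, log(3) for a and -log(3) for b.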

The adjusted net information value is computed as follows:

1. Take $B$ bootstrap samples and compute the NIV for each variable on each sample.

2. Compute the mean of the NIV ($NIV_{mean}$) and the standard deviation of the NIV ($NIV_{sd}$) for each variable over all $B$ bootstrap samples.

3. The adjusted NIV for a given variable is computed by subtracting a penalty term from the mean NIV: $NIV_{mean} - \frac{NIV_{sd}}{\sqrt{B}}$.
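The three steps above can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the package's code: niv_one and adjusted_niv_sketch are hypothetical helpers, the predictor is assumed to be pre-binned, the data are simulated, and the element name adj_niv is chosen here for illustration.

```r
# Hypothetical compact NIV for one pre-binned predictor (0/1 y and treat)
niv_one <- function(x, y, treat) {
  x <- factor(x)
  dist <- function(arm, resp) {
    keep <- treat == arm & y == resp
    as.vector(table(x[keep])) / sum(keep)
  }
  t1 <- dist(1, 1); t0 <- dist(1, 0); c1 <- dist(0, 1); c0 <- dist(0, 0)
  100 * sum((t1 * c0 - t0 * c1) * (log(t1 / t0) - log(c1 / c0)))
}

# Hypothetical helper: bootstrap-adjusted NIV for one predictor
adjusted_niv_sketch <- function(x, y, treat, B = 10) {
  n <- length(y)
  niv_b <- replicate(B, {
    idx <- sample.int(n, n, replace = TRUE)  # step 1: one bootstrap sample
    niv_one(x[idx], y[idx], treat[idx])
  })
  niv_mean <- mean(niv_b)                    # step 2: mean over samples
  penalty <- sd(niv_b) / sqrt(B)             # step 2: sd, scaled by sqrt(B)
  c(niv = niv_mean, penalty = penalty,
    adj_niv = niv_mean - penalty)            # step 3: subtract the penalty
}

# Simulated data, invented for illustration only: the treatment raises
# the response rate only in group "a".
set.seed(1)
n <- 2000
x <- sample(c("a", "b", "c"), n, replace = TRUE)
treat <- rbinom(n, 1, 0.5)
y <- rbinom(n, 1, 0.3 + 0.2 * treat * (x == "a"))
res_adj <- adjusted_niv_sketch(x, y, treat, B = 10)
res_adj
```

Because the penalty is never negative, the adjusted value never exceeds the mean NIV; variables whose NIV varies strongly across bootstrap samples are penalized more.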

References

Larsen, K. (2009). Net lift models. In: M2009 - 12th Annual SAS Data Mining Conference.

Examples

library(uplift)

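# Simulate an uplift data set with 20 predictors
set.seed(12345)
dd <- sim_pte(n = 1000, p = 20, rho = 0, sigma = sqrt(2), beta.den = 4)
# Recode the treatment indicator to 0/1
dd$treat <- ifelse(dd$treat == 1, 1, 0)

# Adjusted net information value for the first six predictors
niv.1 <- niv(y ~ X1 + X2 + X3 + X4 + X5 + X6 + trt(treat), data = dd)
niv.1$niv   # NIV, penalty, and adjusted NIV for each variable
niv.1$nwoe  # net weight of evidence tables, one per variable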