StatMatch (version 1.4.2)

pw.assoc: Pairwise measures between categorical variables

Description

This function computes several association and Proportional Reduction in Error (PRE) measures between a nominal categorical variable (the response) and each of the available predictors, which must also be categorical variables.

Usage

pw.assoc(formula, data, weights=NULL, out.df=FALSE)

Value

When out.df=FALSE (default), a list object with the following components:

V

A vector with the estimated Cramer's V for each response-predictor pair.

bcV

A vector with the estimated bias-corrected Cramer's V for each response-predictor pair.

mi

A vector with the estimated mutual information I(X;Y) for each response-predictor pair.

norm.mi

A vector with the normalized mutual information I(X;Y)* for each response-predictor pair.

lambda

A vector with the values of Goodman-Kruskal \(\lambda(Y|X)\) for each response-predictor pair.

tau

A vector with the values of Goodman-Kruskal \(\tau(Y|X)\) for each response-predictor pair.

U

A vector with the values of Theil's uncertainty coefficient U(Y|X) for each response-predictor pair.

AIC

A vector with the values of AIC(Y|X) for each response-predictor pair.

BIC

A vector with the values of BIC(Y|X) for each response-predictor pair.

npar

A vector with the number of parameters (conditional probabilities) estimated to calculate AIC and BIC for each response-predictor pair.

When out.df=TRUE the output will be a data.frame with a column for each measure.

Arguments

formula

A formula of the type y~x1+x2, where y is the name of the categorical variable (a factor in R) that plays the role of the dependent variable, while x1 and x2 are the names of the predictors (both categorical variables). Numeric variables are not allowed; any numeric variable should be categorized (see function cut) before being passed to pw.assoc.

data

The data frame which contains the variables called by formula.

weights

The name of the variable in data that provides the units' weights. Weights are used to estimate frequencies (a cell frequency is estimated by summing the weights of the units that present the given characteristic). Default is NULL (no weights available), in which case each unit counts as 1. When case weights are provided, they are scaled so that their sum equals n, the sample size (assumed to be nrow(data)).

out.df

Logical. If TRUE, the measures will be organized in a data frame (a column for each measure). Default is FALSE.
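
The input preparation described for formula and weights can be sketched in base R as follows. This is only an illustration: the data frame df and its columns are made up, and the weight rescaling mirrors the description of weights given above rather than the package's internal code.

```r
# Hypothetical toy data: one numeric variable, two factors and case weights
df <- data.frame(
  age = c(23, 37, 45, 61, 18, 52),
  x   = factor(c("a", "a", "b", "b", "b", "a")),
  y   = factor(c("u", "v", "u", "v", "v", "u")),
  w   = c(1, 2, 3, 2, 2, 2)
)

# Numeric variables must be categorized first, e.g. with cut()
df$age_cat <- cut(df$age, breaks = c(0, 30, 50, Inf),
                  labels = c("young", "middle", "older"))

# Rescale the weights so that they sum to n = nrow(df)
n <- nrow(df)
df$ws <- df$w * n / sum(df$w)

# Weighted cell frequencies: sum of the scaled weights falling in each cell
wtab <- xtabs(ws ~ x + y, data = df)
wtab
```

The factor age_cat could then be used as a predictor on the right-hand side of the formula.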

Author

Marcello D'Orazio mdo.statmatch@gmail.com

Details

This function computes association measures, PRE measures, AIC and BIC for each response-predictor pair that can be formed from the argument formula. In particular, a two-way contingency table \(X \times Y\) is built for each available X variable (X in rows and Y in columns); then the following measures are considered.

Cramer's V:

$$ V=\sqrt{\frac{\chi^2}{n \times \min\left[I-1,J-1\right]} } $$

n is the sample size, I is the number of rows (categories of X) and J is the number of columns (categories of Y). Cramer's V ranges from 0 to 1.
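
As a minimal sketch of the formula above, Cramer's V can be computed in base R from a table of counts. The 2 x 2 table tab is hypothetical, and the Pearson chi-square is taken without continuity correction, which is an assumption about how \(\chi^2\) is meant here.

```r
# Hypothetical 2 x 2 table of counts: X in rows, Y in columns
tab <- matrix(c(30, 10,
                20, 40), nrow = 2, byrow = TRUE,
              dimnames = list(X = c("a", "b"), Y = c("y1", "y2")))

n <- sum(tab)

# Pearson chi-square statistic, no continuity correction (assumption)
chi2 <- unname(chisq.test(tab, correct = FALSE)$statistic)

# Cramer's V: chi-square scaled by n * min(I - 1, J - 1)
V <- sqrt(chi2 / (n * min(nrow(tab) - 1, ncol(tab) - 1)))
V
```

For this table the chi-square statistic is 50/3, so V equals the square root of 1/6, roughly 0.41.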

Bias-corrected Cramer's V (\(V_c\)), as proposed by Bergsma (2013).
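
The correction is not spelled out on this page. The sketch below follows the formula as commonly attributed to Bergsma (2013): the estimate of \(\phi^2=\chi^2/n\) is shrunk by \((I-1)(J-1)/(n-1)\) and the numbers of rows and columns are adjusted accordingly. Treat it as an assumption about what bcV contains, not as the package's code; the table tab is hypothetical.

```r
# Hypothetical 2 x 2 table of counts: X in rows, Y in columns
tab <- matrix(c(30, 10,
                20, 40), nrow = 2, byrow = TRUE,
              dimnames = list(X = c("a", "b"), Y = c("y1", "y2")))

n     <- sum(tab)
n_row <- nrow(tab)
n_col <- ncol(tab)

# phi^2 = chi^2 / n, from the uncorrected Pearson chi-square (assumption)
phi2 <- unname(chisq.test(tab, correct = FALSE)$statistic) / n

# Bias-corrected phi^2, truncated at zero
phi2c <- max(0, phi2 - (n_row - 1) * (n_col - 1) / (n - 1))

# Bias-corrected numbers of rows and columns
row_c <- n_row - (n_row - 1)^2 / (n - 1)
col_c <- n_col - (n_col - 1)^2 / (n - 1)

Vc <- sqrt(phi2c / min(row_c - 1, col_c - 1))
Vc
```

For this table the corrected value is slightly below the uncorrected V, as expected for a correction that removes positive bias.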

Mutual information:

$$ I(X;Y) = \sum_{i,j} p_{ij} \, \log \left( \frac{p_{ij}}{p_{i+} p_{+j}} \right) $$

where \(p_{ij}=n_{ij}/n\). It is equal to 0 in case of independence but has no finite upper bound, i.e. \(0 \leq I(X;Y) < \infty\).

A normalized version of \(I(X;Y)\), ranging from 0 (independence) to 1 and not affected by the number of categories (I and J):

$$I(X;Y)^* = \frac{I(X;Y)}{\min(H_X, H_Y) } $$

where \(H_X\) and \(H_Y\) are the entropies of the variables X and Y, respectively.
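
The mutual information and its normalized version can be sketched in base R as follows. The helper mutual_info and the table tab are hypothetical names introduced for illustration; natural logarithms are assumed (the normalized version does not depend on the base), and empty cells are skipped, treating 0 log 0 as 0.

```r
# Hypothetical helper: mutual information and its normalized version
mutual_info <- function(tab) {
  p  <- tab / sum(tab)        # joint proportions p_ij
  pr <- rowSums(p)            # row marginals p_i+
  pc <- colSums(p)            # column marginals p_+j
  e  <- outer(pr, pc)         # products p_i+ * p_+j
  nz <- p > 0                 # skip empty cells (0 * log 0 taken as 0)
  mi <- sum(p[nz] * log(p[nz] / e[nz]))
  entropy <- function(q) -sum(q[q > 0] * log(q[q > 0]))
  c(mi = mi, norm.mi = mi / min(entropy(pr), entropy(pc)))
}

# Hypothetical 2 x 2 table of counts: X in rows, Y in columns
tab <- matrix(c(30, 10,
                20, 40), nrow = 2, byrow = TRUE,
              dimnames = list(X = c("a", "b"), Y = c("y1", "y2")))

res <- mutual_info(tab)
res
```

A table whose cells are exactly the product of its margins, such as outer(c(2, 3), c(1, 4)), gives a mutual information of 0 up to rounding error.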

Goodman-Kruskal \(\lambda(Y|X)\) (i.e. response conditional on the given predictor):

$$ \lambda(Y|X) = \frac{\sum_{i=1}^I \max_{j}(p_{ij}) - \max_{j}(p_{+j})}{1-\max_{j}(p_{+j})} $$

It ranges from 0 to 1, and denotes how much the knowledge of the row variable X (predictor) helps in reducing the prediction error of the values of the column variable Y (response).
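
A minimal base R sketch of \(\lambda(Y|X)\), using a hypothetical table tab with X in rows and Y in columns:

```r
# Hypothetical 2 x 2 table of counts: X in rows, Y in columns
tab <- matrix(c(30, 10,
                20, 40), nrow = 2, byrow = TRUE,
              dimnames = list(X = c("a", "b"), Y = c("y1", "y2")))

p <- tab / sum(tab)                  # joint proportions p_ij

best_marginal <- max(colSums(p))     # max_j p_+j: best guess ignoring X
best_by_row   <- sum(apply(p, 1, max))  # sum_i max_j p_ij: best guess given X

lambda <- (best_by_row - best_marginal) / (1 - best_marginal)
lambda
```

Here the row-wise maxima sum to 0.7 and the largest column marginal is 0.5, so \(\lambda(Y|X) = 0.2/0.5 = 0.4\): knowing X removes 40% of the prediction errors made when guessing the modal category of Y.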

Goodman-Kruskal \(\tau(Y|X)\):

$$ \tau(Y|X) = \frac{ \sum_{i=1}^I \sum_{j=1}^J p^2_{ij}/p_{i+} - \sum_{j=1}^J p_{+j}^2}{1 - \sum_{j=1}^J p_{+j}^2} $$

It takes values in the interval [0,1] and has the same PRE interpretation as \(\lambda\).
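
The same kind of base R sketch applies to \(\tau(Y|X)\). The table tab is hypothetical, and the code assumes that no row of the table is empty (otherwise \(p_{i+}=0\) would produce a division by zero).

```r
# Hypothetical 2 x 2 table of counts: X in rows, Y in columns
tab <- matrix(c(30, 10,
                20, 40), nrow = 2, byrow = TRUE,
              dimnames = list(X = c("a", "b"), Y = c("y1", "y2")))

p  <- tab / sum(tab)     # joint proportions p_ij
pr <- rowSums(p)         # row marginals p_i+
pc <- colSums(p)         # column marginals p_+j

# Dividing a matrix by a vector of length nrow() recycles down the columns,
# so each p_ij^2 is divided by the marginal of its own row
within <- sum(p^2 / pr)  # sum_ij p_ij^2 / p_i+
margin <- sum(pc^2)      # sum_j p_+j^2

tau <- (within - margin) / (1 - margin)
tau
```

For this table \(\tau(Y|X) = 1/6\), roughly 0.17.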

Theil's uncertainty coefficient:

$$ U(Y|X) = \frac{\sum_{i=1}^I \sum_{j=1}^J p_{ij} \log(p_{ij}/p_{i+}) - \sum_{j=1}^J p_{+j} \log p_{+j}}{- \sum_{j=1}^J p_{+j} \log p_{+j}} $$

It takes values in the interval [0,1] and measures the reduction of uncertainty in the column variable Y due to knowing the row variable X. Note that the numerator of U(Y|X) is the mutual information I(X;Y).
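
Theil's U can be sketched in base R directly from the formula above. The table tab is hypothetical, natural logarithms are assumed (the ratio does not depend on the base), and empty cells are skipped.

```r
# Hypothetical 2 x 2 table of counts: X in rows, Y in columns
tab <- matrix(c(30, 10,
                20, 40), nrow = 2, byrow = TRUE,
              dimnames = list(X = c("a", "b"), Y = c("y1", "y2")))

p    <- tab / sum(tab)     # joint proportions p_ij
pc   <- colSums(p)         # column marginals p_+j
cond <- p / rowSums(p)     # conditional proportions p_ij / p_i+
nz   <- p > 0              # skip empty cells (0 * log 0 taken as 0)

H_Y         <- -sum(pc[pc > 0] * log(pc[pc > 0]))  # entropy of Y
H_Y_given_X <- -sum(p[nz] * log(cond[nz]))         # conditional entropy of Y given X

# Numerator H_Y - H_Y_given_X is the mutual information I(X;Y)
U <- (H_Y - H_Y_given_X) / H_Y
U
```

For this table U is about 0.12, i.e. knowing X removes roughly 12% of the entropy of Y.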

It is worth noting that \(\lambda\), \(\tau\) and U can be viewed as measures of the proportional reduction of the variance of the Y variable when passing from its marginal distribution to its conditional distribution given the predictor X, derived from the general expression (cf. Agresti, 2002, p. 56):

$$ \frac{V(Y) - E[V(Y|X)]}{V(Y)}$$

They differ in the way the variance is measured; in fact, there is no generally accepted definition of the variance of a categorical variable.

Finally, AIC and BIC are calculated as suggested in Sakamoto and Akaike (1977). In particular:

$$ AIC(Y|X) = -2 \sum_{i,j} n_{ij} \, \log \left( \frac{n_{ij}}{n_{i+}} \right) + 2I(J - 1) $$

$$ BIC(Y|X) = -2 \sum_{i,j} n_{ij} \, \log \left( \frac{n_{ij}}{n_{i+}} \right) +I(J-1) \log(n) $$

where \(I(J-1)\) is the number of parameters (conditional probabilities) to estimate. Note that the R package catdap provides functions to identify the best subset of predictors based on AIC.
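
The two criteria can be sketched in base R from a table of counts. The table tab is hypothetical, natural logarithms are assumed, and empty cells are skipped in the log-likelihood sum.

```r
# Hypothetical 2 x 2 table of counts: X in rows, Y in columns
tab <- matrix(c(30, 10,
                20, 40), nrow = 2, byrow = TRUE,
              dimnames = list(X = c("a", "b"), Y = c("y1", "y2")))

n    <- sum(tab)
cond <- tab / rowSums(tab)   # conditional proportions n_ij / n_i+
nz   <- tab > 0              # skip empty cells

# Multinomial log-likelihood of Y given X
loglik <- sum(tab[nz] * log(cond[nz]))

# Number of free conditional probabilities: I * (J - 1)
npar <- nrow(tab) * (ncol(tab) - 1)

aic <- -2 * loglik + 2 * npar
bic <- -2 * loglik + npar * log(n)
c(AIC = aic, BIC = bic, npar = npar)
```

With n = 100 the BIC penalty per parameter, log(n), exceeds the AIC penalty of 2, so BIC is larger than AIC for this table.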

Please note that missing values are excluded from the tables and therefore from the estimation of the various measures.

References

Agresti A (2002) Categorical Data Analysis. Second Edition. Wiley, New York.

Bergsma W (2013) A bias-correction for Cramer's V and Tschuprow's T. Journal of the Korean Statistical Society, 42, 323-328.

The Institute of Statistical Mathematics (2018) catdap: Categorical Data Analysis Program Package. R package version 1.3.4. https://CRAN.R-project.org/package=catdap

Sakamoto Y and Akaike H (1977) Analysis of Cross-Classified Data by AIC. Ann. Inst. Statist. Math., 30, 185-197.

Examples

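library(StatMatch) # provides pw.assoc
data(quine, package="MASS") # loads quine from MASS
str(quine)

# Lrn is the response variable
pw.assoc(Lrn~Age+Sex+Eth, data=quine)

# usage of units' weights
set.seed(1) # arbitrary seed, added so that the random weights are reproducible
quine$ww <- runif(nrow(quine), 1,4) # randomly generated weights, 1<=weights<=4
pw.assoc(Lrn~Age+Sex+Eth, data=quine, weights="ww")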
