pw.assoc: Pairwise association measure between categorical variables

Description

This function computes several association measures between a nominal categorical response variable and each of the available predictors, which must also be categorical.

Usage

pw.assoc(formula, data, weights=NULL, freq0c=NULL)

Arguments

formula

A formula of the type y~x1+x2 where y denotes the name of the categorical variable (a factor in R) which plays the role of the dependent variable while x1 and x2 are the name of t

data

The data frame which contains the variables called by formula.

weights

The name of the optional variable in data that provides the units' weights. Weights are used to estimate frequencies: a cell frequency is estimated by summing the weights of the units that fall in that cell. Default is

freq0c

A small number that replaces any cell with zero frequency, in order to avoid computation failures. When NULL (default), a cell with zero frequency is replaced with 1/N^2, where N is the sample size.
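The need for this replacement can be seen in a minimal base R sketch. The table below is hypothetical, and the substitution is applied directly to the counts as an assumption; how pw.assoc handles the cells internally (for example, whether it renormalizes afterwards) is not stated here.

```r
# Hypothetical 2 x 3 table of counts with one empty cell (row 2, column 1)
tab <- matrix(c(12, 0,
                 8, 10,
                 5, 15), nrow = 2)
N <- sum(tab)

# Helper (illustrative only): sum of p_ij * log(p_ij / p_+j)
log_term <- function(counts) {
  p <- counts / sum(counts)
  sum(p * log(sweep(p, 2, colSums(p), "/")))
}

naive_term <- log_term(tab)          # 0 * log(0) gives NaN

# Replace zero cells with 1/N^2, the default described for freq0c = NULL
tab_fixed <- tab
tab_fixed[tab_fixed == 0] <- 1 / N^2
fixed_term <- log_term(tab_fixed)    # finite

c(naive = naive_term, fixed = fixed_term)
```

The replacement value is tiny relative to the observed counts, so the other cell proportions are barely affected.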

Value

A list with four components.

V

A vector with the estimated Cramer's V for each response-predictor pair.

lambda

A vector with the values of Goodman-Kruskal $\lambda(R|C)$ for each response-predictor pair.

tau

A vector with the values of Goodman-Kruskal $\tau(R|C)$ for each response-predictor pair.

U

A vector with the values of Theil's uncertainty coefficient $U(R|C)$ for each response-predictor pair.

Details

This function computes association measures between the response variable and each of the predictors specified in the formula. The following association measures are considered:

Cramer's V:

$$V=\sqrt{\frac{\chi^2}{N \times \min\left[I-1,J-1\right]} }$$

N is the sample size, I is the number of rows and J is the number of columns. Cramer's V ranges from 0 to 1.
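As a minimal sketch of this formula in base R (not the internal code of pw.assoc), the following uses a hypothetical 2 x 3 table with the response in rows and the predictor in columns:

```r
# Hypothetical 2 x 3 table: rows = response R, columns = predictor C
tab <- matrix(c(20, 10,
                 5, 15,
                10, 40), nrow = 2,
              dimnames = list(R = c("r1", "r2"), C = c("c1", "c2", "c3")))

N <- sum(tab)                                        # sample size
expected <- outer(rowSums(tab), colSums(tab)) / N    # counts under independence
chi2 <- sum((tab - expected)^2 / expected)           # Pearson chi-squared
V <- sqrt(chi2 / (N * min(nrow(tab) - 1, ncol(tab) - 1)))
V
```

The chi-squared value computed by hand here agrees with the statistic returned by `chisq.test(tab)`.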

Goodman-Kruskal $\lambda(R|C)$:

$$\lambda(R|C) = \frac{\sum_{j=1}^J \max_{i}(p_{ij}) - \max_{i}(p_{i+})}{1-\max_{i}(p_{i+})}$$

It ranges from 0 to 1, and denotes how much the knowledge of the column variable (predictor) helps in reducing the prediction error of the values of the row variable.
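A minimal base R sketch of $\lambda(R|C)$ on a hypothetical 2 x 3 table follows; the variable names are illustrative and this is not the internal code of pw.assoc.

```r
# Hypothetical 2 x 3 table: rows = response R, columns = predictor C
tab <- matrix(c(20, 10,
                 5, 15,
                10, 40), nrow = 2)

p <- tab / sum(tab)                       # joint proportions p_ij
p_row <- rowSums(p)                       # row marginals p_i+

best_given_col <- sum(apply(p, 2, max))   # sum over j of max_i p_ij
best_marginal <- max(p_row)               # max_i p_i+

lambda <- (best_given_col - best_marginal) / (1 - best_marginal)
lambda
```

Here predicting the modal row category is right 65% of the time without the predictor and 75% of the time with it, so the prediction error falls from 0.35 to 0.25.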

Goodman-Kruskal $\tau(R|C)$:

$$\tau(R|C) = \frac{ \sum_{i=1}^I \sum_{j=1}^J p^2_{ij}/p_{+j} - \sum_{i=1}^I p_{i+}^2}{1 - \sum_{i=1}^I p_{i+}^2}$$

It takes values in the interval [0,1] and has the same proportional reduction in error (PRE) interpretation as $\lambda$.
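The $\tau(R|C)$ formula can be sketched in base R as follows, again on a hypothetical 2 x 3 table and with illustrative variable names (not the internal code of pw.assoc):

```r
# Hypothetical 2 x 3 table: rows = response R, columns = predictor C
tab <- matrix(c(20, 10,
                 5, 15,
                10, 40), nrow = 2)

p <- tab / sum(tab)       # joint proportions p_ij
p_row <- rowSums(p)       # row marginals p_i+
p_col <- colSums(p)       # column marginals p_+j

# sum over i and j of p_ij^2 / p_+j (each column divided by its marginal)
within_cols <- sum(sweep(p^2, 2, p_col, "/"))
marg_sq <- sum(p_row^2)   # sum over i of p_i+^2

tau <- (within_cols - marg_sq) / (1 - marg_sq)
tau
```

With only two response categories, as in this table, $\tau(R|C)$ coincides with $\chi^2/N$, which gives a quick sanity check.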

Theil's uncertainty coefficient $U(R|C)$:

$$U(R|C) = \frac{\sum_{i=1}^I \sum_{j=1}^J p_{ij} \log(p_{ij}/p_{+j}) - \sum_{i=1}^I p_{i+} \log p_{i+}}{- \sum_{i=1}^I p_{i+} \log p_{i+}}$$

It takes values in the interval [0,1] and measures the reduction of uncertainty in the row variable due to knowing the column variable.
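A minimal base R sketch of $U(R|C)$ is given below. It assumes a hypothetical table with no empty cells, so no zero-frequency replacement is needed; names such as `H_R` are illustrative.

```r
# Hypothetical 2 x 3 table: rows = response R, columns = predictor C
tab <- matrix(c(20, 10,
                 5, 15,
                10, 40), nrow = 2)

p <- tab / sum(tab)                       # joint proportions p_ij
p_row <- rowSums(p)                       # row marginals p_i+
p_col <- colSums(p)                       # column marginals p_+j

H_R <- -sum(p_row * log(p_row))           # entropy of the row variable
cond <- sweep(p, 2, p_col, "/")           # p_ij / p_+j
H_R_given_C <- -sum(p * log(cond))        # conditional entropy of R given C

U <- (H_R - H_R_given_C) / H_R
U
```

The numerator is the mutual information between the two variables, so the same value is obtained from the entropies of the two marginals and of the joint distribution.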

It is worth noting that $\lambda$, $\tau$ and U are asymmetric measures of the proportional reduction in the variance of the row variable when passing from its marginal distribution to its conditional distribution given the column variable. All three are obtained from the general expression (cf. Agresti, 2002, p. 56):

$$\frac{V(R) - E[V(R|C)]}{V(R)}$$

They differ in how the variance is measured; in fact, there is no generally accepted definition of the variance of a categorical variable.
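As a worked illustration, the three measures correspond to the following standard choices of $V(R)$, read off from the denominators of the formulas above:

$$\lambda:\ V(R) = 1 - \max_{i}(p_{i+}), \qquad \tau:\ V(R) = 1 - \sum_{i=1}^I p_{i+}^2, \qquad U:\ V(R) = - \sum_{i=1}^I p_{i+} \log p_{i+}$$

Taking $\tau$ as an example, and writing the conditional proportions as $p_{ij}/p_{+j}$, the expected conditional variance is

$$E[V(R|C)] = \sum_{j=1}^J p_{+j} \left[ 1 - \sum_{i=1}^I \left( \frac{p_{ij}}{p_{+j}} \right)^2 \right] = 1 - \sum_{i=1}^I \sum_{j=1}^J p^2_{ij}/p_{+j}$$

so that $V(R) - E[V(R|C)] = \sum_{i=1}^I \sum_{j=1}^J p^2_{ij}/p_{+j} - \sum_{i=1}^I p_{i+}^2$, which is the numerator of $\tau(R|C)$.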

References

Agresti A (2002) Categorical Data Analysis. Second Edition. Wiley, New York.

Examples


library(StatMatch) # provides pw.assoc
data(quine, package="MASS") # loads quine from MASS
str(quine)

# Lrn is the response variable
pw.assoc(Lrn~Age+Sex+Eth, data=quine)

# usage of units' weights
quine$ww <- runif(nrow(quine), 1, 4) # random weights between 1 and 4
pw.assoc(Lrn~Age+Sex+Eth, data=quine, weights="ww")
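When weights are supplied, each cell frequency is estimated by summing the weights of the units in that cell. A minimal sketch of such a weighted table in base R, using a small hypothetical data frame (`df`, `y`, `x` and `w` are illustrative names, and this is not necessarily how pw.assoc builds its tables internally):

```r
# Hypothetical data: categorical response y, categorical predictor x, unit weights w
df <- data.frame(
  y = factor(c("a", "a", "b", "b", "a", "b")),
  x = factor(c("u", "v", "u", "v", "u", "u")),
  w = c(1.5, 2, 1, 3, 2.5, 1)
)

# With a left-hand side, xtabs sums that variable within each cell
wtab <- xtabs(w ~ y + x, data = df)
wtab

# Unweighted counts for comparison
table(df$y, df$x)
```

The weighted table sums to the total of the weights rather than to the number of units.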
