interestMeasure: Calculate Additional Interest Measures

Description

Provides the generic function interestMeasure and the needed S4 method to calculate various additional interest measures for existing sets of itemsets or rules. Definitions and equations can be found in Hahsler (2015).

Usage

interestMeasure(x, measure, transactions = NULL, reuse = TRUE, ...)

Arguments

x

a set of itemsets or rules.

measure

name or vector of names of the desired interest measures (see details for available measures). If measure is missing then all available measures are calculated.

transactions

the transaction data set used to mine the associations, or a different set of transactions to calculate interest measures from (Note: you need to set reuse=FALSE in the latter case).

reuse

logical indicating if information in the quality slot should be reused for calculating the measures. This speeds up the process significantly since only very little (or no) transaction counting is necessary if support, confidence and lift are already available. Use reuse=FALSE to force counting (this might be very slow but is necessary if you use a different set of transactions than was used for mining).

…

further arguments for the measure calculation.

Value

If only one measure is used, the function returns a numeric vector containing the values of the interest measure for each association in the set of associations x.

If more than one measure is specified, the result is a data.frame containing the different measures for each association.

NA is returned for rules/itemsets for which a certain measure is not defined.

Details

For itemsets $X$ the following measures are implemented:

"allConfidence" (Omiecinski, 2003)

Defined on itemsets as the minimum confidence of all possible rules generated from the itemset.

Range: $[0, 1]$

"crossSupportRatio", cross-support ratio (Xiong et al., 2003)

Defined on itemsets as the ratio of the support of the least frequent item to the support of the most frequent item, i.e., $\frac{min(supp(x \in X))}{max(supp(x \in X))}$. Cross-support patterns have a ratio smaller than a set threshold. Normally many found patterns are cross-support patterns which contain frequent as well as rare items. Such patterns often tend to be spurious.

Range: $[0, 1]$

"lift"

Probability (support) of the itemset over the product of the probabilities of all items in the itemset, i.e., $\frac{supp(X)}{\prod_{x \in X} supp(x)}$. This is a measure of dependence similar to lift for rules.

Range: $[0, \infty]$ (1 indicates independence)

"support", supp (Agrawal et al., 1996)

Support is an estimate of $P(X)$, a measure of generality of the itemset.

Range: $[0, 1]$

"count"

Absolute support count of the itemset.

Range: $[0, \infty]$
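
As a rough illustration of the itemset measures above, the following base R sketch computes them by hand for a small made-up transaction matrix. The data, the helper supp() and all variable names are hypothetical and not part of the package. All-confidence is computed here as the itemset support divided by the largest single-item support, which gives the minimum confidence because the rule with the most frequent antecedent has the lowest confidence.

```r
# Toy data (hypothetical): rows are transactions, columns are items.
trans <- matrix(
  c(1, 1, 0,
    1, 1, 1,
    1, 0, 0,
    1, 1, 0,
    0, 0, 1),
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("a", "b", "c"))
) == 1

# Support of an itemset: share of transactions that contain all of its items.
supp <- function(items) {
  mean(rowSums(trans[, items, drop = FALSE]) == length(items))
}

X <- c("a", "b")
item_supp <- sapply(X, supp)                 # support of each single item in X

count_X  <- supp(X) * nrow(trans)            # "count"
all_conf <- supp(X) / max(item_supp)         # "allConfidence"
cs_ratio <- min(item_supp) / max(item_supp)  # "crossSupportRatio"
lift_X   <- supp(X) / prod(item_supp)        # itemset "lift"

print(c(support = supp(X), count = count_X, allConfidence = all_conf,
        crossSupportRatio = cs_ratio, lift = lift_X))
```

An itemset lift above 1 in this toy example suggests that the items occur together more often than expected under independence.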

For rules $X \Rightarrow Y$ the following measures are implemented. In the following we use the notation $supp(X \Rightarrow Y) = supp(X \cup Y)$ to indicate the support of the union of the itemsets $X$ and $Y$, i.e., the proportion of the transactions that contain both itemsets. We also use $\overline{X}$ as the complement itemset to $X$ with $supp(\overline{X}) = 1 - supp(X)$, i.e., the proportion of transactions that do not contain $X$.

"addedValue", added Value, AV, Pavillon index, centered confidence (Tan et al., 2002)

Defined as $conf(X \Rightarrow Y) - supp(Y)$

Range: $[-.5, 1]$

"chiSquared", $\chi^2$ (Liu et al., 1999)

The chi-squared statistic to test for independence between the lhs and rhs of the rule. The critical value of the chi-squared distribution with $1$ degree of freedom (2x2 contingency table) at $\alpha=0.05$ is $3.84$; higher chi-squared values indicate that the lhs and the rhs are not independent. Note that the contingency table is likely to have cells with low expected values and that thus Fisher's Exact Test might be more appropriate (see below).

Called with significance=TRUE, the p-value of the test for independence is returned instead of the chi-squared statistic. For p-values, substitution effects can be tested using the parameter complements = FALSE.

Range: $[0, \infty]$
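
To make the 2x2 contingency table behind this test concrete, here is a minimal base R sketch with made-up counts. It uses stats::chisq.test() without continuity correction; whether that matches the package's internal computation exactly is an assumption, and all counts and variable names are hypothetical.

```r
# Hypothetical counts for a rule X => Y in N transactions.
N    <- 1000
c_X  <- 200   # transactions containing X
c_Y  <- 300   # transactions containing Y
c_XY <- 90    # transactions containing both X and Y

# 2x2 contingency table: rows = X present/absent, columns = Y present/absent.
tab <- matrix(
  c(c_XY,       c_X - c_XY,
    c_Y - c_XY, N - c_X - c_Y + c_XY),
  nrow = 2, byrow = TRUE,
  dimnames = list(X = c("yes", "no"), Y = c("yes", "no"))
)

test <- chisq.test(tab, correct = FALSE)  # assumption: no continuity correction
chi2 <- unname(test$statistic)
p    <- test$p.value

# Compare the statistic with the critical value for 1 degree of freedom.
critical <- qchisq(0.95, df = 1)
print(c(chiSquared = chi2, critical = critical, p.value = p))
```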

"certainty", certainty factor, CF, Loevinger (Berzal et al., 2002)

The certainty factor is a measure of variation of the probability that Y is in a transaction when only considering transactions with X. An increasing CF means a decrease of the probability that Y is not in a transaction that X is in. Negative CFs have a similar interpretation.

Range: $[-1, 1]$ (0 indicates independence)

"collectiveStrength"

Collective strength (S).

Range: $[0, \infty]$

"confidence", conf (Agrawal et al., 1996)

Rule confidence is an estimate of $P(Y|X)$ calculated as $\frac{supp(X \Rightarrow Y)}{supp(X)}$. Confidence is a measure of validity.

Range: $[0, 1]$

"conviction" (Brin et al. 1997)

Defined as $\frac{supp(X)supp(\overline{Y})}{supp(X \cup \overline{Y})}$.

Range: $[0, \infty]$ ($1$ indicates unrelated items)

"cosine" (Tan et al., 2004)

Defined as $\frac{supp(X \cup Y)}{\sqrt{(supp(X)supp(Y))}}$

Range: $[0, 1]$

"count"

Absolute support count of the rule.

Range: $[0, \infty]$

"coverage", cover, LHS-support

Support of the left-hand-side of the rule, i.e., $supp(X)$. A measure of how often the rule can be applied.

Range: $[0, 1]$

"confirmedConfidence", descriptive confirmed confidence (Kodratoff, 1999)

Confidence confirmed by its negative as $conf(X \Rightarrow Y) - conf(X \Rightarrow \overline{Y})$.

Range: $[-1, 1]$

"casualConfidence", casual confidence (Kodratoff, 1999)

Confidence reinforced by negatives given by $\frac{1}{2}(conf(X \Rightarrow Y) + conf(\overline{Y} \Rightarrow \overline{X}))$.

Range: $[0, 1]$

"casualSupport", casual support (Kodratoff, 1999)

Support improved by negatives given by $supp(X \cup Y) - supp(\overline{X} \cup \overline{Y})$.

Range: $[-1, 1]$

"counterexample", example and counter-example rate

$\frac{supp(X \cup Y) - supp(X \cup \overline{Y})}{supp(X \cup Y)}$

Range: $[0, 1]$

"descriptiveConfirm", descriptive-confirm (Kodratoff, 1999)

Defined by $supp(X \cup Y) - supp(X \cup \overline{Y})$.

Range: $[0, 1]$

"doc", difference of confidence (Hofmann and Wilhelm, 2001)

Defined as $conf(X \Rightarrow Y) - conf(\overline{X} \Rightarrow Y)$.

Range: $[-1, 1]$

"fishersExactTest", Fisher's exact test (Hahsler and Hornik, 2007)

p-value of Fisher's exact test used in the analysis of contingency tables where sample sizes are small. By default complementary effects are mined; substitutes can be found by using the parameter complements = FALSE.

Note that it is equal to hyper-confidence with significance=TRUE.

Range: $[0, 1]$ (p-value scale)

"gini", Gini index (Tan et al., 2004)

Measures quadratic entropy.

Range: $[0, 1]$ (0 for independence)

"hyperLift" (Hahsler and Hornik, 2007)

Adaptation of the lift measure which is more robust for low counts. It is based on the idea that under independence the count $c_{XY}$ of the transactions which contain all items in a rule $X \Rightarrow Y$ follows a hypergeometric distribution (represented by the random variable $C_{XY}$) with the parameters given by the counts $c_X$ and $c_Y$.

Hyper-lift is defined as: $$\mathrm{hyperlift}(X \Rightarrow Y) = \frac{c_{XY}}{Q_{\delta}[C_{XY}]},$$

where $Q_{\delta}[C_{XY}]$ is the quantile of the hypergeometric distribution given by $\delta$. The quantile can be given as parameter d (default: d=0.99).

Range: $[0, \infty]$ (1 indicates independence)
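
The definition can be sketched in base R with stats::qhyper(). The counts below are made up, and the mapping of $c_X$, $c_Y$ and the number of transactions onto the arguments of qhyper() is an assumption based on the description above (drawing $c_X$ transactions out of N, of which $c_Y$ contain Y), not a copy of the package's code.

```r
# Hypothetical counts for a rule X => Y in N transactions.
N    <- 1000
c_X  <- 200   # transactions containing X
c_Y  <- 300   # transactions containing Y
c_XY <- 90    # transactions containing both X and Y
d    <- 0.99  # quantile, as in the default d = 0.99

# Under independence, C_XY is modeled as hypergeometric:
# m = c_Y "successes", n = N - c_Y "failures", k = c_X draws.
Q <- qhyper(d, m = c_Y, n = N - c_Y, k = c_X)

hyper_lift <- c_XY / Q

# Ordinary lift for comparison; it divides by the expected count c_X * c_Y / N.
lift <- c_XY / (c_X * c_Y / N)

print(c(quantile = Q, hyperLift = hyper_lift, lift = lift))
```

Because the quantile is larger than the expected count, hyper-lift is smaller than lift for the same rule, which is what makes it more conservative for low counts.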

"hyperConfidence" (Hahsler and Hornik, 2007)

Confidence level for observation of too high/low counts for rules $X \Rightarrow Y$ using the hypergeometric model. Since the counts are drawn from a hypergeometric distribution (represented by the random variable $C_{XY}$) with known parameters given by the counts $c_X$ and $c_Y$, we can calculate a confidence interval for the observed counts $c_{XY}$ stemming from the distribution. Hyper-confidence reports the confidence level (significance level if significance=TRUE is used) for

complements: $1 - P[C_{XY} \ge c_{XY} | c_X, c_Y]$
substitutes: $1 - P[C_{XY} < c_{XY} | c_X, c_Y]$.

A confidence level of, e.g., $> 0.95$ indicates that there is only a 5% chance that the count for the rule was generated randomly.

By default complementary effects are mined; substitutes can be found by using the parameter complements = FALSE.

Range: $[0, 1]$
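
The two probabilities can be sketched in base R with stats::phyper(). The counts are made up and the argument mapping is an assumption derived from the description above. The last lines illustrate the stated relation to Fisher's exact test by comparing one minus the complements value with the one-sided p-value from stats::fisher.test().

```r
# Hypothetical counts for a rule X => Y in N transactions.
N    <- 1000
c_X  <- 200   # transactions containing X
c_Y  <- 300   # transactions containing Y
c_XY <- 70    # transactions containing both X and Y

# complements: 1 - P[C_XY >= c_XY] = P[C_XY <= c_XY - 1]
hc_complements <- phyper(c_XY - 1, m = c_Y, n = N - c_Y, k = c_X)

# substitutes: 1 - P[C_XY < c_XY] = P[C_XY >= c_XY]
hc_substitutes <- phyper(c_XY - 1, m = c_Y, n = N - c_Y, k = c_X,
                         lower.tail = FALSE)

# One-sided Fisher's exact test on the same 2x2 table
# (rows = X present/absent, columns = Y present/absent).
tab <- matrix(
  c(c_XY,       c_X - c_XY,
    c_Y - c_XY, N - c_X - c_Y + c_XY),
  nrow = 2, byrow = TRUE
)
p_fisher <- fisher.test(tab, alternative = "greater")$p.value

print(c(complements = hc_complements, substitutes = hc_substitutes,
        fisher = p_fisher))
```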

"imbalance", imbalance ratio, IR (Wu, Chen and Han, 2010)

IR, defined as $\frac{|supp(X) - supp(Y)|}{supp(X) + supp(Y) - supp(X \Rightarrow Y)}$, gauges the degree of imbalance between the two events that the lhs and the rhs are contained in a transaction. The ratio is close to 0 if the conditional probabilities are similar (i.e., very balanced) and close to 1 if they are very different.

Range: $[0, 1]$ (0 indicates a balanced rule)

"implicationIndex", implication index (Gras, 1996)

Defined as $\sqrt{N} \frac{supp(X \cup \overline{Y}) - supp(X)supp(\overline{Y})}{\sqrt{supp(X)supp(\overline{Y})}}$. Represents a variation of the Lerman similarity.

Range: $[0, 1]$ (0 means independence)

"importance" (MS Analysis Services)

Log likelihood of the right-hand side of the rule, given the left-hand side of the rule.

$log_{10}(L(X \Rightarrow Y) / L(X \Rightarrow \overline{Y}))$

where $L$ is the Laplace corrected confidence.

Range: $[-\infty, \infty]$

"improvement" (Bayardo et al., 2000)

The improvement of a rule is the minimum difference between its confidence and the confidence of any more general rule (i.e., a rule with the same consequent but one or more items removed from the LHS). Defined as $min_{X' \subset X}(conf(X \Rightarrow Y) - conf(X' \Rightarrow Y))$

Range: $[0, 1]$

"jaccard", Jaccard coefficient (Tan and Kumar, 2000) sometimes also called Coherence (Wu et al., 2010)

Null-invariant measure defined as $\frac{supp(X \cup Y)}{supp(X) + supp(Y) - supp(X \cup Y)}$

Range: $[0, 1]$

"jMeasure", J-measure, J (Smyth and Goodman, 1991)

Measures cross entropy.

Range: $[0, 1]$ (0 for independence)

"kappa" (Tan and Kumar, 2000)

Defined as $\frac{supp(X \cup Y) + supp(\overline{X} \cup \overline{Y}) - supp(X)supp(Y) - supp(\overline{X})supp(\overline{Y})}{1- supp(X)supp(Y) - supp(\overline{X})supp(\overline{Y})}$

Range: $[-1,1]$ (0 means independence)

"klosgen", Klosgen (Tan and Kumar, 2000)

Defined as $\sqrt{supp(X \cup Y)} (conf(X \Rightarrow Y) - supp(Y))$

Range: $[-1, 1]$ (0 for independence)

"kulczynski" (Wu, Chen and Han, 2010; Kulczynski, 1927)

Calculate the null-invariant Kulczynski measure with a preference for skewed patterns.

Range: $[0, 1]$

"lambda", Goodman-Kruskal $\lambda$, predictive association (Tan and Kumar, 2000)

Range: $[0, 1]$

"laplace", Laplace corrected confidence, L (Tan and Kumar 2000)

Estimates confidence after increasing each count by 1. This prevents counts of 0, and L decreases with lower support.

Range: $[0, 1]$

"leastContradiction", least contradiction (Aze and Kodratoff, 2004)

$\frac{supp(X \cup Y) - supp(X \cup \overline{Y})}{supp(Y)}$.

Range: $[-1, 1]$

"lerman", Lerman similarity (Lerman, 1981)

Defined as $\sqrt{N} \frac{supp(X \cup Y) - supp(X)supp(Y)}{\sqrt{supp(X)supp(Y)}}$

Range: $[0, 1]$

"leverage", PS (Piatetsky-Shapiro 1991)

PS is defined as $supp(X \Rightarrow Y) - supp(X)supp(Y)$. It measures the difference between how often X and Y appear together in the data set and what would be expected if X and Y were statistically independent. It can be interpreted as the gap to independence.

Range: $[-1, 1]$ (0 indicates independence)

"lift", interest factor (Brin et al. 1997)

Lift quantifies dependence between X and Y by $\frac{supp(X \cup Y)}{supp(X)supp(Y)}$.

Range: $[0, \infty]$ (1 means independence)

"maxConfidence" (Wu et al. 2010)

Null-invariant measure defined as $max(conf(X \Rightarrow Y), conf(Y \Rightarrow X))$.

Range: $[0, 1]$

"mutualInformation", uncertainty, M (Tan et al., 2002)

Measures the information gain for Y provided by X.

Range: $[0, 1]$ (0 for independence)

"oddsRatio", odds ratio $\alpha$ (Tan et al., 2004)

The odds of finding X in transactions which contain Y divided by the odds of finding X in transactions which do not contain Y.

Range: $[0, \infty]$ ($1$ indicates that Y is not associated with X)

"phi", correlation coefficient $\phi$ (Tan et al., 2004)

Equivalent to Pearson's Product Moment Correlation Coefficient $\rho$.

Range: $[-1, 1]$ (0 when X and Y are independent)

"ralambrodrainy", Ralambondrainy measure (Diatta et al., 2007)

Range: $[0, 1]$

"RLD", relative linkage disequilibrium (Kenett and Salini, 2008)

RLD evaluates the deviation of the support of the whole rule from the support expected under independence given the supports of the LHS and the RHS. The code was contributed by Silvia Salini.

Range: $[0, 1]$

"rulePowerFactor", rule power factor (Ochin et al., 2016)

Product of support and confidence. Can be seen as rule confidence weighted by support.

Range: $[0, 1]$

"sebag", Sebag measure (Sebag and Schoenauer, 1988)

Defined as $\frac{supp(X \cup Y)}{supp(X \cup \overline{Y})}$

Range: $[0, \infty]$

"support", supp (Agrawal et al., 1996)

Support is an estimate of $P(X \cup Y)$ and measures the generality of the rule.

Range: $[0, 1]$

"varyingLiaison", varying rates liaison (Bernard and Charron, 1996)

Defined as $\frac{supp(X \cup Y)}{supp(X)supp(Y)}-1$. This is equivalent to $lift(X \Rightarrow Y) - 1$.

Range: $[-1, \infty]$ (0 for independence)

"yuleQ", Yule's Q (Tan and Kumar, 2000)

Defined as $\frac{\alpha-1}{\alpha+1}$ where $\alpha$ is the odds ratio.

Range: $[-1, 1]$

"yuleY", Yule's Y (Tan and Kumar, 2000)

Defined as $\frac{\sqrt{\alpha}-1}{\sqrt{\alpha}+1}$ where $\alpha$ is the odds ratio.

Range: $[-1, 1]$
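
Many of the rule measures above depend only on $supp(X)$, $supp(Y)$ and $supp(X \cup Y)$. The following base R sketch evaluates a few of the formulas by hand for made-up counts, which can help when checking values returned by interestMeasure. All counts and variable names are hypothetical, and the code follows the formulas as written in this section rather than the package's implementation.

```r
# Hypothetical counts for a rule X => Y in N transactions.
N    <- 1000
c_X  <- 200   # transactions containing X
c_Y  <- 300   # transactions containing Y
c_XY <- 90    # transactions containing both X and Y

sX     <- c_X / N
sY     <- c_Y / N
sXY    <- c_XY / N      # supp(X u Y) in the notation used here
sXnotY <- sX - sXY      # supp(X u not-Y): X present, Y absent

confidence <- sXY / sX
lift       <- sXY / (sX * sY)
leverage   <- sXY - sX * sY
conviction <- sX * (1 - sY) / sXnotY
cosine     <- sXY / sqrt(sX * sY)
jaccard    <- sXY / (sX + sY - sXY)
addedValue <- confidence - sY

# Odds ratio from the four cells of the 2x2 table, then Yule's Q and Y.
n11 <- c_XY
n10 <- c_X - c_XY
n01 <- c_Y - c_XY
n00 <- N - c_X - c_Y + c_XY
oddsRatio <- (n11 * n00) / (n10 * n01)
yuleQ <- (oddsRatio - 1) / (oddsRatio + 1)
yuleY <- (sqrt(oddsRatio) - 1) / (sqrt(oddsRatio) + 1)

print(round(c(confidence = confidence, lift = lift, leverage = leverage,
              conviction = conviction, cosine = cosine, jaccard = jaccard,
              addedValue = addedValue, oddsRatio = oddsRatio,
              yuleQ = yuleQ, yuleY = yuleY), 4))
```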

References

Hahsler, Michael (2015). A Probabilistic Comparison of Commonly Used Interest Measures for Association Rules, 2015, URL: http://michael.hahsler.net/research/association_rules/measures.html.

Agrawal, R., H Mannila, R Srikant, H Toivonen, AI Verkamo (1996). Fast Discovery of Association Rules. Advances in Knowledge Discovery and Data Mining 12 (1), 307--328.

Aze, J. and Y. Kodratoff (2004). Extraction de pepites de connaissances dans les donnees: Une nouvelle approche et une etude de sensibilite au bruit. In Mesures de Qualite pour la fouille de donnees. Revue des Nouvelles Technologies de l'Information, RNTI.

Bernard, Jean-Marc and Charron, Camilo (1996). L'analyse implicative bayesienne, une methode pour l'etude des dependances orientees. II : modele logique sur un tableau de contingence. Mathematiques et Sciences Humaines, Volume 135 (1996), p. 5--18.

Bayardo, R. , R. Agrawal, and D. Gunopulos (2000). Constraint-based rule mining in large, dense databases. Data Mining and Knowledge Discovery, 4(2/3):217--240.

Berzal, Fernando, Ignacio Blanco, Daniel Sanchez and Maria-Amparo Vila (2002). Measuring the accuracy and interest of association rules: A new framework. Intelligent Data Analysis 6, 221--235.

Brin, Sergey, Rajeev Motwani, Jeffrey D. Ullman, and Shalom Tsur (1997). Dynamic itemset counting and implication rules for market basket data. In SIGMOD 1997, Proceedings ACM SIGMOD International Conference on Management of Data, pages 255--264, Tucson, Arizona, USA.

Diatta, J., H. Ralambondrainy, and A. Totohasina (2007). Towards a unifying probabilistic implicative normalized quality measure for association rules. In Quality Measures in Data Mining, 237--250, 2007.

Hahsler, Michael and Kurt Hornik (2007). New probabilistic interest measures for association rules. Intelligent Data Analysis, 11(5):437--455.

Hofmann, Heike and Adalbert Wilhelm (2001). Visual comparison of association rules. Computational Statistics, 16(3):399--415.

Kenett, Ron and Silvia Salini (2008). Relative Linkage Disequilibrium: A New measure for association rules. In 8th Industrial Conference on Data Mining ICDM 2008, July 16--18, 2008, Leipzig/Germany.

Kodratoff, Y. (1999). Comparing Machine Learning and Knowledge Discovery in Databases: An Application to Knowledge Discovery in Texts. Lecture Notes on AI (LNAI) - Tutorial series.

Kulczynski, S. (1927). Die Pflanzenassoziationen der Pieninen. Bulletin International de l'Academie Polonaise des Sciences et des Lettres, Classe des Sciences Mathematiques et Naturelles B, 57--203.

Lerman, I.C. (1981). Classification et analyse ordinale des donnees. Paris.

Liu, Bing, Wynne Hsu, and Yiming Ma (1999). Pruning and summarizing the discovered associations. In KDD '99: Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 125--134. ACM Press, 1999.

Ochin, Suresh Kumar, and Nisheeth Joshi (2016). Rule Power Factor: A New Interest Measure in Associative Classification. 6th International Conference On Advances In Computing and Communications, ICACC 2016, 6-8 September 2016, Cochin, India.

Omiecinski, Edward R. (2003). Alternative interest measures for mining associations in databases. IEEE Transactions on Knowledge and Data Engineering, 15(1):57--69, Jan/Feb 2003.

Piatetsky-Shapiro, G. (1991). Discovery, analysis, and presentation of strong rules. In: Knowledge Discovery in Databases, pages 229--248.

Sebag, M. and M. Schoenauer (1988). Generation of rules with certainty and confidence factors from incomplete and incoherent learning bases. In Proceedings of the European Knowledge Acquisition Workshop (EKAW'88), Gesellschaft fuer Mathematik und Datenverarbeitung mbH, 28.1--28.20.

Smyth, Padhraic and Rodney M. Goodman (1991). Rule Induction Using Information Theory. Knowledge Discovery in Databases, 159--176.

Tan, Pang-Ning and Vipin Kumar (2000). Interestingness Measures for Association Patterns: A Perspective. TR 00-036, Department of Computer Science and Engineering University of Minnesota.

Tan, Pang-Ning, Vipin Kumar, and Jaideep Srivastava (2002). Selecting the right interestingness measure for association patterns. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining (KDD '02), ACM, 32--41.

Tan, Pang-Ning, Vipin Kumar, and Jaideep Srivastava (2004). Selecting the right objective measure for association analysis. Information Systems, 29(4):293--313.

Wu, T., Y. Chen, and J. Han (2010). Re-examination of interestingness measures in pattern mining: A unified framework. Data Mining and Knowledge Discovery, 21(3):371--397.

Xiong, Hui, Pang-Ning Tan, and Vipin Kumar (2003). Mining strong affinity association patterns in data sets with skewed support distribution. In Bart Goethals and Mohammed J. Zaki, editors, Proceedings of the IEEE International Conference on Data Mining, November 19--22, 2003, Melbourne, Florida, pages 387--394.

Examples

# NOT RUN {
data("Income")
rules <- apriori(Income)

## calculate a single measure and add it to the quality slot
quality(rules) <- cbind(quality(rules), 
	hyperConfidence = interestMeasure(rules, measure = "hyperConfidence", 
	transactions = Income))

inspect(head(rules, by = "hyperConfidence"))

## calculate several measures
m <- interestMeasure(rules, c("confidence", "oddsRatio", "leverage"), 
	transactions = Income)
inspect(head(rules))
head(m)

## calculate all available measures for the first 5 rules and show them as a 
## table with the measures as rows
t(interestMeasure(head(rules, 5), transactions = Income))

## calculate measures on a different set of transactions (I use a sample here)
## Note: reuse = TRUE (default) would just return the stored support on the
##	data set used for mining
newTrans <- sample(Income, 100)
m2 <- interestMeasure(rules, "support", transactions = newTrans, reuse = FALSE) 
head(m2)
# }
