Provides the generic function interestMeasure and the needed S4 method to calculate various additional interest measures for existing sets of itemsets or rules. A searchable list of definitions, equations and references for all available interest measures can be found here:
A Probabilistic Comparison of Commonly Used Interest Measures for Association Rules (Hahsler, 2015).
interestMeasure(x, measure, transactions = NULL, reuse = TRUE, ...)
x: a set of itemsets or rules.
measure: name or vector of names of the desired interest measures (see details for available measures). If measure is missing, all available measures are calculated.
transactions: the transaction data set used to mine the associations, or a different set of transactions to calculate interest measures from (note: you need to set reuse = FALSE in the latter case).
reuse: logical indicating whether the information in the quality slot should be reused for calculating the measures. This speeds up the process significantly, since little or no transaction counting is necessary if support, confidence and lift are already available. Use reuse = FALSE to force counting (this might be very slow, but it is necessary if you use a different set of transactions than was used for mining).
...: further arguments for the measure calculation.
If only one measure is used, the function returns a numeric vector containing the values of the interest measure for each association in the set of associations x.
If more than one measure is specified, the result is a data.frame containing the different measures for each association as columns.
NA is returned for rules/itemsets for which a certain measure is not defined.
The following measures are implemented for itemsets \(X\):
Defined on itemsets as the minimum confidence of all possible rules generated from the itemset. See details: All-Confidence
Range: \([0, 1]\)
Defined on itemsets as the ratio of the support of the least frequent item to the support of the most frequent item. Cross-support patterns have a ratio smaller than a set threshold. Normally many found patterns are cross-support patterns which contain frequent as well as rare items. Such patterns often tend to be spurious. See details: Cross-Support Ratio
Range: \([0, 1]\)
Lift is typically only defined for rules. In a similar way, we use the probability (support) of the itemset over the product of the probabilities of all items in the itemset, i.e., \(\frac{supp(X)}{\prod_{x \in X} supp(x)}\).
Range: \([0, \infty]\) (1 indicates independence)
Support is an estimate of \(P(X)\), a measure of generality of the itemset. It is estimated by the number of transactions that contain the itemset over the total number of transactions in the data set. See details: Support
Range: \([0, 1]\)
Absolute support count of the itemset, i.e., the number of transactions that contain the itemset. See details: Support Count
Range: \([0, \infty]\)
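The itemset measures above can be reproduced by hand on a toy data set. The following is a minimal base R sketch, not the package's implementation; the logical matrix `trans` and the helper `supp()` are made up for illustration.

```r
# Toy data: 6 transactions (rows) over 3 items (columns); hypothetical values.
trans <- matrix(
  c(TRUE,  TRUE,  FALSE,
    TRUE,  TRUE,  TRUE,
    TRUE,  FALSE, FALSE,
    TRUE,  TRUE,  FALSE,
    FALSE, TRUE,  TRUE,
    TRUE,  FALSE, FALSE),
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("a", "b", "c"))
)

# Support of an itemset: fraction of transactions that contain all of its items.
supp <- function(items) {
  mean(rowSums(trans[, items, drop = FALSE]) == length(items))
}

X <- c("a", "b")
item_supp <- sapply(X, supp)                   # support of each single item in X

support_X  <- supp(X)                          # estimate of P(X)
count_X    <- support_X * nrow(trans)          # absolute support count
all_conf   <- support_X / max(item_supp)       # minimum confidence of any rule from X
cross_supp <- min(item_supp) / max(item_supp)  # least frequent / most frequent item
lift_X     <- support_X / prod(item_supp)      # supp(X) over product of item supports
```

Here the itemset {a, b} occurs in 3 of 6 transactions, so its support is 0.5, while its lift is slightly below 1 because the two items co-occur a little less often than expected under independence.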
The following measures are implemented for rules of the form \(X \Rightarrow Y\):
Defined as the rule confidence minus the support of the rule's consequent, i.e., \(conf(X \Rightarrow Y) - supp(Y)\). See details: Added Value
Range: \([-.5, 1]\)
The chi-squared statistic to test for independence between the lhs and rhs of the rule. The critical value of the chi-squared distribution with \(1\) degree of freedom (2x2 contingency table) at \(\alpha=0.05\) is \(3.84\); higher chi-squared values indicate that the lhs and the rhs are not independent. See details: Chi-Squared statistic
Note that the contingency table is likely to have cells with low expected values and that thus Fisher's Exact Test might be more appropriate (see below).
Called with significance=TRUE, the p-value of the test for independence is returned instead of the chi-squared statistic.
For p-values, substitution effects can be tested using the parameter complements = FALSE.
Range: \([0, \infty]\)
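To make the comparison with the critical value concrete, here is a minimal base R sketch that builds the 2x2 contingency table for one rule and tests it. The counts are hypothetical, and the sketch uses the Pearson statistic without continuity correction, which is an assumption rather than a statement about how the package computes it.

```r
# Hypothetical counts for a rule X => Y in n = 1000 transactions.
n    <- 1000
n_x  <- 300   # transactions containing X
n_y  <- 400   # transactions containing Y
n_xy <- 180   # transactions containing both X and Y

# 2x2 contingency table: rows = X present/absent, columns = Y present/absent.
tab <- matrix(
  c(n_xy,       n_x - n_xy,
    n_y - n_xy, n - n_x - n_y + n_xy),
  nrow = 2, byrow = TRUE,
  dimnames = list(X = c("yes", "no"), Y = c("yes", "no"))
)

# Pearson chi-squared test, no continuity correction (assumption of this sketch).
test    <- chisq.test(tab, correct = FALSE)
chi2    <- unname(test$statistic)
p_value <- test$p.value

# Critical value for 1 degree of freedom at alpha = 0.05 (about 3.84).
critical <- qchisq(0.95, df = 1)
rejects_independence <- chi2 > critical
```

With these counts the statistic is far above the critical value, so the lhs and rhs would not be considered independent.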
The certainty factor is a measure of variation of the probability that Y is in a transaction when only considering transactions with X. An increasing CF means a decrease of the probability that Y is not in a transaction that X is in. Negative CFs have a similar interpretation. See details: Certainty Factor
Range: \([-1, 1]\) (0 indicates independence)
Collective strength (S).
Collective strength gives 0 for perfectly negatively correlated items, infinity for perfectly positively correlated items, and 1 if the items co-occur as expected under independence. See details: Collective Strength
Range: \([0, \infty]\)
Confidence is a measure of rule validity. Rule confidence is an estimate of \(P(Y|X)\). See details: Confidence
Range: \([0, 1]\)
Conviction was developed as an alternative to lift that also incorporates the direction of the rule. See details: Conviction
Range: \([0, \infty]\) (\(1\) indicates unrelated items)
A measure of correlation between the items in X and Y. See details: Cosine
Range: \([0, 1]\) (\(.5\) indicates no correlation)
Absolute support count of the rule, i.e., the number of transactions that contain all items in the rule. See details: Support Count
Range: \([0, \infty]\)
It measures the probability that a rule applies to a randomly selected transaction. It is estimated by the proportion of transactions that contain the antecedent (LHS) of the rule. Therefore, coverage is sometimes called antecedent support or LHS support. See details: Coverage
Range: \([0, 1]\)
How much higher is the confidence of a rule compared to the confidence of the rule \(X \Rightarrow \overline{Y}\). See details: Descriptive Confirmed Confidence
Range: \([-1, 1]\)
Confidence reinforced by the confidence of the rule \(\overline{X} \Rightarrow \overline{Y}\). See details: Causal Confidence
Range: \([0, 1]\)
Support reinforced by the support of the rule \(\overline{X} \Rightarrow \overline{Y}\). See details: Causal Support
Range: \([-1, 1]\)
Rate of the examples minus the rate of counter examples (i.e., \(X \Rightarrow \overline{Y}\)). See details: Example and Counter-example Rate
Range: \([0, 1]\)
Defined as the difference in confidence of the rule and the rule \(\overline{X} \Rightarrow Y\). See details: Difference of Confidence
Range: \([-1, 1]\)
p-value of Fisher's exact test used in the analysis of contingency tables where sample sizes are small.
By default complementary effects are mined; substitutes can be found by using the parameter complements = FALSE.
See details: Fisher's Exact Test
Note that it is equal to hyper-confidence with significance=TRUE.
Range: \([0, 1]\) (p-value scale)
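The link between Fisher's exact test and the hypergeometric model can be checked directly in base R. This is a minimal sketch with hypothetical counts; it assumes that "complementary effects" corresponds to the one-sided test for a co-occurrence count that is too high, and the variable names are illustrative only.

```r
# Hypothetical counts for a rule X => Y in a small data set.
n    <- 100
n_x  <- 30   # transactions containing X
n_y  <- 40   # transactions containing Y
n_xy <- 16   # transactions containing both X and Y

tab <- matrix(
  c(n_xy,       n_x - n_xy,
    n_y - n_xy, n - n_x - n_y + n_xy),
  nrow = 2, byrow = TRUE
)

# One-sided Fisher's exact test: is the co-occurrence count too high?
p_fisher <- fisher.test(tab, alternative = "greater")$p.value

# The same tail probability from the hypergeometric distribution:
# P(C >= n_xy), where C is the co-occurrence count under independence.
p_hyper <- phyper(n_xy - 1, m = n_y, n = n - n_y, k = n_x, lower.tail = FALSE)

# Confidence level corresponding to that significance level.
hyper_conf <- 1 - p_hyper
```

Both p-values agree because the one-sided exact test on a 2x2 table is the upper tail of the hypergeometric distribution with the margins held fixed.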
Measures quadratic entropy of a rule. See details: Gini index
Range: \([0, 1]\) (0 means the rule provides no information for the data set)
Confidence level that the observed co-occurrence count of the LHS and RHS is too high given the expected count using the hypergeometric model. See details: Hyper-Confidence
Hyper-confidence reports the confidence level by default and the significance level if significance=TRUE is used.
By default complementary effects are mined; substitutes (too low co-occurrence counts) can be found by using the parameter complements = FALSE.
Range: \([0, 1]\)
Adaptation of the lift measure which evaluates the deviation from independence using a quantile of the hypergeometric distribution defined by the counts of the LHS and RHS. HyperLift can be used to calculate confidence intervals for the lift measure. See details: Hyper-Lift
The quantile used can be given as parameter d (default: d=0.99).
Range: \([0, \infty]\) (1 indicates independence)
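As a rough illustration of how the quantile enters, the sketch below contrasts ordinary lift with a hyper-lift style ratio in base R. It assumes that hyper-lift divides the observed co-occurrence count by the d-quantile of the hypergeometric distribution of that count (my reading of Hahsler and Hornik, 2007); the counts are hypothetical and this is not the package's code.

```r
# Hypothetical counts for a rule X => Y.
n    <- 1000
n_x  <- 300   # transactions containing X
n_y  <- 400   # transactions containing Y
n_xy <- 180   # transactions containing both X and Y
d    <- 0.99  # quantile, matching the documented default

# Ordinary lift: observed count over the expected count under independence.
expected <- n_x * n_y / n
lift     <- n_xy / expected

# Hyper-lift style ratio (assumed definition): observed count over the
# d-quantile of the hypergeometric co-occurrence count.
q_d        <- qhyper(d, m = n_y, n = n - n_y, k = n_x)
hyper_lift <- n_xy / q_d
```

Because the 0.99 quantile lies above the expected count, the hyper-lift style value is smaller than lift, which makes it a more conservative indicator of dependence.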
IR measures the degree of imbalance between the two events that the lhs and the rhs are contained in a transaction. The ratio is close to 0 if the conditional probabilities are similar (i.e., very balanced) and close to 1 if they are very different. See also: Imbalance ratio
Range: \([0, 1]\) (0 indicates a balanced rule)
A variation of the Lerman similarity. See details: Implication Index
Range: \([0, 1]\) (0 means independence)
Log likelihood of the right-hand side of the rule, given the left-hand side of the rule using Laplace corrected confidence. See details: Importance
Range: \([-\infty, \infty]\)
The improvement of a rule is the minimum difference between its confidence and the confidence of any more general rule (i.e., a rule with the same consequent but one or more items removed in the LHS). See details: Improvement
Range: \([0, 1]\)
Null-invariant measure of dependence defined as the Jaccard similarity between the two sets of transactions that contain the items in X and Y, respectively. See details: Jaccard coefficient
Range: \([0, 1]\) (0 means X and Y never occur together)
A scaled measure of cross entropy to measure the information content of a rule. See details: J-Measure
Range: \([0, 1]\) (0 indicates X does not provide information for Y)
Cohen's Kappa of the rule (seen as a classifier), given as the rule's observed accuracy (i.e., confidence) corrected by the expected accuracy (of a random classifier). See details: Cohen's Kappa
Range: \([-1,1]\) (0 means the rule is not better than a random classifier)
Defined as \(\sqrt{supp(X \cup Y)} \, (conf(X \Rightarrow Y) - supp(Y))\). See details: Klosgen measure
Range: \([-1, 1]\) (0 for independence)
Calculate the null-invariant Kulczynski measure with a preference for skewed patterns. See details: Kulczynski measure
Range: \([0, 1]\)
Goodman and Kruskal's lambda to assess the association between the LHS and RHS of the rule. See details: Goodman-Kruskal's Lambda
Range: \([0, 1]\)
Estimates confidence by increasing each count by 1. Parameter k can be used to specify the number of classes (default is 2).
Prevents counts of 0, and L decreases with lower support.
See details: Laplace corrected confidence/accuracy
Range: \([0, 1]\)
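The effect of the correction is easiest to see on two rules with the same confidence but different support. This is a minimal sketch assuming the usual form \((n_{XY} + 1) / (n_X + k)\); the function name `laplace_conf` and the counts are hypothetical.

```r
# Laplace corrected confidence (assumed form): add 1 to the rule count and
# k (number of classes) to the LHS count.
laplace_conf <- function(n_xy, n_x, k = 2) (n_xy + 1) / (n_x + k)

# Two rules with identical confidence 0.9 but very different support counts.
conf_rare     <- 9 / 10
conf_frequent <- 900 / 1000

l_rare     <- laplace_conf(n_xy = 9,   n_x = 10)    # 10 / 12
l_frequent <- laplace_conf(n_xy = 900, n_x = 1000)  # 901 / 1002
```

The rarely supported rule is pulled further away from its raw confidence than the frequent one, which is the sense in which L decreases with lower support.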
Probability of finding a matching transaction minus the probability of finding a contradicting transaction normalized by the probability of finding a transaction containing Y. See details: Least Contradiction
Range: \([-1, 1]\)
Defined as \(\sqrt{N} \frac{supp(X \cup Y) - supp(X)supp(Y)}{\sqrt{supp(X)supp(Y)}}\). See details: Lerman similarity
Range: \([0, 1]\)
PS measures the difference between how often X and Y appear together in the data set and what would be expected if X and Y were statistically independent. It can be interpreted as the gap to independence. See details: Leverage
Range: \([-1, 1]\) (0 indicates independence)
Lift quantifies dependence between X and Y by comparing the probability that X and Y are contained in a transaction to the expected probability under independence (i.e., the product of the probabilities that X is contained in a transaction times the probability that Y is contained in a transaction). See details: Lift
Range: \([0, \infty]\) (1 means independence between LHS and RHS)
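Several of the rule measures described so far (coverage, confidence, conviction, leverage and lift) depend only on three supports, so they can be computed side by side. The sketch below uses hypothetical counts and base R; conviction uses the common definition \((1 - supp(Y)) / (1 - conf(X \Rightarrow Y))\), which this page does not spell out, so treat that line as an assumption.

```r
# Hypothetical counts for a rule X => Y in n = 1000 transactions.
n    <- 1000
n_x  <- 300   # transactions containing X (the LHS)
n_y  <- 400   # transactions containing Y (the RHS)
n_xy <- 180   # transactions containing both

supp_x  <- n_x / n
supp_y  <- n_y / n
supp_xy <- n_xy / n                       # support of the rule

coverage   <- supp_x                      # LHS support
confidence <- supp_xy / supp_x            # estimate of P(Y | X)
lift       <- supp_xy / (supp_x * supp_y) # 1 under independence
leverage   <- supp_xy - supp_x * supp_y   # 0 under independence
conviction <- (1 - supp_y) / (1 - confidence)  # assumed common definition
```

For these counts lift is 1.5 and leverage is 0.06: both indicate the same positive dependence, once as a ratio and once as a difference.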
Null-invariant symmetric measure defined as the larger of the confidence of the rule or the rule with X and Y exchanged. See details: MaxConfidence
Range: \([0, 1]\)
Measures the information gain for Y provided by X. See details: Mutual Information
Range: \([0, 1]\) (0 means that X does not provide information for Y)
The odds of finding X in transactions which contain Y divided by the odds of finding X in transactions which do not contain Y. See details: Odds Ratio
Range: \([0, \infty]\) (\(1\) indicates that Y is not associated with X)
Correlation coefficient between the transactions containing X and Y represented as two binary vectors. Phi correlation is equivalent to Pearson's Product Moment Correlation Coefficient \(\rho\) with 0-1 values. See details: Phi Correlation Coefficient
Range: \([-1, 1]\) (0 when X and Y are independent)
The measure is defined as the probability that a transaction contains X but not Y. A smaller value is better. See details: Ralambondrainy Measure
Range: \([0, 1]\)
RLD is an association measure motivated by indices used in population genetics. It evaluates the deviation of the support of the whole rule from the support expected under independence given the supports of the LHS and the RHS. See details: Relative linkage disequilibrium
The code was contributed by Silvia Salini.
Range: \([0, 1]\)
Product of support and confidence. Can be seen as rule confidence weighted by support. See details: Rule Power Factor
Range: \([0, 1]\)
Confidence of a rule divided by the confidence of the rule \(X \Rightarrow \overline{Y}\). See details: Sebag-Schoenauer measure
Range: \([0, \infty]\)
Standardized lift uses the minimum and maximum values that lift can reach for each rule to standardize lift between 0 and 1. By default, the measure is corrected for minimum support and minimum confidence. Correction can be disabled by using the argument correct = FALSE.
See details: Standardized Lift
Range: \([0, 1]\)
Support is an estimate of \(P(X \cup Y)\) and measures the generality of the rule. See details: Support
Range: \([0, 1]\)
Defined as the lift of a rule minus 1 so 0 represents independence. See details: Varying Rates Liaison
Range: \([-1, \infty]\) (0 for independence)
Defined as \(\frac{\alpha-1}{\alpha+1}\) where \(\alpha\) is the odds ratio. See details: Yule's Q
Range: \([-1, 1]\)
Defined as \(\frac{\sqrt{\alpha}-1}{\sqrt{\alpha}+1}\) where \(\alpha\) is the odds ratio. See details: Yule's Y
Range: \([-1, 1]\)
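Yule's Q and Yule's Y are both simple transformations of the odds ratio, as the following base R sketch shows. The four cell counts are hypothetical and the variable names are illustrative only.

```r
# Hypothetical 2x2 cell counts for a rule X => Y.
n11 <- 180  # X and Y
n10 <- 120  # X without Y
n01 <- 220  # Y without X
n00 <- 480  # neither X nor Y

# Odds ratio: odds of X among transactions with Y over odds of X among
# transactions without Y.
alpha <- (n11 / n01) / (n10 / n00)

# Yule's Q and Yule's Y map the odds ratio from [0, Inf] onto [-1, 1].
yule_q <- (alpha - 1) / (alpha + 1)
yule_y <- (sqrt(alpha) - 1) / (sqrt(alpha) + 1)
```

An odds ratio of 1 maps to 0 for both measures; for odds ratios above 1, Y is always closer to 0 than Q because it works on the square root.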
Hahsler, Michael (2015). A Probabilistic Comparison of Commonly Used Interest Measures for Association Rules. URL: http://michael.hahsler.net/research/association_rules/measures.html.
Agrawal, R., H Mannila, R Srikant, H Toivonen, AI Verkamo (1996). Fast Discovery of Association Rules. Advances in Knowledge Discovery and Data Mining 12(1):307--328.
Aze, J. and Y. Kodratoff (2004). Extraction de pepites de connaissances dans les donnees: Une nouvelle approche et une etude de sensibilite au bruit. In Mesures de Qualite pour la fouille de donnees. Revue des Nouvelles Technologies de l'Information, RNTI.
Bernard, Jean-Marc and Camilo Charron (1996). L'analyse implicative bayesienne, une methode pour l'etude des dependances orientees. II : modele logique sur un tableau de contingence. Mathematiques et Sciences Humaines, 135:5--18.
Bayardo, R. , R. Agrawal, and D. Gunopulos (2000). Constraint-based rule mining in large, dense databases. Data Mining and Knowledge Discovery, 4(2/3):217--240.
Berzal, Fernando, Ignacio Blanco, Daniel Sanchez and Maria-Amparo Vila (2002). Measuring the accuracy and interest of association rules: A new framework. Intelligent Data Analysis 6, 221--235.
Liu, Bing, Wynne Hsu, and Yiming Ma (1999). Pruning and summarizing the discovered associations. In KDD '99: Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 125--134. ACM Press.
Brin, Sergey, Rajeev Motwani, Jeffrey D. Ullman, and Shalom Tsur (1997). Dynamic itemset counting and implication rules for market basket data. In SIGMOD 1997, Proceedings ACM SIGMOD International Conference on Management of Data, pages 255--264, Tucson, Arizona, USA.
Diatta, J., H. Ralambondrainy, and A. Totohasina (2007). Towards a unifying probabilistic implicative normalized quality measure for association rules. In Quality Measures in Data Mining, 237--250, 2007.
Gras R (1996). L'implication statistique. Nouvelle methode exploratoire de donnees. La Pensee Sauvage, Grenoble.
Hahsler, Michael and Kurt Hornik (2007). New probabilistic interest measures for association rules. Intelligent Data Analysis, 11(5):437--455.
Hofmann, Heike and Adalbert Wilhelm (2001). Visual comparison of association rules. Computational Statistics, 16(3):399--415.
Kenett, Ron and Silvia Salini (2008). Relative Linkage Disequilibrium: A New measure for association rules. In 8th Industrial Conference on Data Mining ICDM 2008, July 16--18, 2008, Leipzig/Germany.
Kodratoff, Y. (1999). Comparing Machine Learning and Knowledge Discovery in Databases: An Application to Knowledge Discovery in Texts. Lecture Notes on AI (LNAI) - Tutorial series.
Kulczynski, S. (1927). Die Pflanzenassoziationen der Pieninen. Bulletin International de l'Academie Polonaise des Sciences et des Lettres, Classe des Sciences Mathematiques et Naturelles B, 57--203.
Lerman, I.C. (1981). Classification et analyse ordinale des donnees. Paris.
McNicholas, P.D., T.B. Murphy, M. O'Regan (2008). Standardising the lift of an association rule, Computational Statistics & Data Analysis, 52(10):4712--4721, ISSN 0167-9473, 10.1016/j.csda.2008.03.013.
Ochin, Suresh Kumar, and Nisheeth Joshi (2016). Rule Power Factor: A New Interest Measure in Associative Classification. 6th International Conference On Advances In Computing and Communications, ICACC 2016, 6-8 September 2016, Cochin, India.
Omiecinski, Edward R. (2003). Alternative interest measures for mining associations in databases. IEEE Transactions on Knowledge and Data Engineering, 15(1):57--69, Jan/Feb 2003.
Piatetsky-Shapiro, G. (1991). Discovery, analysis, and presentation of strong rules. In: Knowledge Discovery in Databases, pages 229--248.
Sebag, M. and M. Schoenauer (1988). Generation of rules with certainty and confidence factors from incomplete and incoherent learning bases. In Proceedings of the European Knowledge Acquisition Workshop (EKAW'88), Gesellschaft fuer Mathematik und Datenverarbeitung mbH, 28.1--28.20.
Smyth, Padhraic and Rodney M. Goodman (1991). Rule Induction Using Information Theory. Knowledge Discovery in Databases, 159--176.
Tan, Pang-Ning and Vipin Kumar (2000). Interestingness Measures for Association Patterns: A Perspective. TR 00-036, Department of Computer Science and Engineering University of Minnesota.
Tan, Pang-Ning, Vipin Kumar, and Jaideep Srivastava (2002). Selecting the right interestingness measure for association patterns. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining (KDD '02), ACM, 32--41.
Tan, Pang-Ning, Vipin Kumar, and Jaideep Srivastava (2004). Selecting the right objective measure for association analysis. Information Systems, 29(4):293--313.
Wu, T., Y. Chen, and J. Han (2010). Re-examination of interestingness measures in pattern mining: A unified framework. Data Mining and Knowledge Discovery, 21(3):371--397.
Xiong, Hui, Pang-Ning Tan, and Vipin Kumar (2003). Mining strong affinity association patterns in data sets with skewed support distribution. In Bart Goethals and Mohammed J. Zaki, editors, Proceedings of the IEEE International Conference on Data Mining, November 19--22, 2003, Melbourne, Florida, pages 387--394.
# NOT RUN {
library("arules")
data("Income")
rules <- apriori(Income)
## calculate a single measure and add it to the quality slot
quality(rules) <- cbind(quality(rules),
  hyperConfidence = interestMeasure(rules, measure = "hyperConfidence",
    transactions = Income))
inspect(head(rules, by = "hyperConfidence"))
## calculate several measures
m <- interestMeasure(rules, c("confidence", "oddsRatio", "leverage"),
  transactions = Income)
inspect(head(rules))
head(m)
## calculate all available measures for the first 5 rules and show them as a
## table with the measures as rows
t(interestMeasure(head(rules, 5), transactions = Income))
## calculate measures on a different set of transactions (I use a sample here)
## Note: reuse = TRUE (default) would just return the stored support on the
## data set used for mining
newTrans <- sample(Income, 100)
m2 <- interestMeasure(rules, "support", transactions = newTrans, reuse = FALSE)
head(m2)
## calculate all available measures for the 5 frequent itemsets with highest support
its <- apriori(Income, parameter = list(target = "frequent itemsets"))
its <- head(its, 5, by = "support")
inspect(its)
interestMeasure(its, transactions = Income)
# }