interestMeasure: Calculate Additional Interest Measures

Description

Provides the generic function interestMeasure and the needed S4 method to calculate various additional interest measures for existing sets of itemsets or rules. A searchable list of definitions, equations and references for all available interest measures can be found here: https://mhahsler.github.io/arules/docs/measures

Usage

interestMeasure(x, measure, transactions = NULL, reuse = TRUE, ...)

Arguments

x

a set of itemsets or rules.

measure

name or vector of names of the desired interest measures (see details for available measures). If measure is missing then all available measures are calculated.

transactions

the transaction data set used to mine the associations, or a different set of transactions to calculate the interest measures from (Note: you need to set reuse = FALSE in the latter case).

reuse

logical indicating whether the information in the quality slot should be reused for calculating the measures. This speeds up the process significantly since only very little (or no) transaction counting is necessary if support, confidence and lift are already available. Use reuse = FALSE to force counting (this might be very slow, but is necessary if you use a different set of transactions than was used for mining).

...

further arguments for the measure calculation. Many measures are based on contingency table counts, and zero counts can produce NaN values (division by zero). This issue can be resolved by using the additional parameter smoothCounts, which performs additive smoothing by adding a "pseudo count" of smoothCounts to each count in the contingency table. Use smoothCounts = 1 or larger values for Laplace smoothing. Use smoothCounts = .5 for the Haldane-Anscombe correction often used for chi-squared, phi correlation and related measures.
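The effect of additive smoothing can be illustrated without the package. The following base R sketch is not the arules implementation; the helper odds_ratio() and the counts are made up for illustration, and the odds ratio is used only because it is easy to break with a single zero cell.

```r
# Hypothetical 2x2 contingency counts for a rule X => Y
# (rows: X present/absent, columns: Y present/absent).
counts <- matrix(c(20,  0,
                    5, 75),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(X = c("1", "0"), Y = c("1", "0")))

# Illustrative helper (not part of arules): odds ratio from a 2x2 table.
odds_ratio <- function(tab) {
  (tab["1", "1"] * tab["0", "0"]) / (tab["1", "0"] * tab["0", "1"])
}

raw      <- odds_ratio(counts)        # a zero count leads to a division by zero
smoothed <- odds_ratio(counts + 0.5)  # pseudo count of 0.5 added to every cell

print(c(raw = raw, smoothed = smoothed))
```

With the pseudo count added to all four cells, the ratio stays finite while the ordering of strongly and weakly associated rules is largely preserved.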

Value

If only one measure is used, the function returns a numeric vector containing the values of the interest measure for each association in the set of associations x.

If more than one measure is specified, the result is a data.frame containing the different measures for each association as columns.

NA is returned for rules/itemsets for which a certain measure is not defined.

Details

The following measures are implemented for itemsets \(X\):

"allConfidence"

Defined on itemsets as the minimum confidence of all possible rules generated from the itemset.

See details: https://mhahsler.github.io/arules/docs/measures#all-confidence

Range: \([0, 1]\)

"crossSupportRatio", cross-support ratio

Defined on itemsets as the ratio of the support of the least frequent item to the support of the most frequent item. Cross-support patterns have a ratio smaller than a set threshold. Normally many found patterns are cross-support patterns which contain frequent as well as rare items. Such patterns often tend to be spurious.

See details: https://mhahsler.github.io/arules/docs/measures#cross-support-ratio

Range: \([0, 1]\)

"lift"

Lift is typically only defined for rules. For itemsets, we analogously use the probability (support) of the itemset over the product of the probabilities of all items in the itemset, i.e., \(\frac{supp(X)}{\prod_{x \in X} supp(x)}\).

Range: \([0, \infty]\) (1 indicates independence)

"support", supp

Support is an estimate of \(P(X)\), a measure of generality of the itemset. It is estimated by the number of transactions that contain the itemset over the total number of transactions in the data set.

See details: https://mhahsler.github.io/arules/docs/measures#support

Range: \([0, 1]\)

"count"

Absolute support count of the itemset, i.e., the number of transactions that contain the itemset.

See details: https://mhahsler.github.io/arules/docs/measures#support

Range: \([0, \infty]\)
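The itemset measures above can be reproduced by hand on a toy data set. This is a minimal base R sketch, not the package code; the logical matrix trans and the item names are invented for illustration, and all-confidence is computed through the equivalent form supp(X) divided by the largest item support.

```r
# Toy transactions as a logical matrix (rows: transactions, columns: items).
trans <- matrix(c(1, 1, 0,
                  1, 1, 1,
                  1, 0, 0,
                  1, 1, 0,
                  0, 0, 1),
                ncol = 3, byrow = TRUE,
                dimnames = list(NULL, c("a", "b", "c"))) == 1

itemset <- c("a", "b")

# Support of each single item in the itemset.
item_supp <- colMeans(trans[, itemset, drop = FALSE])

# Transactions that contain every item of the itemset.
contains_all <- rowSums(trans[, itemset, drop = FALSE]) == length(itemset)

count <- sum(contains_all)   # absolute support count
supp  <- mean(contains_all)  # relative support

all_confidence      <- supp / max(item_supp)
cross_support_ratio <- min(item_supp) / max(item_supp)
itemset_lift        <- supp / prod(item_supp)

print(c(count = count, support = supp, allConfidence = all_confidence,
        crossSupportRatio = cross_support_ratio, lift = itemset_lift))
```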

The following measures are implemented for rules of the form \(X \Rightarrow Y\):

"addedValue", added value, AV, Pavillon index, centered confidence

Defined as the rule confidence minus the support of the rule's consequent (RHS), i.e., \(conf(X \Rightarrow Y) - supp(Y)\).

See details: https://mhahsler.github.io/arules/docs/measures#added-value

Range: \([-.5, 1]\)

"boost", confidence boost

Confidence boost is the ratio of the confidence of a rule to the confidence of any more general rule (i.e., a rule with the same consequent but one or more items removed in the LHS). Values larger than 1 mean the new rule boosts the confidence compared to the best, more general rule. The measure is related to the improvement measure.

See details: https://mhahsler.github.io/arules/docs/measures#confidence-boost

Range: \([0, \infty]\)

"chiSquared", \(\chi^2\) statistic

The chi-squared statistic to test for independence between the lhs and rhs of the rule. The critical value of the chi-squared distribution with \(1\) degree of freedom (2x2 contingency table) at \(\alpha=0.05\) is \(3.84\); higher chi-squared values indicate that the lhs and the rhs are not independent.

See details: https://mhahsler.github.io/arules/docs/measures#chi-squared

Note that the contingency table is likely to have cells with low expected values and that thus Fisher's Exact Test might be more appropriate (see below).

Called with significance = TRUE, the p-value of the test for independence is returned instead of the chi-squared statistic. For p-values, substitution effects (the occurrence of one item makes the occurrence of another item less likely) can be tested using the parameter complements = FALSE. Correction for multiple comparisons can be done using p.adjust.

Range: \([0, \infty]\) or p-value scale
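The test described above can be sketched with base R on a single 2x2 table. The counts are invented, and the sketch assumes a plain Pearson statistic without continuity correction; it is meant to show the comparison with the critical value and the use of p.adjust, not to mirror the package internals.

```r
# Hypothetical contingency counts for one rule X => Y.
counts <- matrix(c(30, 10,
                   20, 40),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(X = c("1", "0"), Y = c("1", "0")))

test    <- chisq.test(counts, correct = FALSE)  # assumption: no Yates correction
chi2    <- unname(test$statistic)
p_value <- test$p.value

# Critical value for 1 degree of freedom at alpha = 0.05 (about 3.84).
critical    <- qchisq(0.95, df = 1)
significant <- chi2 > critical

# When many rules are tested, adjust the p-values; the two extra
# p-values stand in for other rules and are made up.
p_adjusted <- p.adjust(c(p_value, 0.04, 0.20), method = "holm")

print(c(chi2 = chi2, critical = critical))
print(p_adjusted)
```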

"certainty", certainty factor, CF, Loevinger

The certainty factor is a measure of variation of the probability that Y is in a transaction when only considering transactions with X. An increasing CF means a decrease of the probability that Y is not in a transaction that X is in. Negative CFs have a similar interpretation.

See details: https://mhahsler.github.io/arules/docs/measures#certainty-factor

Range: \([-1, 1]\) (0 indicates independence)

"collectiveStrength", Collective strength, S

Collective strength gives 0 for perfectly negative correlated items, infinity for perfectly positive correlated items, and 1 if the items co-occur as expected under independence.

See details: https://mhahsler.github.io/arules/docs/measures#collective-strength

Range: \([0, \infty]\)

"confidence", Strength, conf

Confidence is a measure of rule validity. Rule confidence is an estimate of \(P(Y|X)\).

See details: https://mhahsler.github.io/arules/docs/measures#confidence

Range: \([0, 1]\)

"conviction"

Conviction was developed as an alternative to lift that also incorporates the direction of the rule.

See details: https://mhahsler.github.io/arules/docs/measures#conviction

Range: \([0, \infty]\) (\(1\) indicates unrelated items)

"cosine"

A measure of correlation between the items in X and Y.

See details: https://mhahsler.github.io/arules/docs/measures#cosine

Range: \([0, 1]\) (\(.5\) indicates no correlation)

"count"

Absolute support count of the rule, i.e., the number of transactions that contain all items in the rule.

See details: https://mhahsler.github.io/arules/docs/measures#support

Range: \([0, \infty]\)

"coverage", cover, LHS-support

It measures the probability that a rule applies to a randomly selected transaction. It is estimated by the proportion of transactions that contain the antecedent (LHS) of the rule. Therefore, coverage is sometimes called antecedent support or LHS support.

See details: https://mhahsler.github.io/arules/docs/measures#coverage

Range: \([0, 1]\)

"confirmedConfidence", descriptive confirmed confidence

How much higher is the confidence of a rule compared to the confidence of the rule \(X \Rightarrow \overline{Y}\).

See details: https://mhahsler.github.io/arules/docs/measures#descriptive-confirmed-confidence

Range: \([-1, 1]\)

"casualConfidence", casual confidence

Confidence reinforced by the confidence of the rule \(\overline{X} \Rightarrow \overline{Y}\).

See details: https://mhahsler.github.io/arules/docs/measures#casual-confidence

Range: \([0, 1]\)

"casualSupport", casual support

Support reinforced by the support of the rule \(\overline{X} \Rightarrow \overline{Y}\).

See details: https://mhahsler.github.io/arules/docs/measures#casual-support

Range: \([-1, 1]\)

"counterexample", example and counter-example rate

Rate of the examples minus the rate of counter examples (i.e., \(X \Rightarrow \overline{Y}\)).

See details: https://mhahsler.github.io/arules/docs/measures#example-and-counter-example-rate

Range: \([0, 1]\)

"doc", difference of confidence

Defined as the difference in confidence of the rule and the rule \(\overline{X} \Rightarrow Y\).

See details: https://mhahsler.github.io/arules/docs/measures#difference-of-confidence

Range: \([-1, 1]\)

"fishersExactTest", Fisher's exact test

p-value of Fisher's exact test, which is used in the analysis of contingency tables where sample sizes are small. By default, complementary effects are mined; substitutes can be found by using the parameter complements = FALSE.

See details: https://mhahsler.github.io/arules/docs/measures#fishers-exact-test

Note that it is equal to hyper-confidence with significance=TRUE. Correction for multiple comparisons can be done using p.adjust.

Range: \([0, 1]\) (p-value scale)

"gini", Gini index

Measures quadratic entropy of a rule.

See details: https://mhahsler.github.io/arules/docs/measures#gini-index

Range: \([0, 1]\) (0 means the rule provides no information for the data set)

"hyperConfidence"

Confidence level that the observed co-occurrence count of the LHS and RHS is too high given the expected count using the hypergeometric model.

See details: https://mhahsler.github.io/arules/docs/measures#hyper-confidence

Hyper-confidence reports the confidence level by default and the significance level if significance=TRUE is used.

By default, complementary effects are mined; substitutes (too low co-occurrence counts) can be found by using the parameter complements = FALSE.

Range: \([0, 1]\)
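The relationship between hyper-confidence and Fisher's exact test noted above can be checked numerically in base R. The counts below are invented, and the sketch assumes the one-sided (complementary effects) version of the test; it is an illustration of the hypergeometric model, not the package code.

```r
# Hypothetical contingency counts for a rule X => Y.
n11 <- 30  # X and Y
n10 <- 10  # X without Y
n01 <- 20  # Y without X
n00 <- 40  # neither
N   <- n11 + n10 + n01 + n00
n_x <- n11 + n10  # transactions containing X
n_y <- n11 + n01  # transactions containing Y

# One-sided Fisher's exact test: P(co-occurrence count >= n11).
tab      <- matrix(c(n11, n01, n10, n00), nrow = 2)  # filled column-wise
p_fisher <- fisher.test(tab, alternative = "greater")$p.value

# The same tail probability taken directly from the hypergeometric distribution.
p_hyper <- phyper(n11 - 1, m = n_y, n = N - n_y, k = n_x, lower.tail = FALSE)

# Under these assumptions the confidence level is the complement of the p-value.
hyper_confidence <- 1 - p_hyper

print(c(p_fisher = p_fisher, p_hyper = p_hyper,
        hyperConfidence = hyper_confidence))
```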

"hyperLift"

Adaptation of the lift measure which evaluates the deviation from independence using a quantile of the hypergeometric distribution defined by the counts of the LHS and RHS. HyperLift can be used to calculate confidence intervals for the lift measure.

The used quantile can be given as parameter level (default: level = 0.99).

See details: https://mhahsler.github.io/arules/docs/measures#hyper-lift

Range: \([0, \infty]\) (1 indicates independence)
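As a rough illustration, hyper-lift can be written as the observed co-occurrence count divided by a high quantile of the hypergeometric distribution, where ordinary lift divides by the expected count. The sketch below uses base R and invented counts; the exact formula used by the package is assumed here rather than taken from this page.

```r
# Hypothetical counts for a rule X => Y.
N    <- 100  # transactions
n_x  <- 40   # transactions containing X
n_y  <- 50   # transactions containing Y
n_xy <- 30   # transactions containing both

level <- 0.99

# Ordinary lift: observed count over the count expected under independence.
expected <- n_x * n_y / N
lift     <- n_xy / expected

# Hyper-lift (assumed form): observed count over the `level` quantile of the
# hypergeometric distribution of the co-occurrence count.
q_level    <- qhyper(level, m = n_y, n = N - n_y, k = n_x)
hyper_lift <- n_xy / q_level

print(c(lift = lift, quantile = q_level, hyperLift = hyper_lift))
```

Because the quantile is larger than the expected count, hyper-lift is more conservative than lift for the same rule.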

"imbalance", imbalance ratio, IR

IR measures the degree of imbalance between the two events that the lhs and the rhs are contained in a transaction. The ratio is close to 0 if the conditional probabilities are similar (i.e., very balanced) and close to 1 if they are very different.

See details: https://mhahsler.github.io/arules/docs/measures#imbalance-ratio

Range: \([0, 1]\) (0 indicates a balanced rule)

"implicationIndex", implication index

A variation of the Lerman similarity.

See details: https://mhahsler.github.io/arules/docs/measures#implication-index

Range: \([0, 1]\) (0 means independence)

"importance"

Log likelihood of the right-hand side of the rule, given the left-hand side of the rule using Laplace corrected confidence.

See details: https://mhahsler.github.io/arules/docs/measures#importance

Range: \([-\infty, \infty]\)

"improvement"

The improvement of a rule is the minimum difference between its confidence and the confidence of any more general rule (i.e., a rule with the same consequent but one or more items removed in the LHS).

Special case: We define the improvement of a rule with an empty LHS as its confidence.

The idea of improvement can be generalized to other measures than confidence. Other measures like lift can be specified with the extra parameter improvementMeasure.

See details: https://mhahsler.github.io/arules/docs/measures#improvement

Range: \([0, 1]\)

"jaccard", Jaccard coefficient, sometimes also called Coherence

Null-invariant measure of dependence defined as the Jaccard similarity between the two sets of transactions that contain the items in X and Y, respectively.

See details: https://mhahsler.github.io/arules/docs/measures#jaccard-coefficient

Range: \([0, 1]\)

"jMeasure", J-measure, J

A scaled measure of cross entropy to measure the information content of a rule.

See details: https://mhahsler.github.io/arules/docs/measures#j-measure

Range: \([0, 1]\) (0 indicates X does not provide information for Y)

"kappa", Cohen's Kappa (Tan and Kumar, 2000)

Cohen's Kappa of the rule (seen as a classifier) given as the rule's observed rule accuracy (i.e., confidence) corrected by the expected accuracy (of a random classifier).

See details: https://mhahsler.github.io/arules/docs/measures#kappa

Range: \([-1,1]\) (0 means the rule is not better than a random classifier)

"klosgen"

Defined as \(\sqrt{supp(X \cup Y)} \, (conf(X \Rightarrow Y) - supp(Y))\)

See details: https://mhahsler.github.io/arules/docs/measures#klosgen

Range: \([-1, 1]\) (0 for independence)
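A quick numeric check helps with reading the Klosgen definition: the square root of the rule support multiplies the difference between confidence and the support of Y, which is what makes the measure 0 under independence. The function klosgen() below is an illustrative helper written for this sketch, not a package function, and the support values are invented.

```r
# Illustrative helper: Klosgen measure from supports.
klosgen <- function(supp_xy, supp_x, supp_y) {
  conf <- supp_xy / supp_x
  sqrt(supp_xy) * (conf - supp_y)
}

# A positively associated rule: supp(X u Y) = 0.3, supp(X) = 0.4, supp(Y) = 0.5.
k_assoc <- klosgen(0.3, 0.4, 0.5)

# Independence: supp(X u Y) = supp(X) * supp(Y), so conf equals supp(Y).
k_indep <- klosgen(0.4 * 0.5, 0.4, 0.5)

print(c(associated = k_assoc, independent = k_indep))
```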

"kulczynski", kulc

Calculate the null-invariant Kulczynski measure with a preference for skewed patterns.

See details: https://mhahsler.github.io/arules/docs/measures#kulczynski

Range: \([0, 1]\)

"lambda", Goodman-Kruskal's \(\lambda\), predictive association

Goodman and Kruskal's lambda to assess the association between the LHS and RHS of the rule.

See details: https://mhahsler.github.io/arules/docs/measures#lambda

Range: \([0, 1]\)

"laplace", Laplace corrected confidence/accuracy, L

Estimates confidence by increasing each count by 1. Parameter k can be used to specify the number of classes (default is 2). Prevents counts of 0 and L decreases with lower support.

See details: https://mhahsler.github.io/arules/docs/measures#laplace-corrected-confidence

Range: \([0, 1]\)
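The statement that L decreases with lower support can be seen from the usual form of the Laplace estimate, (count(X and Y) + 1) / (count(X) + k). The sketch below assumes this form; laplace_conf() is a made-up helper and the counts are invented.

```r
# Illustrative helper: Laplace corrected confidence with k classes.
laplace_conf <- function(n_xy, n_x, k = 2) {
  (n_xy + 1) / (n_x + k)
}

# Two rules with the same raw confidence (0.75) but different support counts.
l_rare     <- laplace_conf(n_xy = 3,  n_x = 4)
l_frequent <- laplace_conf(n_xy = 30, n_x = 40)

print(c(rare = l_rare, frequent = l_frequent))
```

The rarely observed rule is pulled further towards 1/k, so it ends up with the lower corrected value.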

"leastContradiction", least contradiction

Probability of finding a matching transaction minus the probability of finding a contradicting transaction normalized by the probability of finding a transaction containing Y.

See details: https://mhahsler.github.io/arules/docs/measures#least-contradiction

Range: \([-1, 1]\)

"lerman", Lerman similarity

Defined as \(\sqrt{N} \frac{supp(X \cup Y) - supp(X)supp(Y)}{\sqrt{supp(X)supp(Y)}}\)

See details: https://mhahsler.github.io/arules/docs/measures#lerman-similarity

Range: \([0, 1]\)

"leverage", Piatetsky-Shapiro Measure, PS

PS measures the difference between how often X and Y appear together in the data set and what would be expected if X and Y were statistically independent. It can be interpreted as the gap to independence.

See details: https://mhahsler.github.io/arules/docs/measures#leverage

Range: \([-1, 1]\) (0 indicates independence)

"lift", interest factor

Lift quantifies dependence between X and Y by comparing the probability that X and Y are contained in a transaction to the expected probability under independence (i.e., the product of the probabilities that X is contained in a transaction times the probability that Y is contained in a transaction).

See details: https://mhahsler.github.io/arules/docs/measures#lift

Range: \([0, \infty]\) (1 means independence between LHS and RHS)
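Several of the basic rule measures in this list follow directly from four counts. The following base R sketch computes them by hand for one hypothetical rule; the variable names and counts are invented, and the formulas are the standard textbook definitions rather than the package code.

```r
# Hypothetical counts for a rule X => Y.
n    <- 100  # transactions
n_x  <- 40   # transactions containing X
n_y  <- 50   # transactions containing Y
n_xy <- 30   # transactions containing X and Y

support     <- n_xy / n          # estimate of P(X and Y)
coverage    <- n_x / n           # LHS support
rhs_support <- n_y / n
confidence  <- support / coverage

lift        <- confidence / rhs_support
leverage    <- support - coverage * rhs_support
conviction  <- (1 - rhs_support) / (1 - confidence)
added_value <- confidence - rhs_support

print(c(support = support, coverage = coverage, confidence = confidence,
        lift = lift, leverage = leverage, conviction = conviction,
        addedValue = added_value))
```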

"maxConfidence"

Null-invariant symmetric measure defined as the larger of the confidence of the rule or the rule with X and Y exchanged.

See details: https://mhahsler.github.io/arules/docs/measures#maxconfidence

Range: \([0, 1]\)
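Null-invariance means that a measure does not change when transactions containing neither X nor Y are added. A small base R sketch makes this visible; rule_measures() is a hypothetical helper, the counts are invented, and lift is included only as a contrast that does depend on the total number of transactions.

```r
# Illustrative helper: a few measures from counts.
rule_measures <- function(n_xy, n_x, n_y, n) {
  c(jaccard       = n_xy / (n_x + n_y - n_xy),
    kulczynski    = (n_xy / n_x + n_xy / n_y) / 2,
    maxConfidence = max(n_xy / n_x, n_xy / n_y),
    lift          = (n_xy / n) / ((n_x / n) * (n_y / n)))
}

# Same rule counts, once in 100 and once in 10000 transactions
# (the extra 9900 transactions contain neither X nor Y).
small <- rule_measures(n_xy = 30, n_x = 40, n_y = 50, n = 100)
large <- rule_measures(n_xy = 30, n_x = 40, n_y = 50, n = 10000)

print(rbind(small, large))
```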

"mutualInformation", uncertainty, M

Measures the information gain for Y provided by X.

See details: https://mhahsler.github.io/arules/docs/measures#mutual-information

Range: \([0, 1]\) (0 means that X does not provide information for Y)

"oddsRatio", odds ratio

The odds of finding X in transactions which contain Y divided by the odds of finding X in transactions which do not contain Y. For zero counts, the Haldane-Anscombe correction (adding .5 to all cells) is applied.

See details: https://mhahsler.github.io/arules/docs/measures#odds_ratio

Range: \([0, \infty]\) (\(1\) indicates that Y is not associated to X)

"oddsRatioCI", odds ratio confidence interval

Calculates the lower and upper bounds of the confidence interval around the odds ratio (using a normal approximation). The used confidence level defaults to 0.95, but can be adjusted with the additional parameter confidenceLevel.

See details: https://mhahsler.github.io/arules/docs/measures#odds-ratio

Range: \([0, \infty]\)
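A normal approximation for the odds ratio interval is commonly built on the log scale (Woolf's method). The sketch below shows that construction in base R with invented counts; whether the package uses exactly this formula is an assumption, and the variable confidence_level simply mirrors the parameter name mentioned above.

```r
# Hypothetical contingency counts for a rule X => Y.
n11 <- 30  # X and Y
n10 <- 10  # X without Y
n01 <- 20  # Y without X
n00 <- 40  # neither

confidence_level <- 0.95

odds_ratio <- (n11 * n00) / (n10 * n01)

# Standard error of log(odds ratio) and the two-sided normal quantile.
se_log <- sqrt(1 / n11 + 1 / n10 + 1 / n01 + 1 / n00)
z      <- qnorm(1 - (1 - confidence_level) / 2)

ci <- exp(log(odds_ratio) + c(lower = -1, upper = 1) * z * se_log)

print(odds_ratio)
print(ci)
```

An interval that excludes 1 suggests that X and Y are associated at the chosen confidence level.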

"phi", correlation coefficient \(\phi\)

Correlation coefficient between the transactions containing X and Y represented as two binary vectors. Phi correlation is equivalent to Pearson's Product Moment Correlation Coefficient \(\rho\) with 0-1 values.

See details: https://mhahsler.github.io/arules/docs/measures#phi-correlation-coefficient

Range: \([-1, 1]\) (0 when X and Y are independent)
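The equivalence with Pearson's correlation on 0-1 vectors is easy to verify in base R. The two indicator vectors below are invented; phi_formula uses the usual expression in terms of supports, and phi_cor calls cor() on the same data.

```r
# Indicator vectors for 100 hypothetical transactions:
# 30 contain X and Y, 10 only X, 20 only Y, 40 neither.
x <- rep(c(1, 1, 0, 0), times = c(30, 10, 20, 40))
y <- rep(c(1, 0, 1, 0), times = c(30, 10, 20, 40))

supp_x  <- mean(x)
supp_y  <- mean(y)
supp_xy <- mean(x * y)

# Phi from supports.
phi_formula <- (supp_xy - supp_x * supp_y) /
  sqrt(supp_x * (1 - supp_x) * supp_y * (1 - supp_y))

# Pearson correlation of the two binary vectors.
phi_cor <- cor(x, y)

print(c(formula = phi_formula, pearson = phi_cor))
```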

"ralambondrainy", Ralambondrainy Measure

The measure is defined as the probability that a transaction contains X but not Y. A smaller value is better.

See details: https://mhahsler.github.io/arules/docs/measures#ralambondrainy

Range: \([0, 1]\)

"rhsSupport", Support of the rule consequent

Range: \([0, 1]\)

"RLD", relative linkage disequilibrium

RLD is an association measure motivated by indices used in population genetics. It evaluates the deviation of the support of the whole rule from the support expected under independence given the supports of the LHS and the RHS.

See details: https://mhahsler.github.io/arules/docs/measures#relative-linkage-disequilibrium

The code was contributed by Silvia Salini.

Range: \([0, 1]\)

"rulePowerFactor", rule power factor

Product of support and confidence. Can be seen as rule confidence weighted by support.

See details: https://mhahsler.github.io/arules/docs/measures#rule-power-factor

Range: \([0, 1]\)

"sebag", Sebag-Schoenauer measure

Confidence of a rule divided by the confidence of the rule \(X \Rightarrow \overline{Y}\).

See details: https://mhahsler.github.io/arules/docs/measures#sebag-schoenauer

Range: \([0, \infty]\)

"stdLift", Standardized Lift

Standardized lift uses the minimum and maximum values that lift can reach for each rule to standardize lift between 0 and 1. By default, the measure is corrected for minimum support and minimum confidence. Correction can be disabled by using the argument correct = FALSE.

See details: https://mhahsler.github.io/arules/docs/measures#standardized-lift

Range: \([0, 1]\)

"support", supp

Support is an estimate of \(P(X \cup Y)\) and measures the generality of the rule.

See details: https://mhahsler.github.io/arules/docs/measures#support

Range: \([0, 1]\)

"table"

Returns the counts for the contingency table. The values are labeled \(n_{XY}\) where \(X\) and \(Y\) represent the presence (1) or absence (0) of the LHS and RHS of the rule, respectively. If several measures are specified, then the counts have the prefix table.

Range: counts
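The four contingency counts can be pictured as follows. This base R sketch starts from two invented logical vectors that mark which transactions contain the LHS and the RHS; the labels n11, n10, n01 and n00 follow the presence (1) and absence (0) convention described above, and the exact column names returned by the package are not shown here.

```r
# Hypothetical indicators for 8 transactions.
lhs <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE)
rhs <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE)

counts <- c(
  n11 = sum(lhs & rhs),    # LHS present, RHS present
  n10 = sum(lhs & !rhs),   # LHS present, RHS absent
  n01 = sum(!lhs & rhs),   # LHS absent,  RHS present
  n00 = sum(!lhs & !rhs)   # LHS absent,  RHS absent
)

print(counts)
```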

"varyingLiaison", varying rates liaison

Defined as the lift of a rule minus 1 so 0 represents independence.

See details: https://mhahsler.github.io/arules/docs/measures#Varying-Rates-Liaison

Range: \([-1, \infty]\) (0 for independence)

"yuleQ", Yule's Q

Defined as \(\frac{\alpha-1}{\alpha+1}\) where \(\alpha\) is the odds ratio.

See details: https://mhahsler.github.io/arules/docs/measures#yules-q-and-yules-y

Range: \([-1, 1]\)

"yuleY", Yule's Y

Defined as \(\frac{\sqrt{\alpha}-1}{\sqrt{\alpha}+1}\) where \(\alpha\) is the odds ratio.

See details: https://mhahsler.github.io/arules/docs/measures#yules-q-and-yules-y

Range: \([-1, 1]\)
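Both Yule measures are simple transformations of the odds ratio. The base R sketch below applies the two definitions to an invented odds ratio and to the independence case; yule_q() and yule_y() are illustrative helpers, not package functions.

```r
# Illustrative helpers implementing the two definitions above.
yule_q <- function(alpha) (alpha - 1) / (alpha + 1)
yule_y <- function(alpha) (sqrt(alpha) - 1) / (sqrt(alpha) + 1)

alpha <- 6  # hypothetical odds ratio of a rule

q <- yule_q(alpha)
y <- yule_y(alpha)

# An odds ratio of 1 (independence) maps to 0 for both measures.
q_indep <- yule_q(1)
y_indep <- yule_y(1)

print(c(Q = q, Y = y, Q_indep = q_indep, Y_indep = y_indep))
```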

References

A complete list of references for each individual measure is available in the following document:

Hahsler, Michael (2015). A Probabilistic Comparison of Commonly Used Interest Measures for Association Rules. URL: https://mhahsler.github.io/arules/docs/measures

Examples

library("arules")

data("Income")
rules <- apriori(Income)

## calculate a single measure and add it to the quality slot
quality(rules) <- cbind(quality(rules),
  hyperConfidence = interestMeasure(rules, measure = "hyperConfidence",
    transactions = Income))

inspect(head(rules, by = "hyperConfidence"))

## calculate several measures
m <- interestMeasure(rules, c("confidence", "oddsRatio", "leverage"),
  transactions = Income)
inspect(head(rules))
head(m)

## calculate all available measures for the first 5 rules and show them as a 
## table with the measures as rows
t(interestMeasure(head(rules, 5), transactions = Income))

## calculate measures on a different set of transactions (I use a sample here)
## Note: reuse = TRUE (default) would just return the stored support on the
##	data set used for mining
newTrans <- sample(Income, 100)
m2 <- interestMeasure(rules, "support", transactions = newTrans, reuse = FALSE) 
head(m2)

## calculate all available measures for the 5 frequent itemsets with highest support
its <- apriori(Income, parameter = list(target = "frequent itemsets"))
its <- head(its, 5, by = "support")
inspect(its)

interestMeasure(its, transactions = Income)
