is.redundant: Find Redundant Rules

Description

Provides the generic function is.redundant() and the method to find redundant rules based on any interest measure.

Usage

is.redundant(x, ...)
# S4 method for rules
is.redundant(
  x,
  measure = "confidence",
  confint = FALSE,
  level = 0.95,
  smoothCounts = 1,
  ...
)

Value

returns a logical vector indicating which rules are redundant.

Arguments

x: a set of rules.
...: additional arguments are passed on to interestMeasure() or, for confint = TRUE, to confint().
measure: measure used to check for redundancy.
confint: should confidence intervals be used for the redundancy check?
level: confidence level for the confidence interval. Only used when confint = TRUE.
smoothCounts: adds a "pseudo count" to each count in the contingency table that is used. This implements additive smoothing (Laplace smoothing) for counts and avoids zero counts.

Author

Michael Hahsler and Christian Buchta

Details

Simple improvement-based redundancy: (confint = FALSE) A rule can be defined as redundant if a more general rule with the same or a higher confidence exists. That is, a more specific rule is redundant if it is only equally or even less predictive than a more general rule. A rule is more general if it has the same RHS but one or more items removed from the LHS. Formally, a rule $X \Rightarrow Y$ is redundant if

$$\exists X' \subset X \quad conf(X' \Rightarrow Y) \ge conf(X \Rightarrow Y).$$

This is equivalent to a negative or zero improvement as defined by Bayardo et al. (2000).
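For reference, the improvement of a rule can be written in the notation of the condition above (the shorthand $imp$ is introduced here only for illustration):

$$imp(X \Rightarrow Y) = \min_{X' \subset X} \left( conf(X \Rightarrow Y) - conf(X' \Rightarrow Y) \right)$$

The redundancy condition then holds exactly when $imp(X \Rightarrow Y) \le 0$.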

The idea of improvement can be extended to other measures besides confidence. Any other measure available for function interestMeasure() (e.g., lift or the odds ratio) can be specified in measure.
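The improvement-based check can be illustrated with a minimal base R sketch. It does not use or reproduce the arules implementation: the rules, their confidence values, and the helper is_proper_subset() are made up for illustration, and all rules are assumed to share the same RHS.

```r
# Toy rules that all share the same RHS; each LHS is a character vector.
# The confidence values are invented for illustration.
lhs  <- list(character(0), "a", "b", c("a", "b"))
conf <- c(0.60, 0.70, 0.55, 0.68)

# Hypothetical helper: TRUE if x is a proper subset of y.
is_proper_subset <- function(x, y) length(x) < length(y) && all(x %in% y)

# A rule is redundant if any more general rule (proper subset LHS)
# has the same or a higher confidence.
redundant <- vapply(seq_along(lhs), function(i) {
  more_general <- vapply(lhs, is_proper_subset, logical(1), y = lhs[[i]])
  any(conf[more_general] >= conf[i])
}, logical(1))

print(redundant)
```

In this toy set, {b} is redundant because the empty-LHS rule already has a higher confidence, and {a, b} is redundant because {a} does. Replacing conf with another measure gives the generalization described above.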

Confidence interval-based redundancy: (confint = TRUE) Li et al. (2014) propose to use the confidence interval (CI) of the odds ratio (OR) of rules to define redundancy. A more specific rule is redundant if it does not provide a significantly higher OR than any more general rule. Using confidence intervals as error bounds, a more specific rule is defined as redundant if its OR CI overlaps with the CI of any more general rule. This type of redundancy detection removes more rules than improvement because it accounts for differences in counts that are due to randomness in the dataset.

The odds ratio and the CI are based on counts that can be zero, which leads to numerical problems. In addition to the method described by Li et al. (2014), we use additive smoothing (Laplace smoothing) to alleviate this problem. The default setting adds 1 to each count (see confint()). A different pseudocount (smoothing parameter) can be defined using the additional parameter smoothCounts. Smoothing can be disabled using smoothCounts = 0.
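The overlap check and the effect of the pseudocount can be sketched in base R. This is not the arules code: the helpers or_ci() and overlaps() are hypothetical, the counts are made up, and the interval is assumed to be the standard Wald interval on the log odds ratio, which may differ in detail from what confint() computes.

```r
# Hypothetical helper: Wald CI for the odds ratio of a rule X => Y from its
# 2x2 contingency table (n11 = X and Y, n10 = X without Y, n01 = Y without X,
# n00 = neither), with an additive pseudocount `smooth` added to every cell.
or_ci <- function(n11, n10, n01, n00, level = 0.95, smooth = 1) {
  counts <- c(n11, n10, n01, n00) + smooth
  log_or <- log(counts[1] * counts[4] / (counts[2] * counts[3]))
  se <- sqrt(sum(1 / counts))
  z <- qnorm(1 - (1 - level) / 2)
  exp(c(lower = log_or - z * se, estimate = log_or, upper = log_or + z * se))
}

# Hypothetical helper: TRUE if two intervals share at least one point.
overlaps <- function(a, b) {
  unname(a["lower"] <= b["upper"] && b["lower"] <= a["upper"])
}

# Made-up counts from 1000 transactions for a general and a more specific rule.
general  <- or_ci(n11 = 400, n10 = 100, n01 = 300, n00 = 200)
specific <- or_ci(n11 = 90,  n10 = 10,  n01 = 610, n00 = 290)

# Under the CI-based definition, the specific rule is redundant if the CIs overlap.
specific_redundant <- overlaps(specific, general)
print(rbind(general, specific))
print(specific_redundant)

# A zero count breaks the unsmoothed interval; the pseudocount keeps it finite.
print(or_ci(50, 0, 30, 20, smooth = 0))
print(or_ci(50, 0, 30, 20, smooth = 1))
```

Here the more specific rule has a higher point estimate but a much wider interval because it covers fewer transactions, so the intervals overlap and the rule counts as redundant.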

Warning: This approach to redundancy checking is flawed. Rules with non-overlapping CIs are correctly treated as non-redundant (the same result as for a 2-sample t-test), but overlapping CIs do not automatically mean that there is no significant difference between the two measures. Treating every overlap as redundancy therefore leads to a higher type II error. At the same time, multiple comparisons are performed, leading to an increased type I error. If we are more worried about missing important rules, then the type II error is more concerning.
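The first point of the warning can be checked numerically. The following base R sketch uses two made-up estimates with known standard errors and a simple two-sided z-test as a stand-in for the comparison; none of this is part of arules.

```r
# Two made-up estimates with equal standard errors.
est <- c(a = 10, b = 13)
se  <- c(a = 1, b = 1)
z   <- qnorm(0.975)

# 95% confidence intervals for each estimate.
ci_lower <- est - z * se
ci_upper <- est + z * se

# The intervals overlap if b's lower bound does not exceed a's upper bound.
cis_overlap <- unname(ci_lower["b"] <= ci_upper["a"])

# Two-sided z-test for the difference between the two estimates.
z_stat  <- unname((est["b"] - est["a"]) / sqrt(sum(se^2)))
p_value <- 2 * pnorm(-abs(z_stat))
significant <- p_value < 0.05

print(c(cis_overlap = cis_overlap, significant = significant))
```

Both values are TRUE: the intervals overlap, yet the difference is significant at the 5% level. An overlap-based check would call the second estimate redundant here, which is the type II error described above.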

Confidence interval-based redundancy checks can also be used for other measures with a confidence interval like confidence (see confint()).

References

Bayardo, R., R. Agrawal, and D. Gunopulos (2000). Constraint-based rule mining in large, dense databases. Data Mining and Knowledge Discovery, 4(2/3):217--240.

Li, J., Jixue Liu, Hannu Toivonen, Kenji Satou, Youqiang Sun, and Bingyu Sun (2014). Discovering statistically non-redundant subgroups. Knowledge-Based Systems. 67 (September, 2014), 315--327. doi:10.1016/j.knosys.2014.04.030

Examples


library("arules")
data("Income")

## mine some rules with the consequent "language in home=english"
rules <- apriori(Income, parameter = list(support = 0.5),
  appearance = list(rhs = "language in home=english"))

## for better comparison we add Bayardo's improvement and sort by improvement
quality(rules)$improvement <- interestMeasure(rules, measure = "improvement")
rules <- sort(rules, by = "improvement")
inspect(rules)
is.redundant(rules)

## find non-redundant rules using improvement of confidence
## Note: a few rules have a very small improvement over the rule {} => {language in home=english}
rules_non_redundant <- rules[!is.redundant(rules)]
inspect(rules_non_redundant)

## use non-overlapping confidence intervals for the confidence measure instead
## Note: fewer rules have a significantly higher confidence
inspect(rules[!is.redundant(rules, measure = "confidence",
  confint = TRUE, level = 0.95)])

## find non-redundant rules using improvement of the odds ratio.
quality(rules)$oddsRatio <- interestMeasure(rules, measure = "oddsRatio", smoothCounts = 0.5)
inspect(rules[!is.redundant(rules, measure = "oddsRatio")])

## use the confidence interval for the odds ratio.
## We see that no rule has a significantly better odds ratio than the most general rule.
inspect(rules[!is.redundant(rules, measure = "oddsRatio",
  confint = TRUE, level = 0.95)])

##  use the confidence interval for lift
inspect(rules[!is.redundant(rules, measure = "lift",
  confint = TRUE, level = 0.95)])
