nuggets
Extensible R framework for subgroup discovery (Atzmueller (2015)), contrast patterns (Chen (2022)), emerging patterns (Dong (1999)) and association rules (Agrawal (1994)). Both crisp (binary) and fuzzy data are supported. It generates conditions in the form of elementary conjunctions, evaluates them on a dataset and checks the induced sub-data for interesting statistical properties. Currently, the package searches for implicative association rules and conditional correlations (Hájek (1978)). A user-defined function may be defined to evaluate on each generated condition to search for custom patterns.
Installation
To install the stable version of nuggets
from CRAN, type the following
command within the R session:
install.packages("nuggets")
You can also install the development version of nuggets
from
GitHub with:
install.packages("devtools")
devtools::install_github("beerda/nuggets")
Examples
Search for Implicative Rules
We start with loading of the needed packages:
library(tidyverse)
library(nuggets)
We are going to use the CO2
dataset as an example:
head(CO2)
#> Plant Type Treatment conc uptake
#> 1 Qn1 Quebec nonchilled 95 16.0
#> 2 Qn1 Quebec nonchilled 175 30.4
#> 3 Qn1 Quebec nonchilled 250 34.8
#> 4 Qn1 Quebec nonchilled 350 37.2
#> 5 Qn1 Quebec nonchilled 500 35.3
#> 6 Qn1 Quebec nonchilled 675 39.2
First, the numeric columns need to be transformed to factors:
d <- mutate(CO2,
conc = cut(conc, c(-Inf, 175, 350, 675, Inf)),
uptake = cut(uptake, c(-Inf, 17.9, 28.3, 37.12)))
head(d)
#> Plant Type Treatment conc uptake
#> 1 Qn1 Quebec nonchilled (-Inf,175] (-Inf,17.9]
#> 2 Qn1 Quebec nonchilled (-Inf,175] (28.3,37.1]
#> 3 Qn1 Quebec nonchilled (175,350] (28.3,37.1]
#> 4 Qn1 Quebec nonchilled (175,350] <NA>
#> 5 Qn1 Quebec nonchilled (350,675] (28.3,37.1]
#> 6 Qn1 Quebec nonchilled (350,675] <NA>
Then every column can be dichotomized, i.e., dummy logical columns may be created for each factor level:
d <- dichotomize(d)
head(d)
#> # A tibble: 6 × 23
#> `Plant=Qn1` `Plant=Qn2` `Plant=Qn3` `Plant=Qc1` `Plant=Qc3` `Plant=Qc2`
#> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
#> 1 TRUE FALSE FALSE FALSE FALSE FALSE
#> 2 TRUE FALSE FALSE FALSE FALSE FALSE
#> 3 TRUE FALSE FALSE FALSE FALSE FALSE
#> 4 TRUE FALSE FALSE FALSE FALSE FALSE
#> 5 TRUE FALSE FALSE FALSE FALSE FALSE
#> 6 TRUE FALSE FALSE FALSE FALSE FALSE
#> # ℹ 17 more variables: `Plant=Mn3` <lgl>, `Plant=Mn2` <lgl>, `Plant=Mn1` <lgl>,
#> # `Plant=Mc2` <lgl>, `Plant=Mc3` <lgl>, `Plant=Mc1` <lgl>,
#> # `Type=Quebec` <lgl>, `Type=Mississippi` <lgl>,
#> # `Treatment=nonchilled` <lgl>, `Treatment=chilled` <lgl>,
#> # `conc=(-Inf,175]` <lgl>, `conc=(175,350]` <lgl>, `conc=(350,675]` <lgl>,
#> # `conc=(675, Inf]` <lgl>, `uptake=(-Inf,17.9]` <lgl>,
#> # `uptake=(17.9,28.3]` <lgl>, `uptake=(28.3,37.1]` <lgl>
Before starting to search for the rules, it is good idea to create the
vector of disjoints. Columns with equal values in the disjoint vector
will not be combined together. This will speed-up the search as it makes
no sense, e.g., to combine Plant=Qn1
and Plant=Qn2
in a single
condition.
disj <- sub("=.*", "", colnames(d))
print(disj)
#> [1] "Plant" "Plant" "Plant" "Plant" "Plant" "Plant"
#> [7] "Plant" "Plant" "Plant" "Plant" "Plant" "Plant"
#> [13] "Type" "Type" "Treatment" "Treatment" "conc" "conc"
#> [19] "conc" "conc" "uptake" "uptake" "uptake"
Once the data are prepared, the dig_implications
function may be
invoked. It takes the dataset as its first parameter and a pair of
“tidyselect” expressions to select the column names to appear in the
left- and right-hand side of the rule (antecedent and consequent).
result <- dig_implications(d,
antecedent = !starts_with("Treatment"),
consequent = starts_with("Treatment"),
disjoint = disj,
min_support = 0.02,
min_confidence = 0.8)
result <- arrange(result, desc(support))
print(result)
#> # A tibble: 225 × 7
#> antecedent consequent support confidence coverage lift count
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 {Type=Mississippi,uptake=… {Treatmen… 0.155 0.813 0.190 1.63 16
#> 2 {Type=Mississippi,uptake=… {Treatmen… 0.119 1 0.119 2 10
#> 3 {Plant=Mc3} {Treatmen… 0.0833 1 0.0833 2 7
#> 4 {Plant=Mc1} {Treatmen… 0.0833 1 0.0833 2 7
#> 5 {Plant=Qn1} {Treatmen… 0.0833 1 0.0833 2 7
#> 6 {Plant=Mc2} {Treatmen… 0.0833 1 0.0833 2 7
#> 7 {Plant=Mn1} {Treatmen… 0.0833 1 0.0833 2 7
#> 8 {Plant=Mn2} {Treatmen… 0.0833 1 0.0833 2 7
#> 9 {Plant=Mn3} {Treatmen… 0.0833 1 0.0833 2 7
#> 10 {Plant=Qc2} {Treatmen… 0.0833 1 0.0833 2 7
#> # ℹ 215 more rows
Custom Pattern Search
The nuggets
package allows to execute a user-defined callback function
on each generated frequent condition. That way a custom type of patterns
may be searched. The following example replicates the search for
implicative rules with the custom callback function. For that, a dataset
has to be dichotomized and the disjoint vector created as in the
previous example:
head(d)
#> # A tibble: 6 × 23
#> `Plant=Qn1` `Plant=Qn2` `Plant=Qn3` `Plant=Qc1` `Plant=Qc3` `Plant=Qc2`
#> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
#> 1 TRUE FALSE FALSE FALSE FALSE FALSE
#> 2 TRUE FALSE FALSE FALSE FALSE FALSE
#> 3 TRUE FALSE FALSE FALSE FALSE FALSE
#> 4 TRUE FALSE FALSE FALSE FALSE FALSE
#> 5 TRUE FALSE FALSE FALSE FALSE FALSE
#> 6 TRUE FALSE FALSE FALSE FALSE FALSE
#> # ℹ 17 more variables: `Plant=Mn3` <lgl>, `Plant=Mn2` <lgl>, `Plant=Mn1` <lgl>,
#> # `Plant=Mc2` <lgl>, `Plant=Mc3` <lgl>, `Plant=Mc1` <lgl>,
#> # `Type=Quebec` <lgl>, `Type=Mississippi` <lgl>,
#> # `Treatment=nonchilled` <lgl>, `Treatment=chilled` <lgl>,
#> # `conc=(-Inf,175]` <lgl>, `conc=(175,350]` <lgl>, `conc=(350,675]` <lgl>,
#> # `conc=(675, Inf]` <lgl>, `uptake=(-Inf,17.9]` <lgl>,
#> # `uptake=(17.9,28.3]` <lgl>, `uptake=(28.3,37.1]` <lgl>
print(disj)
#> [1] "Plant" "Plant" "Plant" "Plant" "Plant" "Plant"
#> [7] "Plant" "Plant" "Plant" "Plant" "Plant" "Plant"
#> [13] "Type" "Type" "Treatment" "Treatment" "conc" "conc"
#> [19] "conc" "conc" "uptake" "uptake" "uptake"
As we want to search for implicative rules with some minimum support and confidence, we define the variables to hold that thresholds. We also need to define a callback function that will be called for each found frequent condition. Its purpose is to generate the rules with the obtained condition as an antecedent:
min_support <- 0.02
min_confidence <- 0.8
f <- function(condition, support, foci_supports) {
conf <- foci_supports / support
sel <- !is.na(conf) & conf >= min_confidence & !is.na(foci_supports) & foci_supports >= min_support
conf <- conf[sel]
supp <- foci_supports[sel]
lapply(seq_along(conf), function(i) {
list(antecedent = format_condition(names(condition)),
consequent = format_condition(names(conf)[[i]]),
support = supp[[i]],
confidence = conf[[i]])
})
}
The callback function f()
defines three arguments: condition
,
support
and foci_supports
. The names of the arguments are not
random. Based on the argument names of the callback function, the
searching algorithm provides information to the function. Here
condition
is a vector of indices representing the conjunction of
predicates in a condition. By the predicate we mean the column in the
source dataset. The support
argument gets the relative frequency of
the condition in the dataset. foci_supports
is a vector of supports of
special predicates, which we call “foci” (plural of “focus”), within the
rows satisfying the condition. For implicative rules, foci are potential
rule consequents.
Now we can run the digging for rules:
result <- dig(d,
f = f,
condition = !starts_with("Treatment"),
focus = starts_with("Treatment"),
disjoint = disj,
min_length = 1,
min_support = min_support)
As we return a list of lists in the callback function, we have to flatten the first level of lists in the result and binding it into a data frame:
result <- result %>%
unlist(recursive = FALSE) %>%
map(as_tibble) %>%
do.call(rbind, .) %>%
arrange(desc(support))
print(result)
#> # A tibble: 225 × 4
#> antecedent consequent support confidence
#> <chr> <chr> <dbl> <dbl>
#> 1 {Type=Mississippi,uptake=(-Inf,17.9]} {Treatment=chilled} 0.155 0.813
#> 2 {Type=Mississippi,uptake=(28.3,37.1]} {Treatment=nonchill… 0.119 1
#> 3 {Plant=Mc3} {Treatment=chilled} 0.0833 1
#> 4 {Plant=Mc1} {Treatment=chilled} 0.0833 1
#> 5 {Plant=Qn1} {Treatment=nonchill… 0.0833 1
#> 6 {Plant=Mc2} {Treatment=chilled} 0.0833 1
#> 7 {Plant=Mn1} {Treatment=nonchill… 0.0833 1
#> 8 {Plant=Mn2} {Treatment=nonchill… 0.0833 1
#> 9 {Plant=Mn3} {Treatment=nonchill… 0.0833 1
#> 10 {Plant=Qc2} {Treatment=chilled} 0.0833 1
#> # ℹ 215 more rows