diceR

Overview

The goal of diceR is to provide a systematic framework for generating diverse cluster ensembles in R. Cluster analysis involves many nuances. diceR provides a process and a suite of functions and tools for cluster discovery, guiding the user through the generation of diverse clustering solutions from data, ensemble formation, algorithm selection, and the arrival at a final consensus solution. We have additionally developed visual and analytical validation tools to help assess the final result. The wrapper function dice() lets the user obtain and assess results with a single call. Thus, the package is accessible to end users with limited statistical knowledge, while informaticians and statisticians have full access to the underlying functions, which are easily extended. More details can be found in our companion paper published in BMC Bioinformatics.

Installation

You can install diceR from CRAN with:

install.packages("diceR")

Or get the latest development version from GitHub:

# install.packages("devtools")
devtools::install_github("AlineTalhouk/diceR")

Example

The following example shows how to use the main function of the package, dice(). The data matrix hgsc contains a subset of gene expression measurements of High Grade Serous Carcinoma ovarian cancer patients from the publicly available Cancer Genome Atlas (TCGA) datasets, with samples as rows and features as columns. In the call below, nk specifies the number (or a range) of clusters, and reps specifies the number of subsamples of the data, each containing 80% of the full set of samples. We also specify the clustering algorithms to be used in algorithms and the ensemble functions used to aggregate them in cons.funs.

library(diceR)
data(hgsc)
obj <- dice(
  hgsc,
  nk = 4,
  reps = 5,
  algorithms = c("hc", "diana"),
  cons.funs = c("kmodes", "majority"),
  progress = FALSE,
  verbose = FALSE
)

The first few cluster assignments are shown below:

knitr::kable(head(obj$clusters))
|                    | kmodes| majority|
|:-------------------|------:|--------:|
|TCGA.04.1331_PRO.C5 |      2|        2|
|TCGA.04.1332_MES.C1 |      2|        2|
|TCGA.04.1336_DIF.C4 |      4|        2|
|TCGA.04.1337_MES.C1 |      2|        2|
|TCGA.04.1338_MES.C1 |      2|        2|
|TCGA.04.1341_PRO.C5 |      2|        2|
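
To illustrate what a "majority" consensus column represents, the following is a minimal base R sketch of majority voting over a matrix of cluster labels. It is not diceR's implementation: majority_vote() and the toy labels matrix are hypothetical, and the sketch assumes the labels have already been aligned across replicates (the function index lists relabel_class for that purpose) and that every sample was drawn in at least one replicate.

```r
# Hypothetical helper (not part of diceR): pick the most frequent label per sample.
# `labels` is a samples x replicates matrix; NA marks a sample left out of a subsample.
majority_vote <- function(labels) {
  apply(labels, 1, function(row) {
    votes <- table(row)  # table() ignores NA by default
    # Ties go to the label that sorts first, because which.max() returns the first maximum
    as.integer(names(votes)[which.max(votes)])
  })
}

# Toy ensemble: 5 samples, 3 replicates, labels assumed to be aligned already
labels <- cbind(
  rep1 = c(1, 1, 2, 2, NA),
  rep2 = c(1, 2, 2, 2, 1),
  rep3 = c(1, 1, NA, 2, 1)
)

consensus <- majority_vote(labels)
consensus
# 1 1 2 2 1
```

The second sample shows the vote at work: two replicates assign it to cluster 1 and one to cluster 2, so the consensus label is 1.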

You can also compare the base algorithms with the cons.funs using internal evaluation indices:

knitr::kable(obj$indices$ii$`4`)
|Algorithms      | calinski_harabasz|      dunn|      pbm| tau|      gamma|   c_index| davies_bouldin| mcclain_rao|    sd_dis|  ray_turi| g_plus| silhouette| s_dbw| Compactness| Connectivity|
|:---------------|-----------------:|---------:|--------:|---:|----------:|---------:|--------------:|-----------:|---------:|---------:|------:|----------:|-----:|-----------:|------------:|
|HC_Euclidean    |          3.104106| 0.2608547| 59.73711|   0|  0.4285714| 0.2844073|       1.839182|   0.8009149| 0.1306062| 1.4765665|      0|        NaN|   NaN|    24.83225|     41.62183|
|DIANA_Euclidean |         53.647400| 0.3348103| 33.87817|   0| -1.8750000| 0.1589442|       2.824201|   0.8051915| 0.2119281| 3.2978986|      0|  0.0692233|   NaN|    21.93396|    241.66310|
|kmodes          |         55.138600| 0.3396909| 50.51722|   0| -0.6822430| 0.1453599|       2.006752|   0.7972999| 0.1170829| 1.1408258|      0|  0.1253664|   NaN|    21.91494|    201.42540|
|majority        |         19.373248| 0.3544371| 85.05173|   0| -1.1651376| 0.2102487|       1.622799|   0.8019453| 0.1108674| 0.9200511|      0|  0.1884934|   NaN|    23.85408|     64.04921|
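
As a concrete example of an internal evaluation index, the sketch below computes the Calinski-Harabasz index in base R from its textbook definition: between-cluster dispersion divided by (k - 1), over within-cluster dispersion divided by (n - k). Higher values indicate more compact, better separated clusters. The function name and toy data are hypothetical, and this is not necessarily how diceR computes the calinski_harabasz column.

```r
# Hypothetical helper (not part of diceR): Calinski-Harabasz index.
# x:  numeric matrix, samples as rows
# cl: vector of cluster labels, one per row of x
calinski_harabasz <- function(x, cl) {
  x <- as.matrix(x)
  n <- nrow(x)
  k <- length(unique(cl))
  overall <- colMeans(x)
  wss <- 0  # within-cluster sum of squares
  bss <- 0  # between-cluster sum of squares
  for (g in unique(cl)) {
    xg <- x[cl == g, , drop = FALSE]
    centre <- colMeans(xg)
    wss <- wss + sum(sweep(xg, 2, centre)^2)
    bss <- bss + nrow(xg) * sum((centre - overall)^2)
  }
  (bss / (k - 1)) / (wss / (n - k))
}

# Toy data: four points on a line, two obvious groups
x <- matrix(c(0, 1, 10, 11), ncol = 1)

ch_good <- calinski_harabasz(x, c(1, 1, 2, 2))  # groups match the structure: 200
ch_bad  <- calinski_harabasz(x, c(1, 2, 1, 2))  # groups cut across it: 0.02
```

The partition that matches the structure of the data scores far higher, which is the kind of comparison the index table supports across algorithms and consensus functions.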

Pipeline

This figure is a visual schematic of the pipeline that dice() implements.

Please visit the overview page for more detail.
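
The stages named in the overview (diverse clustering solutions from subsamples, ensemble formation, and a final consensus) can be sketched in base R as follows. This is a simplified illustration under stated assumptions, not the pipeline dice() actually runs: the toy data, the two base clusterers, and the co-clustering consensus step are all chosen here for brevity, and algorithm selection and evaluation are omitted.

```r
set.seed(1)

# Toy data: two well-separated groups of 10 samples each (samples as rows)
x <- rbind(
  matrix(rnorm(20, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(20, mean = 5, sd = 0.3), ncol = 2)
)
n <- nrow(x)
k <- 2         # number of clusters
reps <- 10     # number of subsamples
p_item <- 0.8  # proportion of samples in each subsample

# Base clusterers: each takes a data matrix and k, and returns integer labels
algs <- list(
  hc = function(d, k) cutree(hclust(dist(d), method = "average"), k),
  km = function(d, k) kmeans(d, centers = k, nstart = 5)$cluster
)

together <- matrix(0, n, n)  # how often each pair landed in the same cluster
sampled  <- matrix(0, n, n)  # how often each pair was subsampled together

for (r in seq_len(reps)) {
  idx <- sort(sample(n, size = floor(p_item * n)))
  for (alg in algs) {
    cl <- alg(x[idx, , drop = FALSE], k)
    together[idx, idx] <- together[idx, idx] + outer(cl, cl, "==")
    sampled[idx, idx]  <- sampled[idx, idx] + 1
  }
}

# Consensus matrix: proportion of joint draws in which a pair was co-clustered
consensus <- together / pmax(sampled, 1)

# Final solution: cluster the samples using 1 - consensus as a dissimilarity
final <- cutree(hclust(as.dist(1 - consensus), method = "average"), k)
table(final, truth = rep(1:2, each = 10))
```

Working with pairwise co-clustering frequencies avoids having to align cluster labels across runs, which is why this sketch uses it in place of a voting-based consensus function.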

Monthly Downloads

707

Version

3.0.0

License

MIT + file LICENSE

Maintainer

Derek Chiu

Last Published

February 5th, 2025

Functions in diceR (3.0.0)

impute_missing: Impute missing values
min_fnorm: Minimize Frobenius norm between two matrices
graphs: Graphical Displays
impute_knn: K-Nearest Neighbours imputation
diceR-package: diceR: Diverse Cluster Ensemble in R
sigclust: Significant Testing of Clustering Results
hgsc: Gene expression data for High Grade Serous Carcinoma from TCGA
prepare_data: Prepare data for consensus clustering
relabel_class: Relabel classes to a standard
similarity: Similarity Matrices
LCA: Latent Class Analysis
consensus_cluster: Consensus clustering
consensus_combine: Combine algorithms
CSPA: Cluster-based Similarity Partitioning Algorithm (CSPA)
consensus_evaluate: Evaluate, trim, and reweigh algorithms
PAC: Proportion of Ambiguous Clustering
LCE: Linkage Clustering Ensemble
consensus_matrix: Consensus matrix
dice: Diverse Clustering Ensemble
compactness: Compactness Measure
pcn: Simulate and select null distributions on empirical gene-gene correlations
majority_voting: Majority voting
k_modes: K-modes
external_validity: External validity indices