jackstraw: Statistical Inference for Unsupervised Learning

This R package performs association tests between the observed data and their systematic patterns of variation. Systematic variation can be modeled by latent variables, which can arise from biological processes, experimental conditions, environmental factors, and other sources. We often estimate these patterns using principal component analysis (PCA), factor analysis (FA), logistic factor analysis (LFA), K-means clustering, partitioning around medoids (PAM), and related methods. The jackstraw methods learn the over-fitting characteristics inherent in unsupervised learning, where the same observed data are used first to estimate the systematic patterns and then to test association with them (see circular analysis).

Using a variety of unsupervised learning techniques, the jackstraw provides a resampling strategy and testing scheme to estimate the statistical significance of association between the observed data and their systematic patterns of variation. For example, the cell cycle in microarray data may be estimated by principal components (PCs). We can then use the jackstraw for PCA to identify genes that are significantly associated with these PCs. On the other hand, cell identities in single-cell RNA-seq (scRNA-seq) data are often determined by K-means clustering or other unsupervised clustering algorithms. In that case, the jackstraw for clustering can identify single cells that are significant members of a given cluster.
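To make the resampling idea concrete, here is a minimal base-R sketch for the first PC only. It is an illustration under simplifying assumptions (pure-noise simulated data, a single PC, a plain F-statistic), not the package's implementation; the helper names pc1_of and f_stat are made up for this example. A small number of rows is permuted, PC1 is re-estimated, and the statistics of the permuted rows form an empirical null that already contains the over-fitting.

```r
# Minimal base-R sketch of the jackstraw idea for PC1 (illustration only).
set.seed(1)

m <- 200   # variables (rows)
n <- 20    # observations (columns)
dat <- matrix(rnorm(m * n), nrow = m, ncol = n)   # pure noise: no true association

# Hypothetical helper: estimate PC1 across observations from row-centered data.
pc1_of <- function(x) {
  x <- x - rowMeans(x)
  svd(x, nu = 0, nv = 1)$v[, 1]
}

# Hypothetical helper: F-statistic for regressing each row on one covariate.
f_stat <- function(x, lv) {
  x  <- x - rowMeans(x)
  lv <- lv - mean(lv)
  fitted_ss <- as.vector(x %*% lv)^2 / sum(lv^2)  # regression sum of squares (1 df)
  resid_ss  <- rowSums(x^2) - fitted_ss           # residual sum of squares (n - 2 df)
  fitted_ss / (resid_ss / (ncol(x) - 2))
}

obs_stat <- f_stat(dat, pc1_of(dat))

# Naive (circular) p-values: PC1 was estimated from the very rows being tested.
p_naive <- pf(obs_stat, 1, n - 2, lower.tail = FALSE)

# Jackstraw-style null: permute s rows, re-estimate PC1, keep the permuted rows' statistics.
s <- 10    # rows permuted per iteration
B <- 100   # resampling iterations
null_stat <- numeric(0)
for (b in seq_len(B)) {
  idx  <- sample(m, s)
  jack <- dat
  jack[idx, ] <- t(apply(dat[idx, , drop = FALSE], 1, sample))  # shuffle within each chosen row
  null_stat <- c(null_stat, f_stat(jack, pc1_of(jack))[idx])
}

# Empirical p-values: fraction of null statistics at least as large as the observed one.
p_jack <- sapply(obs_stat, function(f) mean(null_stat >= f))

# Compare how often each approach rejects at 0.05 on data with no real signal.
print(c(naive = mean(p_naive < 0.05), jackstraw = mean(p_jack < 0.05)))
```

Because the data here are pure noise, one would expect the naive p-values to reject more often than the nominal level, while the resampled null should sit closer to it. The exact numbers depend on the seed and on m, n, s, and B.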

Use cases

Using jackstraw_pca, we can find statistically significant variables with regard to the top r principal components (PCs). If we only specify r, we conduct association tests with all r PCs simultaneously. Alternatively, we can test association with a subset of the r PCs by using the optional argument r1. By specifying r (the total number of significant PCs) and r1 (a numeric vector of target PCs), jackstraw_pca finds statistically significant variables with respect to the r1 PCs, while accounting for the fact that there are r significant PCs. The package also supports truncated PCA, using the augmented implicitly restarted Lanczos bidiagonalization algorithm (IRLBA; jackstraw_irlba) or randomized Singular Value Decomposition (RSVD; jackstraw_rpca).
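The difference between r and r1 can be sketched as follows on simulated data. The data sizes, the effect strength, and the choices of s and B are arbitrary assumptions for illustration, and the example assumes that the result exposes the p-values as p.value; check ?jackstraw_pca for your installed version.

```r
# Usage sketch on simulated data; all sizes and settings are illustrative.
if (requireNamespace("jackstraw", quietly = TRUE)) {
  set.seed(1)
  m <- 300   # variables (rows)
  n <- 20    # observations (columns)
  lv  <- rnorm(n)                                   # hypothetical latent variable
  dat <- matrix(rnorm(m * n), nrow = m, ncol = n)
  # Let the first 30 rows depend on the latent variable.
  dat[1:30, ] <- dat[1:30, ] + 2 * matrix(lv, nrow = 30, ncol = n, byrow = TRUE)

  # Only r given: test association with all r = 1 PCs.
  js_all <- jackstraw::jackstraw_pca(dat, r = 1, s = 30, B = 10, verbose = FALSE)

  # r = 2 PCs declared significant, but only PC1 (r1 = 1) is tested.
  js_pc1 <- jackstraw::jackstraw_pca(dat, r = 2, r1 = 1, s = 30, B = 10, verbose = FALSE)

  # One p-value per row of dat.
  print(head(js_pc1$p.value))
}
```

Small s and B keep the sketch fast. Larger values give a finer-grained empirical null at the cost of more computation.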

Logistic factor analysis (LFA) estimates population structure from genetic data (single-nucleotide polymorphisms; SNPs). jackstraw_lfa provides corresponding association tests between SNPs and population structure, as estimated by LFA. Because of CRAN packaging requirements, lfa must be installed manually from Bioconductor. See the R help on lfa. More generally, jackstraw_subspace lets you directly specify your own estimation method for latent variables.

Instead of continuous latent variables estimated by PCA, LFA, or other methods, one may be interested in estimating discrete clusters from high-dimensional data. For K-means clustering, jackstraw_kmeans evaluates whether data points are significant members of a given cluster by testing association between the observed data and the cluster centers. This can help select data points that are reliable members of clusters and further improve the cluster membership. Note that to use the jackstraw for clustering, you must first apply the clustering algorithm to the data and provide the resulting object (e.g., kmeans.dat).
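A sketch of that two-step workflow, run K-means first and then pass the fitted object, might look like this. The simulated data, the row-centering step, and the values of s and B are assumptions made for the example, as is the name of the p-value element (p.F); consult ?jackstraw_kmeans for the exact arguments and return values in your version.

```r
# Usage sketch: cluster first, then test cluster membership (illustrative settings).
if (requireNamespace("jackstraw", quietly = TRUE)) {
  set.seed(1)
  # 60 data points (rows) by 10 features (columns); two groups with opposite patterns.
  pattern <- rep(c(2, -2), each = 5)
  dat <- rbind(
    matrix(rnorm(30 * 10), 30, 10) + matrix(pattern, 30, 10, byrow = TRUE),
    matrix(rnorm(30 * 10), 30, 10) - matrix(pattern, 30, 10, byrow = TRUE)
  )
  # Row-center up front (a choice made for this sketch) so that any centering
  # done inside the function leaves the data unchanged.
  dat <- dat - rowMeans(dat)

  # Step 1: apply the clustering algorithm yourself.
  kmeans.dat <- kmeans(dat, centers = 2, nstart = 10)

  # Step 2: pass both the data and the fitted clustering object.
  js <- jackstraw::jackstraw_kmeans(dat, kmeans.dat, s = 6, B = 50, verbose = FALSE)

  # Data points whose cluster membership is significant at the 0.05 level.
  reliable <- which(js$p.F < 0.05)
  print(length(reliable))
}
```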

Related algorithms, such as Partitioning Around Medoids (PAM) or k-medoids and Mini Batch K-means algorithms, are supported by jackstraw_pam and jackstraw_MiniBatchKmeans, respectively. Generally, jackstraw_cluster can be used for other clustering algorithms.

There are a few additional functions to support statistical inference for unsupervised learning, such as estimating the number of PCs or clusters (see find_k and permutationPA). Based on p-values, we can estimate posterior inclusion probabilities (PIPs) using pip.
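As a rough sketch of how these helpers could fit together, the example below estimates the number of significant PCs with permutationPA, runs jackstraw_pca, and converts the resulting p-values to PIPs with pip. The simulated data and all settings are assumptions for illustration, and pip needs the Bioconductor package qvalue. Argument names follow the package help pages as best understood here, so verify them with ?permutationPA and ?pip.

```r
# Workflow sketch: number of PCs, then p-values, then PIPs (illustrative settings).
if (requireNamespace("jackstraw", quietly = TRUE) &&
    requireNamespace("qvalue", quietly = TRUE)) {
  set.seed(1)
  m <- 1000  # variables (rows)
  n <- 20    # observations (columns)
  lv  <- rnorm(n)                                   # hypothetical latent variable
  dat <- matrix(rnorm(m * n), nrow = m, ncol = n)
  dat[1:200, ] <- dat[1:200, ] + 2 * matrix(lv, nrow = 200, ncol = n, byrow = TRUE)

  # Permutation parallel analysis: estimated number of significant PCs.
  pa <- jackstraw::permutationPA(dat, B = 10, threshold = 0.05, verbose = FALSE)
  print(pa$r)

  # Association tests with PC1 (r fixed at 1 here to keep the sketch simple).
  js <- jackstraw::jackstraw_pca(dat, r = 1, s = 100, B = 10, verbose = FALSE)

  # Posterior inclusion probabilities computed from the p-values.
  probs <- jackstraw::pip(js$p.value, verbose = FALSE)
  print(summary(probs))
}
```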

References

Chung, N.C. (2020) Statistical significance of cluster membership for unsupervised evaluation of cell identities. Bioinformatics, 36(10): 3107–3114 https://doi.org/10.1093/bioinformatics/btaa087

Chung, N.C. and Storey, J.D. (2015) Statistical significance of variables driving systematic variation in high-dimensional data. Bioinformatics, 31(4): 545–554 https://doi.org/10.1093/bioinformatics/btu674

Short Tutorials

Association Test with Principal Components with a Gentle Introduction to Latent Variable Models

Statistical Test of Cluster Memberships with a Toy Data Set (mtcars)

Unsupervised Evaluation of Cell Identities in Single Cell Genomics using the 10X Genomics Data

Installation

Bioconductor dependencies

Install the Bioconductor dependencies (lfa, gcatest, and qvalue) manually first:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install(c('qvalue', 'lfa', 'gcatest'))

The following jackstraw functions require Bioconductor packages:

  • jackstraw_lfa, pseudo_Rsq, and efron_Rsq require lfa.
  • jackstraw_lfa and jackstraw_alstructure require gcatest.
  • pip requires qvalue.
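If you are unsure which of these functions are usable on your machine, a small base-R check along these lines can help. The function-to-package mapping is copied from the list above; the names requirements, is_installed, and status are made up for this sketch.

```r
# Map each function to the Bioconductor packages it needs (from the list above).
# Note: jackstraw_alstructure also needs the optional alstructure package from GitHub.
requirements <- list(
  jackstraw_lfa         = c("lfa", "gcatest"),
  pseudo_Rsq            = "lfa",
  efron_Rsq             = "lfa",
  jackstraw_alstructure = "gcatest",
  pip                   = "qvalue"
)

# TRUE if the package can be loaded, FALSE otherwise (nothing is installed here).
is_installed <- function(pkg) requireNamespace(pkg, quietly = TRUE)

# For each function, record whether all of its required packages are available.
status <- vapply(
  requirements,
  function(pkgs) all(vapply(pkgs, is_installed, logical(1))),
  logical(1)
)

for (fn in names(status)) {
  message(fn, ": ", if (status[[fn]]) "ready" else "missing Bioconductor dependency")
}
```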

Development Version on GitHub

This package is in active development. Install jackstraw from GitHub:

install.packages("devtools")
library("devtools")
install_github("ncchung/jackstraw")

To use jackstraw_alstructure, install the optional alstructure package from GitHub:

library(devtools)
install_github("StoreyLab/alstructure")

Stable Version on CRAN

The stable version jackstraw v1.3.17 is on CRAN. To install from CRAN:

install.packages("jackstraw")

Implementations and Extensions

Here are some implementations of the jackstraw in different contexts and application domains.

Implementation of the jackstraw in Python is available:

jackstraw (Python) by Iain Carmichael

Extension of Jackstraw Inference for AJIVE Data Integration:

Jackstraw significance testing for JIVE in Python

The jackstraw is used in Seurat, an R toolkit for single-cell genomics:

Guided Clustering Tutorial

Determine statistical significance of PCA scores

Seurat Wizard (GUI Web App)

Monthly Downloads: 13,651

Version: 1.3.17

License: GPL-2

Last Published: September 16th, 2024

Functions in jackstraw (1.3.17)

  • efron_Rsq: Efron's Pseudo R-squared
  • jackstraw_kmeanspp: Non-Parametric Jackstraw for K-means Clustering using RcppArmadillo
  • jackstraw_MiniBatchKmeans: Non-Parametric Jackstraw for Mini Batch K-means Clustering
  • jackstraw_cluster: Jackstraw for the User-Defined Clustering Algorithm
  • jackstraw: jackstraw: Statistical Inference for Unsupervised Learning
  • jackstraw_alstructure: Non-Parametric Jackstraw for ALStructure
  • jackstraw_kmeans: Non-Parametric Jackstraw for K-means Clustering
  • find_k: Find a number of clusters or principal components
  • jackstraw_irlba: Non-Parametric Jackstraw for Principal Component Analysis (PCA) using the augmented implicitly restarted Lanczos bidiagonalization algorithm (IRLBA)
  • jackstraw_lfa: Non-Parametric Jackstraw for Logistic Factor Analysis
  • jackstraw_pca: Non-Parametric Jackstraw for Principal Component Analysis (PCA)
  • jackstraw_subspace: Jackstraw for the User-Defined Dimension Reduction Methods
  • jackstraw_rpca: Non-Parametric Jackstraw for Principal Component Analysis (PCA) using Randomized Singular Value Decomposition
  • jackstraw_pam: Non-Parametric Jackstraw for Partitioning Around Medoids (PAM)
  • permutationPA: Permutation Parallel Analysis
  • pip: Compute posterior inclusion probabilities (PIPs)
  • pseudo_Rsq: McFadden's Pseudo R-squared
  • Jurkat293T: A Jurkat:293T equal mixture dataset from Zheng et al. (2017)