bigsnpr

{bigsnpr} is an R package for the analysis of massive SNP arrays, primarily designed for human genetics. It extends package {bigstatsr} with features for analyzing genotype data.

Quick demo

LIST OF FEATURES

Installation

In R, run

# install.packages("remotes")
remotes::install_github("privefl/bigsnpr")

or for the CRAN version

install.packages("bigsnpr")

Input formats

This package reads bed/bim/fam files (the preferred PLINK format) using functions snp_readBed() and snp_readBed2(). Before the data are read into this package's own format, quality control and conversion can be done with PLINK, which can be called directly from R using snp_plinkQC() and snp_plinkKINGQC().
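As a minimal sketch of the reading step, the snippet below converts the small example bed/bim/fam fileset bundled with the package and attaches the result. Writing the backing files to a temporary location via tempfile() is a choice made here for illustration only; with real data you would usually keep them next to the bed file.

```r
library(bigsnpr)

# Example PLINK fileset shipped with the package (bed/bim/fam triplet).
bedfile <- system.file("extdata", "example.bed", package = "bigsnpr")

# Convert to the package's on-disk format. The return value is the path
# to an .rds file; tempfile() is only used to avoid writing into the
# installed package directory.
rdsfile <- snp_readBed(bedfile, backingfile = tempfile())

# Attach the converted data in the current R session.
obj.bigSNP <- snp_attach(rdsfile)
dim(obj.bigSNP$genotypes)  # samples x variants
```

The conversion only needs to be done once; later sessions can call snp_attach() on the saved .rds path directly.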

This package can also read UK Biobank BGEN files using function snp_readBGEN(). This function takes around 40 minutes to read 1M variants for 400K individuals using 15 cores.

This package uses a class called bigSNP for representing SNP data. A bigSNP object is a list with some elements:

  • $genotypes: A FBM.code256. Rows are samples and columns are variants. This stores genotype calls or dosages (rounded to 2 decimal places).
  • $fam: A data.frame with some information on the individuals.
  • $map: A data.frame with some information on the variants.
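To make the three elements concrete, here is a short sketch that inspects the example bigSNP object provided by snp_attachExtdata(). The variable names are illustrative.

```r
library(bigsnpr)

# Example "bigSNP" object shipped with the package for examples and tests.
obj.bigSNP <- snp_attachExtdata()

# Overview of the list and its three elements.
str(obj.bigSNP, max.level = 2, strict.width = "cut")

# Genotype matrix: rows are samples, columns are variants.
G <- obj.bigSNP$genotypes
dim(G)
G[1:5, 1:8]  # subsetting reads only the requested block from disk

# Per-individual and per-variant information.
head(obj.bigSNP$fam)
head(obj.bigSNP$map)
```

The number of rows of $fam matches the number of rows of $genotypes, and the number of rows of $map matches its number of columns.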

Note that most of the algorithms of this package don't handle missing values. You can use snp_fastImpute() (taking a few hours for a chip of 15K x 300K) and snp_fastImputeSimple() (taking a few minutes only) to impute missing values of genotyped variants.
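A minimal sketch of the fast variant, assuming the bundled example dataset with missing genotypes ("example-missing.bed") and the "mode" imputation method. Which method is appropriate for real data is a separate question not addressed here.

```r
library(bigsnpr)

# Example dataset that contains missing genotypes.
bigsnp <- snp_attachExtdata("example-missing.bed")
G <- bigsnp$genotypes
sum(is.na(G[]))   # number of missing genotype calls before imputation

# Impute each variant with its most frequent genotype.
G2 <- snp_fastImputeSimple(G, method = "mode")
sum(is.na(G2[]))  # the returned matrix decodes imputed values

# Note: G itself still decodes imputed entries as missing;
# use G2 in downstream analyses.
```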

Package {bigsnpr} also provides functions that directly work on bed files with a few missing values (the bed_*() functions). See paper "Efficient toolkit implementing..".
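For illustration, the sketch below maps a bed file with bed() and computes per-variant summaries without converting it to a bigSNP first. It uses the bundled example file with missing values; the object names are illustrative.

```r
library(bigsnpr)

# Bed file with some missing values, shipped with the package.
bedfile <- system.file("extdata", "example-missing.bed", package = "bigsnpr")

# Map the bed file; no conversion to the bigSNP format is needed.
obj.bed <- bed(bedfile)
dim(obj.bed)  # samples x variants

# Genotype counts (including missing values) for the first 5 variants.
bed_counts(obj.bed, ind.col = 1:5)

# Allele frequencies for all variants, one row per variant.
maf_info <- bed_MAF(obj.bed)
head(maf_info)
```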

Polygenic scores

Polygenic scores are one of the main focuses of this package. There are 3 main methods currently available:

  • Penalized regressions with individual-level data (see paper and tutorial)

  • Clumping and Thresholding (C+T) and Stacked C+T (SCT) with summary statistics and individual-level data (see paper and tutorial)

  • LDpred2 with summary statistics (see paper and tutorial)
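As a rough sketch of the simplest of these approaches, plain C+T, the example below runs on the bundled example data. It is an illustration under stated assumptions, not the tutorial's workflow: the "summary statistics" come from a GWAS run on a random training subset of the same small dataset (in practice they would come from an external study), no covariates are used, and the p-value thresholds are arbitrary.

```r
library(bigsnpr)
set.seed(1)

obj.bigSNP <- snp_attachExtdata()
G   <- obj.bigSNP$genotypes
CHR <- obj.bigSNP$map$chromosome
POS <- obj.bigSNP$map$physical.position
y01 <- obj.bigSNP$fam$affection - 1  # case/control status coded as 0/1

# Assumption for illustration: split the samples into training and test sets.
ind.train <- sort(sample(nrow(G), 400))
ind.test  <- setdiff(rows_along(G), ind.train)

# Stand-in for summary statistics: a GWAS on the training set.
gwas <- big_univLogReg(G, y01.train = y01[ind.train], ind.train = ind.train)

# Clumping: keep the most significant variant within each LD clump.
ind.keep <- snp_clumping(G, infos.chr = CHR, infos.pos = POS,
                         ind.row = ind.train, S = abs(gwas$score),
                         thr.r2 = 0.2)

# Thresholding: -log10(p-values) of kept variants, and a grid of thresholds.
lpS.keep <- -predict(gwas)[ind.keep]
thrs <- c(0, 1, 2, 3)

# One polygenic score per test individual (rows) and threshold (columns).
prs <- snp_PRS(G, betas.keep = gwas$estim[ind.keep],
               ind.test = ind.test, ind.keep = ind.keep,
               lpS.keep = lpS.keep, thr.list = thrs)
dim(prs)
```

Each column of prs can then be compared with y01[ind.test] to choose a threshold. SCT and LDpred2 involve more steps and are not covered by this sketch.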

Possible upcoming features

You can request a feature by opening an issue.

Bug report / Support

How to make a great R reproducible example?

Please open an issue if you find a bug.

If you want help using {bigstatsr} (the big_*() functions), please open an issue on {bigstatsr}'s repo, or post on Stack Overflow with the tag bigstatsr.

I will always redirect you to GitHub issues if you email me, so that others can benefit from our discussion.


Monthly Downloads: 1,668
Version: 1.9.11
License: GPL-3
Last Published: February 16th, 2022

Functions in bigsnpr (1.9.11)

  • bed_counts: Counts
  • bed_cprodVec: Cross-product with a vector
  • download_beagle: Download Beagle 4.1
  • snp_ldpred2_inf: LDpred2
  • LD.wiki34: Long-range LD regions
  • bed-methods: Methods for the bed class
  • download_plink: Download PLINK
  • bed_MAF: Allele frequencies
  • snp_attach: Attach a "bigSNP" from backing files
  • download_1000G: Download 1000G
  • snp_MAX3: MAX3 statistic
  • coef_to_liab: Liability scale
  • SCT: Stacked C+T (SCT)
  • snp_attachExtdata: Attach a "bigSNP" for examples and tests
  • bed-class: Class bed
  • bed_randomSVD: Randomized partial SVD
  • bed_projectSelfPCA: Projecting PCA
  • bigSNP-class: Class bigSNP
  • bigsnpr-package: bigsnpr: Analysis of Massive SNP Arrays
  • same_ref: Determine reference divergence
  • seq_log: Sequence, evenly spaced on a logarithmic scale
  • snp_MAF: MAF
  • reexports: Objects exported from other packages
  • snp_autoSVD: Truncated SVD while limiting LD
  • snp_beagleImpute: Imputation
  • snp_modifyBuild: Modify genome build
  • snp_match: Match alleles
  • bed_projectPCA: Projecting PCA
  • bed_prodVec: Product with a vector
  • bed_scaleBinom: Binomial(2, p) scaling
  • snp_pcadapt: Outlier detection
  • bed_tcrossprodSelf: tcrossprod / GRM
  • snp_save: Save modifications
  • snp_scaleAlpha: Binomial(n, p) scaling
  • snp_plinkRmSamples: Remove samples
  • snp_fst: Fixation index (Fst)
  • snp_gc: Genomic Control
  • snp_prodBGEN: BGEN matrix product
  • snp_ancestry_summary: Estimation of ancestry proportions
  • snp_asGeneticPos: Interpolate to genetic positions
  • snp_PRS: PRS
  • snp_fake: Fake a "bigSNP"
  • snp_manhattan: Manhattan plot
  • snp_plinkQC: Quality Control
  • snp_cor: Correlation matrix
  • snp_ldsplit: Independent LD blocks
  • snp_plinkKINGQC: Relationship-based pruning
  • snp_simuPheno: Simulate phenotypes
  • snp_subset: Subset a bigSNP
  • snp_readBGI: Read variant info from one BGI file
  • snp_thr_correct: Thresholding and correction
  • snp_readBed: Read PLINK files into a "bigSNP"
  • snp_plinkIBDQC: Identity-by-descent
  • snp_fastImpute: Fast imputation
  • snp_fastImputeSimple: Fast imputation
  • snp_ld_scores: LD scores
  • snp_ldsc: LD score regression
  • CODE_012: code genotype calls (3) and missing values.
  • snp_lassosum2: lassosum2
  • bed_clumping: LD clumping
  • snp_getSampleInfos: Get sample information
  • snp_split: Split-parApply-Combine
  • snp_readBGEN: Read BGEN files into a "bigSNP"
  • sub_bed: Replace extension '.bed'
  • snp_qq: Q-Q plot
  • snp_writeBed: Write PLINK files from a "bigSNP"