bigsnpr

{bigsnpr} is an R package for the analysis of massive SNP arrays, primarily designed for human genetics. It enhances the features of package {bigstatsr} for the purpose of analyzing genotype data.

Installation

In R, run

# install.packages("remotes")
remotes::install_github("privefl/bigsnpr")

or for the CRAN version

install.packages("bigsnpr")

Input formats

This package reads bed/bim/fam files (PLINK's preferred format) using the functions snp_readBed() and snp_readBed2(). Before the data are read into this package's own format, quality control and conversion can be done with PLINK, which can be called directly from R using snp_plinkQC() and snp_plinkKINGQC().
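As a minimal sketch of the reading step, the snippet below converts the small example bed file that ships with the package and then attaches the result. The use of tempfile() as the backing file is an assumption made here to avoid writing next to the installed package; with your own data you would pass the path to your .bed file.

```r
library(bigsnpr)

# Example bed file bundled with the package (its .bim and .fam files sit next to it)
bedfile <- system.file("extdata", "example.bed", package = "bigsnpr")

# Convert to the package's format; a temporary backing file is used here
# so that nothing is written into the package directory
rdsfile <- snp_readBed(bedfile, backingfile = tempfile())

# Attach the resulting "bigSNP" object in the current R session
obj <- snp_attach(rdsfile)
class(obj)
```

snp_readBed() returns the path to the .rds file it created, so the conversion only needs to be done once; later sessions can call snp_attach() on that path directly.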

This package can also read UK Biobank BGEN files using function snp_readBGEN(). This function takes around 40 minutes to read 1M variants for 400K individuals using 15 cores.

This package uses a class called bigSNP for representing SNP data. A bigSNP object is a list with some elements:

  • $genotypes: An FBM.code256. Rows are samples and columns are variants. This stores genotype calls or dosages (rounded to 2 decimal places).
  • $fam: A data.frame with some information on the individuals.
  • $map: A data.frame with some information on the variants.
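To make the three elements concrete, here is a short sketch that inspects them on the example data attached by snp_attachExtdata(). It only illustrates the structure described above; the subsets printed are arbitrary choices.

```r
library(bigsnpr)

# Attach the small example dataset bundled with the package
obj <- snp_attachExtdata()

G <- obj$genotypes   # FBM.code256: samples in rows, variants in columns
class(G)
dim(G)
G[1:5, 1:8]          # genotype calls for a few samples and variants

str(obj$fam)         # one row per individual
str(obj$map)         # one row per variant
```

The number of rows of $fam matches the rows of $genotypes, and the number of rows of $map matches its columns.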

Note that most of the algorithms of this package don't handle missing values. You can use snp_fastImpute() (taking a few hours for a chip of 15K x 300K) and snp_fastImputeSimple() (taking a few minutes only) to impute missing values of genotyped variants.
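A minimal sketch of the quick imputation route, using the example file with missing values that ships with the package. The choice of method = "mode" is only for illustration; see the help page of snp_fastImputeSimple() for the other available methods.

```r
library(bigsnpr)

# Example dataset that contains missing genotypes
obj <- snp_attachExtdata("example-missing.bed")
G <- obj$genotypes
anyNA(G[])    # check for missing values before imputation

# Impute each variant with its most frequent genotype
G2 <- snp_fastImputeSimple(G, method = "mode")
anyNA(G2[])   # check again on the returned object
```

Use the returned object (G2 here) in downstream analyses: it is the one that decodes the imputed values, so the original G may still report missing values.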

Package {bigsnpr} also provides functions that directly work on bed files with a few missing values (the bed_*() functions). See paper "Efficient toolkit implementing..".

Polygenic scores

Polygenic scores are one of the main focuses of this package. There are currently 3 main methods available:

  • Penalized regressions with individual-level data (see paper and tutorial).

  • Clumping and Thresholding (C+T) and Stacked C+T (SCT) with summary statistics and individual-level data (see paper and tutorial).

  • LDpred2 with summary statistics (see paper and tutorial).
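As a rough illustration of the C+T idea only, the sketch below runs a GWAS on a training subset of the example data, clumps variants using the GWAS statistics, and computes scores for several p-value thresholds. The train/test split, the seed, and the threshold grid are arbitrary assumptions for this toy example; the tutorials mentioned above describe the recommended workflows.

```r
library(bigsnpr)

obj <- snp_attachExtdata()
G   <- obj$genotypes
CHR <- obj$map$chromosome
POS <- obj$map$physical.pos
y01 <- obj$fam$affection - 1          # case/control status coded as 0/1

set.seed(1)                           # arbitrary seed, for a reproducible split
ind.train <- sort(sample(nrow(G), 400))
ind.test  <- setdiff(rows_along(G), ind.train)

# GWAS on the training set (logistic regression, one variant at a time)
gwas <- big_univLogReg(G, y01.train = y01[ind.train], ind.train = ind.train)

# Clumping: within each LD window, keep the variant with the largest statistic
ind.keep <- snp_clumping(G, infos.chr = CHR, infos.pos = POS,
                         ind.row = ind.train, S = abs(gwas$score))

# Thresholding: -log10(p-values) of the kept variants and a grid of thresholds
lpS.keep <- -predict(gwas)[ind.keep]
thrs <- seq(0, 4, by = 0.5)

# One column of scores per threshold, one row per test individual
prs <- snp_PRS(G, betas.keep = gwas$estim[ind.keep],
               ind.test = ind.test, ind.keep = ind.keep,
               lpS.keep = lpS.keep, thr.list = thrs)
dim(prs)
```

In a real analysis the effect sizes would usually come from external summary statistics, and the threshold would be chosen on a validation set rather than on the test set.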

Possible upcoming features

You can request a feature by opening an issue.

Bug report / Support

How to make a great R reproducible example?

Please open an issue if you find a bug.

If you want help using {bigstatsr} (the big_*() functions), please open an issue on {bigstatsr}'s repo, or post on Stack Overflow with the tag bigstatsr.

I will always redirect you to GitHub issues if you email me, so that others can benefit from our discussion.

Version

1.12.15

License

GPL-3

Last Published

September 20th, 2024

Functions in bigsnpr (1.12.15)

  • snp_asGeneticPos: Interpolate to genetic positions
  • bigSNP-class: Class bigSNP
  • bed_tcrossprodSelf: tcrossprod / GRM
  • bed_scaleBinom: Binomial(2, p) scaling
  • bed_randomSVD: Randomized partial SVD
  • reexports: Objects exported from other packages
  • download_plink: Download PLINK
  • download_genetic_map: Download a genetic map
  • snp_autoSVD: Truncated SVD while limiting LD
  • snp_beagleImpute: Imputation
  • snp_fake: Fake a "bigSNP"
  • snp_cor: Correlation matrix
  • snp_PRS: PRS
  • snp_ancestry_summary: Estimation of ancestry proportions
  • snp_fastImputeSimple: Fast imputation
  • snp_fastImpute: Fast imputation
  • CODE_012: CODE_012: code genotype calls (3) and missing values.
  • snp_MAX3: MAX3 statistic
  • bed_clumping: LD clumping
  • snp_MAF: MAF
  • snp_ldsc: LD score regression
  • snp_manhattan: Manhattan plot
  • snp_ldsplit: Independent LD blocks
  • snp_ld_scores: LD scores
  • snp_prodBGEN: BGEN matrix product
  • snp_lassosum2: lassosum2
  • snp_plinkRmSamples: Remove samples
  • snp_getSampleInfos: Get sample information
  • snp_match: Match alleles
  • snp_readBGEN: Read BGEN files into a "bigSNP"
  • snp_modifyBuild: Modify genome build
  • snp_qq: Q-Q plot
  • seq_log: Sequence, evenly spaced on a logarithmic scale
  • same_ref: Determine reference divergence
  • snp_readBed: Read PLINK files into a "bigSNP"
  • snp_readBGI: Read variant info from one BGI file
  • snp_fst: Fixation index (Fst)
  • snp_plinkKINGQC: Relationship-based pruning
  • snp_plinkQC: Quality Control
  • snp_gc: Genomic Control
  • snp_simuPheno: Simulate phenotypes
  • snp_split: Split-parApply-Combine
  • snp_attachExtdata: Attach a "bigSNP" for examples and tests
  • snp_pcadapt: Outlier detection
  • snp_plinkIBDQC: Identity-by-descent
  • snp_subset: Subset a bigSNP
  • snp_attach: Attach a "bigSNP" from backing files
  • snp_writeBed: Write PLINK files from a "bigSNP"
  • sub_bed: Replace extension '.bed'
  • snp_save: Save modifications
  • snp_thr_correct: Thresholding and correction
  • snp_scaleAlpha: Binomial(n, p) scaling
  • bed_cprodVec: Cross-product with a vector
  • bed_prodVec: Product with a vector
  • download_1000G: Download 1000G
  • download_beagle: Download Beagle 4.1
  • bed-class: Class bed
  • SCT: Stacked C+T (SCT)
  • bed_MAF: Allele frequencies
  • bed_projectPCA: Projecting PCA
  • bed_counts: Counts
  • bed_projectSelfPCA: Projecting PCA
  • coef_to_liab: Liability scale
  • LD.wiki34: Long-range LD regions
  • [,bed,ANY,ANY,ANY-method: Accessor methods for class bed.
  • snp_ldpred2_inf: LDpred2
  • bed-methods: Methods for the bed class
  • bigsnpr-package: bigsnpr: Analysis of Massive SNP Arrays