Note: a newer version (1.7-4) of this package is available.

protr

A comprehensive toolkit for generating numerical features of protein sequences, as described in Xiao et al. (2015) <DOI:10.1093/bioinformatics/btv042>.

Paper citation

Formatted citation:

Nan Xiao, Dong-Sheng Cao, Min-Feng Zhu, Qing-Song Xu (2015). protr/ProtrWeb: R package and web server for generating various numerical representation schemes of protein sequences. Bioinformatics, 31(11), 1857-1859.

BibTeX entry:

@article{Xiao2015,
  author  = {Xiao, Nan and Cao, Dong-Sheng and Zhu, Min-Feng and Xu, Qing-Song},
  title   = {protr/{ProtrWeb}: {R} package and web server for generating various numerical representation schemes of protein sequences},
  journal = {Bioinformatics},
  year    = {2015},
  volume  = {31},
  number  = {11},
  pages   = {1857--1859},
  doi     = {10.1093/bioinformatics/btv042}
}

Installation

To install protr from CRAN:

install.packages("protr")

Or try the latest version on GitHub:

remotes::install_github("nanxstats/protr")

Browse the package vignette for a quick start.

Shiny app

ProtrWeb, the Shiny web application built on protr, can be accessed from http://protr.org.

ProtrWeb is a user-friendly web application for computing the protein sequence descriptors (features) presented in the protr package.

List of supported descriptors

Commonly used descriptors

  • Amino acid composition descriptors

    • Amino acid composition
    • Dipeptide composition
    • Tripeptide composition
  • Autocorrelation descriptors

    • Normalized Moreau-Broto autocorrelation
    • Moran autocorrelation
    • Geary autocorrelation
  • CTD descriptors

    • Composition
    • Transition
    • Distribution
  • Conjoint Triad descriptors

  • Quasi-sequence-order descriptors

    • Sequence-order-coupling number
    • Quasi-sequence-order descriptors
  • Pseudo amino acid composition (PseAAC)

    • Pseudo amino acid composition
    • Amphiphilic pseudo amino acid composition
  • Profile-based descriptors

    • Profile-based descriptors derived by PSSM (Position-Specific Scoring Matrix)
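
As an illustration of what the simplest descriptors in this list compute, the base R sketch below derives amino acid composition and dipeptide composition for a toy sequence. The helper names `aac_sketch` and `dc_sketch` and the sequence `seq_toy` are hypothetical and not part of protr; the package's own functions for these descriptors are `extractAAC()` and `extractDC()`, whose output ordering and input checks may differ from this sketch.

```r
# The 20 standard amino acids, in one-letter codes.
aa <- c("A", "R", "N", "D", "C", "E", "Q", "G", "H", "I",
        "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V")

# Amino acid composition: fraction of each residue type in the sequence.
aac_sketch <- function(x) {
  residues <- strsplit(x, "")[[1]]
  counts <- table(factor(residues, levels = aa))
  setNames(as.numeric(counts) / length(residues), aa)
}

# Dipeptide composition: fraction of each of the 400 possible
# adjacent residue pairs among the (n - 1) pairs in the sequence.
dc_sketch <- function(x) {
  n <- nchar(x)
  pairs <- substring(x, 1:(n - 1), 2:n)
  dipeptides <- as.vector(outer(aa, aa, paste0))
  counts <- table(factor(pairs, levels = dipeptides))
  setNames(as.numeric(counts) / (n - 1), dipeptides)
}

seq_toy <- "MKVLAAGIVGLLLAQ"  # hypothetical toy sequence, not real data
aac <- aac_sketch(seq_toy)
dc <- dc_sketch(seq_toy)

# Optional: run the package function when protr is installed.
if (requireNamespace("protr", quietly = TRUE)) {
  print(head(protr::extractAAC(seq_toy)))
}
```

Both sketches return fixed-length vectors (20 and 400 values) regardless of sequence length, which is what makes composition descriptors convenient as model inputs.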

Proteochemometric (PCM) modeling descriptors

  • Scales-based descriptors derived by principal components analysis
    • Scales-based descriptors derived by amino acid properties (AAindex)
    • Scales-based descriptors derived by 20+ classes of 2D and 3D molecular descriptors (Topological, WHIM, VHSE, etc.)
    • Scales-based descriptors derived by factor analysis
    • Scales-based descriptors derived by multidimensional scaling
    • BLOSUM and PAM matrix-derived descriptors
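
The idea behind scales-based descriptors derived by principal components analysis can be sketched in base R as follows: reduce a per-residue property matrix to a few principal component "scales", map each residue of a sequence to its scale values, then summarize the resulting matrix with auto cross covariance so that sequences of different lengths yield vectors of the same length. Everything here is an assumption made for illustration: the property matrix `toy_props` is random noise rather than real physicochemical data, and `acc_sketch` and `scales_descriptor` are hypothetical helpers. The package's `extractScales()` and `acc()` may differ in normalization and in the ordering of the output.

```r
set.seed(1)

aa <- c("A", "R", "N", "D", "C", "E", "Q", "G", "H", "I",
        "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V")

# Hypothetical property matrix: 20 amino acids x 6 made-up properties.
toy_props <- matrix(rnorm(20 * 6), nrow = 20,
                    dimnames = list(aa, paste0("prop", 1:6)))

# Step 1: PCA over the amino acids; keep the first two components as scales.
pca <- prcomp(toy_props, center = TRUE, scale. = TRUE)
aa_scales <- pca$x[, 1:2, drop = FALSE]

# Step 2: auto and cross covariance between scale columns for lags 1..lag.
# For each lag, crossprod() gives all p x p column pairs at once.
acc_sketch <- function(mat, lag) {
  n <- nrow(mat)
  unlist(lapply(seq_len(lag), function(l) {
    head_part <- mat[1:(n - l), , drop = FALSE]
    tail_part <- mat[(1 + l):n, , drop = FALSE]
    as.vector(crossprod(head_part, tail_part)) / (n - l)
  }))
}

# Step 3: map residues to scales, then summarize with ACC.
scales_descriptor <- function(x, scales, lag) {
  residues <- strsplit(x, "")[[1]]
  acc_sketch(scales[residues, , drop = FALSE], lag)
}

# Two toy sequences of different lengths give vectors of equal length.
desc_a <- scales_descriptor("MKVLAAGIVGLLLAQ", aa_scales, lag = 3)
desc_b <- scales_descriptor("GIVGLLLAQ", aa_scales, lag = 3)
```

With two scales and a maximum lag of 3, each descriptor has 2 x 2 x 3 = 12 values. The lag must be smaller than the shortest sequence length.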

Similarity computation

Local and global pairwise sequence alignment for protein sequences:

  • Between two protein sequences
  • Parallelized pairwise similarity calculation with a list of protein sequences
  • Parallelized pairwise similarity calculation between two sets of protein sequences
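
To make the pairwise calculation concrete, here is a minimal base R sketch that scores global alignments with a Needleman-Wunsch recursion and fills a similarity matrix for a small list of sequences. It is a simplification under stated assumptions: it uses a flat match/mismatch score and a linear gap penalty instead of a substitution matrix such as BLOSUM62, it runs serially rather than in parallel, and `nw_score` and the toy sequences are hypothetical. In the package, the corresponding functions are `twoSeqSim()`, `parSeqSim()`, and `crossSetSim()`.

```r
# Global alignment score via the Needleman-Wunsch dynamic program.
nw_score <- function(a, b, match = 1, mismatch = -1, gap = -2) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  n <- length(x)
  m <- length(y)
  score <- matrix(0, n + 1, m + 1)
  score[, 1] <- gap * (0:n)  # aligning a prefix of x against nothing
  score[1, ] <- gap * (0:m)  # aligning a prefix of y against nothing
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      s <- if (x[i] == y[j]) match else mismatch
      score[i + 1, j + 1] <- max(score[i, j] + s,        # align x[i] with y[j]
                                 score[i, j + 1] + gap,  # gap in y
                                 score[i + 1, j] + gap)  # gap in x
    }
  }
  score[n + 1, m + 1]
}

# Hypothetical toy sequences, not real proteins.
seqs <- list(s1 = "MKVLAAG", s2 = "MKVLAAG", s3 = "MKILAG")

# Score every pair of sequences in the list.
idx <- seq_along(seqs)
sim <- outer(idx, idx,
             Vectorize(function(i, j) nw_score(seqs[[i]], seqs[[j]])))
dimnames(sim) <- list(names(seqs), names(seqs))
```

Because the scoring is symmetric, only the upper triangle actually needs computing, and the independent pairs are what a parallel implementation would distribute across cores.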

GO semantic similarity measures:

  • Between two groups of GO terms / two Entrez Gene IDs
  • Parallelized pairwise similarity calculation with a list of GO terms / Entrez Gene IDs

Miscellaneous tools and datasets

  • Retrieve protein sequences from UniProt
  • Read protein sequences in FASTA format
  • Read protein sequences in PDB format
  • Sanity check of the amino acid types appearing in the protein sequences
  • Protein sequence segmentation
  • Auto cross covariance (ACC) for generating scales-based descriptors of the same length
  • 20+ pre-computed 2D and 3D descriptor sets for the 20 amino acids to use with the scales-based descriptors
  • BLOSUM and PAM matrices for the 20 amino acids
  • Meta information of the 20 amino acids
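
A typical preprocessing flow with these tools is to read sequences and then drop any that contain non-standard residues before computing descriptors. The base R sketch below mimics that flow on in-memory FASTA text. The helpers `read_fasta_sketch` and `is_standard_protein` and the toy records are hypothetical; the package's own functions for these steps are `readFASTA()` and `protcheck()`, which handle real files and may behave differently on edge cases such as empty records.

```r
aa <- c("A", "R", "N", "D", "C", "E", "Q", "G", "H", "I",
        "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V")

# Minimal FASTA parser: header lines start with ">", and the sequence
# lines that follow a header are concatenated into one string.
# Assumes every header is followed by at least one sequence line.
read_fasta_sketch <- function(lines) {
  is_header <- startsWith(lines, ">")
  record_id <- cumsum(is_header)
  parts <- split(lines[!is_header], record_id[!is_header])
  seqs <- lapply(parts, paste, collapse = "")
  names(seqs) <- sub("^>", "", lines[is_header])
  seqs
}

# TRUE only if every residue is one of the 20 standard amino acids.
is_standard_protein <- function(x) {
  residues <- strsplit(x, "")[[1]]
  length(residues) > 0 && all(residues %in% aa)
}

# Hypothetical toy records; the second contains non-standard letters.
fasta_lines <- c(">seqA toy record", "MKVLA", "AGIVG",
                 ">seqB toy record", "MKXLB")

records <- read_fasta_sketch(fasta_lines)
valid <- vapply(records, is_standard_protein, logical(1))
clean <- records[valid]

# Optional: run the package check when protr is installed.
if (requireNamespace("protr", quietly = TRUE)) {
  print(protr::protcheck(records[[1]]))
}
```

Filtering up front avoids descriptor functions failing partway through a large batch on a single malformed sequence.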

Contribute

To contribute to this project, please take a look at the Contributing Guidelines first. Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.

Monthly Downloads

715

Version

1.7-0

License

BSD_3_clause + file LICENSE

Maintainer

Nan Xiao

Last Published

October 30th, 2023

Functions in protr (1.7-0)

  • AAFGC: Functional Group Counts Descriptors for 20 Amino Acids calculated by Dragon
  • AAPAM120: PAM120 Matrix for 20 Amino Acids
  • AARDF: RDF Descriptors for 20 Amino Acids calculated by Dragon
  • AAConn: Connectivity Indices Descriptors for 20 Amino Acids calculated by Dragon
  • AAMOE3D: 3D Descriptors for 20 Amino Acids calculated by MOE 2011.10
  • AAMetaInfo: Meta Information for the 20 Amino Acids
  • AAindex: AAindex Data of 544 Physicochemical and Biological Properties for 20 Amino Acids
  • OptAA3d: OptAA3d.sdf - 20 Amino Acids Optimized with MOE 2011.10 (Semiempirical AM1)
  • AATopoChg: Topological Charge Indices Descriptors for 20 Amino Acids calculated by Dragon
  • AATopo: Topological Descriptors for 20 Amino Acids calculated by Dragon
  • AAEdgeAdj: Edge Adjacency Indices Descriptors for 20 Amino Acids calculated by Dragon
  • AAPAM30: PAM30 Matrix for 20 Amino Acids
  • AAPAM250: PAM250 Matrix for 20 Amino Acids
  • AADescAll: All 2D Descriptors for 20 Amino Acids calculated by Dragon
  • extractCTDTClass: CTD Descriptors - Transition (with customized amino acid classification support)
  • extractCTDT: CTD Descriptors - Transition
  • AARandic: Randic Molecular Profiles Descriptors for 20 Amino Acids calculated by Dragon
  • extractCTDDClass: CTD Descriptors - Distribution (with customized amino acid classification support)
  • AAPAM70: PAM70 Matrix for 20 Amino Acids
  • AAPAM40: PAM40 Matrix for 20 Amino Acids
  • extractAPAAC: Amphiphilic Pseudo Amino Acid Composition (APseAAC) Descriptor
  • extractCTriad: Conjoint Triad Descriptor
  • extractAAC: Amino Acid Composition Descriptor
  • AAConst: Constitutional Descriptors for 20 Amino Acids calculated by Dragon
  • extractCTDCClass: CTD Descriptors - Composition (with customized amino acid classification support)
  • extractCTDD: CTD Descriptors - Distribution
  • acc: Auto Cross Covariance (ACC) for Generating Scales-Based Descriptors of the Same Length
  • extractBLOSUM: BLOSUM and PAM Matrix-Derived Descriptors
  • extractCTDC: CTD Descriptors - Composition
  • crossSetSim: Parallelized Protein Sequence Similarity Calculation Between Two Sets Based on Sequence Alignment (In-Memory Version)
  • extractPSSMAcc: Profile-based protein representation derived by PSSM (Position-Specific Scoring Matrix) and auto cross covariance
  • extractPSSMFeature: Profile-based protein representation derived by PSSM (Position-Specific Scoring Matrix)
  • parGOSim: Protein Similarity Calculation based on Gene Ontology (GO) Similarity
  • extractMDSScales: Scales-Based Descriptors derived by Multidimensional Scaling
  • extractGeary: Geary Autocorrelation Descriptor
  • extractMoran: Moran Autocorrelation Descriptor
  • extractMoreauBroto: Normalized Moreau-Broto Autocorrelation Descriptor
  • getUniProt: Retrieve Protein Sequences from UniProt by Protein ID
  • extractTC: Tripeptide Composition Descriptor
  • readFASTA: Read Protein Sequences in FASTA Format
  • extractScales: Scales-Based Descriptors derived by Principal Components Analysis
  • readPDB: Read Protein Sequences in PDB Format
  • extractFAScales: Scales-Based Descriptors derived by Factor Analysis
  • extractScalesGap: Scales-Based Descriptors derived by Principal Components Analysis (with Gap Support)
  • extractDescScales: Scales-Based Descriptors with 20+ classes of Molecular Descriptors
  • protr-package: protr: Generating Various Numerical Representation Schemes for Protein Sequences
  • protcheck: Protein sequence amino acid type sanity check
  • parSeqSimDisk: Parallelized Protein Sequence Similarity Calculation Based on Sequence Alignment (Disk-Based Version)
  • parSeqSim: Parallelized Protein Sequence Similarity Calculation Based on Sequence Alignment (In-Memory Version)
  • AAWalk: Walk and Path Counts Descriptors for 20 Amino Acids calculated by Dragon
  • removeGaps: Remove or replace gaps from protein sequences
  • extractDC: Dipeptide Composition Descriptor
  • extractPAAC: Pseudo Amino Acid Composition (PseAAC) Descriptor
  • extractProtFPGap: Amino Acid Properties Based Scales Descriptors (Protein Fingerprint) with Gap Support
  • extractPSSM: Compute PSSM (Position-Specific Scoring Matrix) for given protein sequence
  • twoGOSim: Protein Similarity Calculation based on Gene Ontology (GO) Similarity
  • AAWHIM: WHIM Descriptors for 20 Amino Acids calculated by Dragon
  • extractSOCN: Sequence-Order-Coupling Numbers
  • extractCTriadClass: Conjoint Triad Descriptor (with customized amino acid classification support)
  • extractQSO: Quasi-Sequence-Order (QSO) Descriptor
  • twoSeqSim: Protein Sequence Alignment for Two Protein Sequences
  • protseg: Protein Sequence Segmentation/Partition
  • extractProtFP: Amino Acid Properties Based Scales Descriptors (Protein Fingerprint)
  • AABLOSUM62: BLOSUM62 Matrix for 20 Amino Acids
  • AABLOSUM50: BLOSUM50 Matrix for 20 Amino Acids
  • AABurden: Burden Eigenvalues Descriptors for 20 Amino Acids calculated by Dragon
  • AABLOSUM80: BLOSUM80 Matrix for 20 Amino Acids
  • AACPSA: CPSA Descriptors for 20 Amino Acids calculated by Discovery Studio
  • AA2DACOR: 2D Autocorrelations Descriptors for 20 Amino Acids calculated by Dragon
  • AAGeom: Geometrical Descriptors for 20 Amino Acids calculated by Dragon
  • AA3DMoRSE: 3D-MoRSE Descriptors for 20 Amino Acids calculated by Dragon
  • AAGETAWAY: GETAWAY Descriptors for 20 Amino Acids calculated by Dragon
  • AABLOSUM45: BLOSUM45 Matrix for 20 Amino Acids
  • AAMolProp: Molecular Properties Descriptors for 20 Amino Acids calculated by Dragon
  • AAEigIdx: Eigenvalue-Based Indices Descriptors for 20 Amino Acids calculated by Dragon
  • AAInfo: Information Indices Descriptors for 20 Amino Acids calculated by Dragon
  • AABLOSUM100: BLOSUM100 Matrix for 20 Amino Acids
  • AAACF: Atom-Centred Fragments Descriptors for 20 Amino Acids calculated by Dragon
  • AAMOE2D: 2D Descriptors for 20 Amino Acids calculated by MOE 2011.10