protr (version 1.7-4)

parSeqSim: Parallel Protein Sequence Similarity Calculation Based on Sequence Alignment (In-Memory Version)

Description

Parallel calculation of protein sequence similarity based on sequence alignment.

Usage

parSeqSim(
  protlist,
  cores = 2,
  batches = 1,
  verbose = FALSE,
  type = "local",
  submat = "BLOSUM62",
  gap.opening = 10,
  gap.extension = 4
)

Value

An n x n similarity matrix, where n is the number of protein sequences in protlist.

Arguments

protlist

A list of length n containing n protein sequences. Each component of the list is a character string storing one protein sequence. Unknown sequences should be represented as "".
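
As a minimal sketch of the expected input shape, the snippet below builds a small list by hand and checks it before it would be passed on. The sequences are made up for illustration, and the checks are only a suggestion, not something parSeqSim() is documented to require from the caller.

```r
# Made-up sequences for illustration only; "" marks an unknown sequence.
protlist <- list(
  "MKTAYIAKQR",
  "",
  "GSHMLEDPVA"
)

# Each component should be a single character string.
is_valid <- vapply(
  protlist,
  function(s) is.character(s) && length(s) == 1L,
  logical(1)
)
stopifnot(all(is_valid))

# Positions of the unknown (empty) sequences.
unknown <- which(!nzchar(unlist(protlist)))
```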

cores

Integer. The number of CPU cores to use for parallel execution. Defaults to 2. Use the availableCores() function in the parallelly package to see how many cores are available.
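
One possible way to choose a value for cores is sketched below. Leaving one core free is an assumption made here for illustration, not a recommendation from the package, and the fallback to parallel::detectCores() is only used when parallelly is not installed.

```r
# Query the number of usable cores, preferring parallelly when it is installed.
n_avail <- if (requireNamespace("parallelly", quietly = TRUE)) {
  parallelly::availableCores()
} else {
  parallel::detectCores()  # may return NA on some platforms
}

# Leave one core free (an arbitrary choice), but never go below 1.
n_cores <- max(1L, as.integer(n_avail) - 1L, na.rm = TRUE)
```

The resulting n_cores could then be supplied as the cores argument.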

batches

Integer. The number of batches to split the pairwise similarity computations into. This is useful when you have a large number of protein sequences and enough CPU cores, but not enough RAM to compute and hold all the pairwise similarities in a single batch. Defaults to 1.
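
To picture what batching means, the sketch below enumerates the unordered sequence pairs for a small n and assigns them to contiguous batches. It illustrates the idea only and is not protr's internal code; the names n_seqs, n_batches, and batch_id are hypothetical.

```r
n_seqs <- 5L     # number of sequences (illustrative)
n_batches <- 3L  # plays the role of the batches argument

# All unordered pairs (i, j) with i < j, one pair per column.
pairs <- combn(n_seqs, 2)
n_pairs <- ncol(pairs)  # n * (n - 1) / 2 pairs

# Assign each pair to one of n_batches contiguous batches of near-equal size.
batch_id <- ceiling(seq_len(n_pairs) * n_batches / n_pairs)
pair_batches <- split(seq_len(n_pairs), batch_id)

# Only one batch of pairs would need to be held in memory at a time.
lengths(pair_batches)
```

Because the number of pairs grows as n * (n - 1) / 2, a larger number of batches keeps each batch small at the cost of more passes.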

verbose

Logical. Print the computation progress? Useful when batches > 1. Defaults to FALSE.

type

Type of alignment, either "global" (Needleman-Wunsch global alignment) or "local" (Smith-Waterman local alignment). Defaults to "local".
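
The difference between the two alignment types can be seen in a toy dynamic-programming sketch. The scoring used here (+1 match, -1 mismatch, a linear gap penalty of 2) is an assumption chosen for brevity; it is not the substitution-matrix, affine-gap scheme that parSeqSim() uses, and align_score() is a hypothetical helper, not part of protr.

```r
# Toy alignment score under an assumed scoring scheme (not BLOSUM62).
align_score <- function(a, b, type = c("local", "global"),
                        match = 1, mismatch = -1, gap = 2) {
  type <- match.arg(type)
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  n <- length(x)
  m <- length(y)
  H <- matrix(0, n + 1, m + 1)
  if (type == "global") {
    # Global alignment penalizes leading gaps.
    H[, 1] <- -gap * (0:n)
    H[1, ] <- -gap * (0:m)
  }
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      s <- if (x[i] == y[j]) match else mismatch
      best <- max(H[i, j] + s, H[i, j + 1] - gap, H[i + 1, j] - gap)
      # Local alignment may restart anywhere, so scores never drop below 0.
      H[i + 1, j + 1] <- if (type == "local") max(0, best) else best
    }
  }
  # Local: best-scoring sub-alignment; global: end-to-end score.
  if (type == "local") max(H) else H[n + 1, m + 1]
}

# The two sequences share only the motif "CDE".
align_score("AAACDEAAA", "WWWCDEWWW", type = "local")   # 3
align_score("AAACDEAAA", "WWWCDEWWW", type = "global")  # -3
```

The local score rewards the shared motif alone, while the global score is pulled down by the mismatching flanks.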

submat

Substitution matrix, default is "BLOSUM62", can be one of "BLOSUM45", "BLOSUM50", "BLOSUM62", "BLOSUM80", "BLOSUM100", "PAM30", "PAM40", "PAM70", "PAM120", or "PAM250".

gap.opening

The cost required to open a gap of any length in the alignment. Defaults to 10.

gap.extension

The cost to extend the length of an existing gap by 1. Defaults to 4.
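
Read together, gap.opening and gap.extension describe an affine gap penalty. The sketch below assumes the convention used by Biostrings, where a gap of length k costs gap.opening + k * gap.extension; gap_cost() is a hypothetical helper written for this illustration.

```r
# Affine gap cost, assuming cost = gap.opening + k * gap.extension.
gap_cost <- function(k, gap.opening = 10, gap.extension = 4) {
  gap.opening + k * gap.extension
}

# Costs of gaps of length 1, 2, and 3 under the defaults.
gap_cost(1:3)  # 14 18 22
```

Under this reading, one long gap is cheaper than several short gaps of the same total length, because the opening cost is paid only once.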

Author

Nan Xiao <https://nanx.me>

See Also

See parSeqSimDisk for the disk-based version.

Examples

if (FALSE) {

# Be careful when testing this since it involves parallelization
# and might produce unpredictable results in some environments

library("protr") # provides readFASTA() and parSeqSim()
library("Biostrings")
library("foreach")
library("doParallel")

s1 <- readFASTA(system.file("protseq/P00750.fasta", package = "protr"))[[1]]
s2 <- readFASTA(system.file("protseq/P08218.fasta", package = "protr"))[[1]]
s3 <- readFASTA(system.file("protseq/P10323.fasta", package = "protr"))[[1]]
s4 <- readFASTA(system.file("protseq/P20160.fasta", package = "protr"))[[1]]
s5 <- readFASTA(system.file("protseq/Q9NZP8.fasta", package = "protr"))[[1]]
plist <- list(s1, s2, s3, s4, s5)
(psimmat <- parSeqSim(plist, cores = 2, type = "local", submat = "BLOSUM62"))
}
