extractPSSM: Compute PSSM (Position-Specific Scoring Matrix) for given protein sequence

Description

This function calculates the PSSM (Position-Specific Scoring Matrix) derived by PSI-Blast for given protein sequence or peptides.

Usage

extractPSSM(
  seq,
  start.pos = 1L,
  end.pos = nchar(seq),
  psiblast.path = NULL,
  makeblastdb.path = NULL,
  database.path = NULL,
  iter = 5,
  silent = TRUE,
  evalue = 10L,
  word.size = NULL,
  gapopen = NULL,
  gapextend = NULL,
  matrix = "BLOSUM62",
  threshold = NULL,
  seg = "no",
  soft.masking = FALSE,
  culling.limit = NULL,
  best.hit.overhang = NULL,
  best.hit.score.edge = NULL,
  xdrop.ungap = NULL,
  xdrop.gap = NULL,
  xdrop.gap.final = NULL,
  window.size = NULL,
  gap.trigger = 22L,
  num.threads = 1L,
  pseudocount = 0L,
  inclusion.ethresh = 0.002
)

Value

The original PSSM, a numeric matrix which has end.pos - start.pos + 1 columns and 20 named rows.

Arguments

seq: Character vector, as the input protein sequence.
start.pos: Optional integer denoting the start position of the fragment window. Default is 1, i.e. the first amino acid of the given sequence.
end.pos: Optional integer denoting the end position of the fragment window. Default is nchar(seq), i.e. the last amino acid of the given sequence.
psiblast.path: Character string indicating the path of the psiblast program. If NCBI Blast+ was previously installed in the operation system, the path will be automatically detected.
makeblastdb.path: Character string indicating the path of the makeblastdb program. If NCBI Blast+ was previously installed in the system, the path will be automatically detected.
database.path: Character string indicating the path of a reference database (a FASTA file).
iter: Number of iterations to perform for PSI-Blast.
silent: Logical. Whether the PSI-Blast running output should be shown or not (May not work on some Windows versions and PSI-Blast versions), default is TRUE.
evalue: Expectation value (E) threshold for saving hits. Default is 10.
word.size: Word size for wordfinder algorithm. An integer >= 2.
gapopen: Integer. Cost to open a gap.
gapextend: Integer. Cost to extend a gap.
matrix: Character string. The scoring matrix name (default is "BLOSUM62").
threshold: Minimum word score such that the word is added to the BLAST lookup table. A real value >= 0.
seg: Character string. Filter query sequence with SEG ("yes", "window locut hicut", or "no" to disable). Default is "no".
soft.masking: Logical. Apply filtering locations as soft masks? Default is FALSE.
culling.limit: An integer >= 0. If the query range of a hit is enveloped by that of at least this many higher-scoring hits, delete the hit. Incompatible with best.hit.overhang and best_hit_score_edge.
best.hit.overhang: Best Hit algorithm overhang value (A real value >= 0 and =< 0.5, recommended value: 0.1). Incompatible with culling_limit.
best.hit.score.edge: Best Hit algorithm score edge value (A real value >=0 and =< 0.5, recommended value: 0.1). Incompatible with culling_limit.
xdrop.ungap: X-dropoff value (in bits) for ungapped extensions.
xdrop.gap: X-dropoff value (in bits) for preliminary gapped extensions.
xdrop.gap.final: X-dropoff value (in bits) for final gapped alignment.
window.size: An integer >= 0. Multiple hits window size, To specify 1-hit algorithm, use 0.
gap.trigger: Number of bits to trigger gapping. Default is 22.
num.threads: Integer. Number of threads (CPUs) to use in the BLAST search. Default is 1.
pseudocount: Integer. Pseudo-count value used when constructing PSSM. Default is 0.
inclusion.ethresh: E-value inclusion threshold for pairwise alignments. Default is 0.002.

Author

Nan Xiao <https://nanx.me>

Details

For given protein sequences or peptides, PSSM represents the log-likelihood of the substitution of the 20 types of amino acids at that position in the sequence. Note that the output value is not normalized.

References

Altschul, Stephen F., et al. "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs." Nucleic acids research 25.17 (1997): 3389--3402.

Ye, Xugang, Guoli Wang, and Stephen F. Altschul. "An assessment of substitution scores for protein profile-profile comparison." Bioinformatics 27.24 (2011): 3356--3363.

Rangwala, Huzefa, and George Karypis. "Profile-based direct kernels for remote homology detection and fold recognition." Bioinformatics 21.23 (2005): 4239--4247.

Examples

Run this code

if (Sys.which("makeblastdb") == "" | Sys.which("psiblast") == "") {
  cat("Cannot find makeblastdb or psiblast. Please install NCBI Blast+ first")
} else {
  x <- readFASTA(system.file(
    "protseq/P00750.fasta",
    package = "protr"
  ))[[1]]
  dbpath <- tempfile("tempdb", fileext = ".fasta")
  invisible(file.copy(from = system.file(
    "protseq/Plasminogen.fasta",
    package = "protr"
  ), to = dbpath))

  pssmmat <- extractPSSM(seq = x, database.path = dbpath)

  dim(pssmmat) # 20 x 562 (P00750: length 562, 20 Amino Acids)
}

Run the code above in your browser using DataLab