This function calculates the PSSM (Position-Specific Scoring Matrix) derived by PSI-Blast for a given protein sequence or peptide.
extractPSSM(
  seq,
  start.pos = 1L,
  end.pos = nchar(seq),
  psiblast.path = NULL,
  makeblastdb.path = NULL,
  database.path = NULL,
  iter = 5,
  silent = TRUE,
  evalue = 10L,
  word.size = NULL,
  gapopen = NULL,
  gapextend = NULL,
  matrix = "BLOSUM62",
  threshold = NULL,
  seg = "no",
  soft.masking = FALSE,
  culling.limit = NULL,
  best.hit.overhang = NULL,
  best.hit.score.edge = NULL,
  xdrop.ungap = NULL,
  xdrop.gap = NULL,
  xdrop.gap.final = NULL,
  window.size = NULL,
  gap.trigger = 22L,
  num.threads = 1L,
  pseudocount = 0L,
  inclusion.ethresh = 0.002
)
The original PSSM, a numeric matrix with 20 named rows and end.pos - start.pos + 1 columns.
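As a quick sanity check on the shape described above, the following sketch computes the dimensions one would expect for a given fragment window without running PSI-Blast. The helper name expected_pssm_dim and the toy sequence are hypothetical and are not part of protr.

```r
# Hypothetical helper (not part of protr): expected dimensions of the PSSM
# for a window of the input sequence, following the description above.
expected_pssm_dim <- function(seq, start.pos = 1L, end.pos = nchar(seq)) {
  stopifnot(
    is.character(seq), length(seq) == 1L,
    start.pos >= 1L, end.pos <= nchar(seq), start.pos <= end.pos
  )
  # 20 rows (one per amino acid type), one column per window position
  c(rows = 20L, cols = as.integer(end.pos - start.pos + 1L))
}

toy_seq <- "MDAMKRGLCCVLLLCGAVFVSPS" # made-up example sequence

expected_pssm_dim(toy_seq)                                # whole sequence
expected_pssm_dim(toy_seq, start.pos = 5L, end.pos = 14L) # 10-residue window
```

Comparing these numbers with dim() of the matrix actually returned is a cheap way to confirm that the intended window was used.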
seq: Character vector, as the input protein sequence.
start.pos: Optional integer denoting the start position of the fragment window. Default is 1, i.e. the first amino acid of the given sequence.
end.pos: Optional integer denoting the end position of the fragment window. Default is nchar(seq), i.e. the last amino acid of the given sequence.
psiblast.path: Character string indicating the path of the psiblast program. If NCBI Blast+ was previously installed in the operating system, the path will be automatically detected.
makeblastdb.path: Character string indicating the path of the makeblastdb program. If NCBI Blast+ was previously installed in the operating system, the path will be automatically detected.
database.path: Character string indicating the path of a reference database (a FASTA file).
iter: Number of iterations to perform for PSI-Blast. Default is 5.
silent: Logical. Whether to suppress the PSI-Blast running output (may not work on some Windows versions and PSI-Blast versions). Default is TRUE.
evalue: Expectation value (E) threshold for saving hits. Default is 10.
word.size: Word size for wordfinder algorithm. An integer >= 2.
gapopen: Integer. Cost to open a gap.
gapextend: Integer. Cost to extend a gap.
matrix: Character string. The scoring matrix name. Default is "BLOSUM62".
threshold: Minimum word score such that the word is added to the BLAST lookup table. A real value >= 0.
seg: Character string. Filter query sequence with SEG ("yes", "window locut hicut", or "no" to disable). Default is "no".
soft.masking: Logical. Apply filtering locations as soft masks? Default is FALSE.
culling.limit: An integer >= 0. If the query range of a hit is enveloped by that of at least this many higher-scoring hits, delete the hit. Incompatible with best.hit.overhang and best.hit.score.edge.
best.hit.overhang: Best Hit algorithm overhang value (a real value >= 0 and <= 0.5, recommended value: 0.1). Incompatible with culling.limit.
best.hit.score.edge: Best Hit algorithm score edge value (a real value >= 0 and <= 0.5, recommended value: 0.1). Incompatible with culling.limit.
xdrop.ungap: X-dropoff value (in bits) for ungapped extensions.
xdrop.gap: X-dropoff value (in bits) for preliminary gapped extensions.
xdrop.gap.final: X-dropoff value (in bits) for final gapped alignment.
window.size: An integer >= 0. Multiple hits window size. To specify the 1-hit algorithm, use 0.
gap.trigger: Number of bits to trigger gapping. Default is 22.
num.threads: Integer. Number of threads (CPUs) to use in the BLAST search. Default is 1.
pseudocount: Integer. Pseudo-count value used when constructing PSSM. Default is 0.
inclusion.ethresh: E-value inclusion threshold for pairwise alignments. Default is 0.002.
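Several of the arguments above carry range constraints or are mutually exclusive. The sketch below collects those documented constraints into a pre-flight check that could be run before a potentially slow PSI-Blast call. The function check_pssm_args is hypothetical and not part of protr; it only mirrors the limits stated in the argument descriptions and makes no claim about how extractPSSM itself validates its input.

```r
# Hypothetical pre-flight check (not part of protr). It encodes only the
# constraints stated in the argument descriptions above.
check_pssm_args <- function(word.size = NULL, threshold = NULL,
                            culling.limit = NULL,
                            best.hit.overhang = NULL,
                            best.hit.score.edge = NULL,
                            window.size = NULL) {
  problems <- character(0)

  if (!is.null(word.size) && word.size < 2) {
    problems <- c(problems, "word.size must be >= 2")
  }
  if (!is.null(threshold) && threshold < 0) {
    problems <- c(problems, "threshold must be >= 0")
  }
  if (!is.null(window.size) && window.size < 0) {
    problems <- c(problems, "window.size must be >= 0")
  }

  # Both best-hit options share the same documented range
  best_hit <- list(
    best.hit.overhang = best.hit.overhang,
    best.hit.score.edge = best.hit.score.edge
  )
  for (nm in names(best_hit)) {
    val <- best_hit[[nm]]
    if (!is.null(val) && (val < 0 || val > 0.5)) {
      problems <- c(problems, paste(nm, "must be between 0 and 0.5"))
    }
  }

  if (!is.null(culling.limit)) {
    if (culling.limit < 0) {
      problems <- c(problems, "culling.limit must be >= 0")
    }
    # culling.limit is documented as incompatible with the best-hit options
    if (!is.null(best.hit.overhang) || !is.null(best.hit.score.edge)) {
      problems <- c(
        problems,
        "culling.limit cannot be combined with best.hit.* options"
      )
    }
  }

  problems
}

ok  <- check_pssm_args(word.size = 3L, culling.limit = 2L)
bad <- check_pssm_args(culling.limit = 2L, best.hit.overhang = 0.1)
print(ok)
print(bad)
```

Returning a character vector of messages rather than stopping at the first violation lets the caller report every problem at once.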
Nan Xiao <https://nanx.me>
For a given protein sequence or peptide, the PSSM represents the log-likelihood of substitution by each of the 20 amino acid types at each position in the sequence. Note that the output values are not normalized.
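Because the returned scores are not normalized, downstream code often rescales them before use. The sketch below applies a logistic transform to a mock matrix; this is one common choice and is shown as an assumption, not as the method protr uses elsewhere. The mock matrix, its random scores, and the row-name order are all made up for illustration so the snippet runs without NCBI Blast+.

```r
# Mock 20 x 5 score matrix standing in for the output of extractPSSM().
# The row-name order here is an assumption made for illustration only.
aa <- c(
  "A", "R", "N", "D", "C", "Q", "E", "G", "H", "I",
  "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V"
)
set.seed(1)
mock_pssm <- matrix(
  sample(-5:8, 20 * 5, replace = TRUE),
  nrow = 20,
  dimnames = list(aa, NULL)
)

# Assumption: logistic squashing maps every raw score into (0, 1)
normalized <- 1 / (1 + exp(-mock_pssm))
range(normalized)

# Highest-scoring residue at each position (column) of the raw matrix
top_residue <- rownames(mock_pssm)[apply(mock_pssm, 2, which.max)]
top_residue
```

The transform is monotonic, so the ranking of residues within each column is the same before and after rescaling.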
Altschul, Stephen F., et al. "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs." Nucleic Acids Research 25.17 (1997): 3389-3402.
Ye, Xugang, Guoli Wang, and Stephen F. Altschul. "An assessment of substitution scores for protein profile-profile comparison." Bioinformatics 27.24 (2011): 3356-3363.
Rangwala, Huzefa, and George Karypis. "Profile-based direct kernels for remote homology detection and fold recognition." Bioinformatics 21.23 (2005): 4239-4247.
extractPSSMFeature, extractPSSMAcc
library(protr)

# Run only if both NCBI Blast+ programs can be found on the PATH
if (Sys.which("makeblastdb") == "" || Sys.which("psiblast") == "") {
  cat("Cannot find makeblastdb or psiblast. Please install NCBI Blast+ first.\n")
} else {
  # Query sequence shipped with protr
  x <- readFASTA(system.file(
    "protseq/P00750.fasta",
    package = "protr"
  ))[[1]]
  # Copy the bundled reference database to a temporary FASTA file
  dbpath <- tempfile("tempdb", fileext = ".fasta")
  invisible(file.copy(from = system.file(
    "protseq/Plasminogen.fasta",
    package = "protr"
  ), to = dbpath))
  pssmmat <- extractPSSM(seq = x, database.path = dbpath)
  dim(pssmmat) # 20 x 562 (P00750: length 562, 20 amino acids)
}