blast.pdb: NCBI BLAST Sequence Search and Summary Plot of Hit Statistics

Description

Run NCBI blastp on a given sequence against the PDB, NR or swissprot sequence database. Produce plots of the match statistics of a BLAST result to facilitate hit selection.

Usage

blast.pdb(seq, database = "pdb", time.out = NULL, chain.single=TRUE)
get.blast(urlget, time.out = NULL, chain.single=TRUE)
## S3 method for class 'blast'
plot(x, cutoff = NULL, cut.seed = NULL, cluster = TRUE, mar = c(2, 5, 1, 1), cex = 1.5, ...)

Arguments

seq

a single element or multi-element character vector containing the query sequence. Alternatively a ‘fasta’ object from function get.seq can be provided.

database

a single element character vector specifying the database against which to search. Current options are ‘pdb’, ‘nr’ and ‘swissprot’.

time.out

integer specifying the number of seconds to wait for the blast reply before a time out occurs.

urlget

the URL from which to retrieve BLAST results. It is usually returned by blast.pdb when time.out is set and the time limit is reached.

chain.single

logical, if TRUE double-character NCBI PDB database chain identifiers are simplified to a single lowercase character, e.g. '1WF4_GG' becomes '1WF4_g'. If FALSE no conversion to match RCSB PDB files is performed.

x

BLAST results as obtained from the function blast.pdb.

cutoff

A numeric cutoff value, expressed as minus the log of the E-value, for returned hits. If NULL then the function will try to find a suitable cutoff near ‘cut.seed’, which can be used as an initial guide (see below).

cut.seed

A numeric seed cutoff value, used for initial cutoff estimation. If NULL then the seed position is set to the point of largest drop-off in normalized scores (i.e. the biggest jump in E-values).

cluster

Logical, if TRUE (and ‘cutoff’ is NULL) a clustering of normalized scores is performed to partition hits into groups by similarity to the query. If FALSE the partition point is set to the point of largest drop-off in normalized scores.

mar

A numerical vector of the form c(bottom, left, top, right) which gives the number of lines of margin to be specified on the four sides of the plot.

cex

a numerical single element vector giving the amount by which plot labels should be magnified relative to the default.

...

extra plotting arguments.
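
The chain identifier simplification described for ‘chain.single’ can be sketched in base R. This is a minimal illustration under an assumed rule (any chain identifier longer than one character is reduced to its first character in lowercase); it is not taken from the package source, and the identifiers in `ids` are hypothetical.

```r
# Hypothetical NCBI-style identifiers: PDB code, underscore, chain
ids <- c("1WF4_GG", "4Q21_A", "1WF4_HH")

# Split each identifier into its PDB code and chain parts
code  <- sub("_.*$", "", ids)
chain <- sub("^[^_]*_", "", ids)

# Assumed rule: reduce a multi-character chain to its first character, lowercased
long <- nchar(chain) > 1
chain[long] <- tolower(substr(chain[long], 1, 1))

simplified <- paste(code, chain, sep = "_")
print(simplified)
```

Single-character chain identifiers such as '4Q21_A' are left unchanged in this sketch.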

Value

The function blast.pdb returns a list with the following components:

bitscore: a numeric vector containing the raw score for each alignment.
evalue: a numeric vector containing the E-value of the raw score for each alignment.
mlog.evalue: a numeric vector containing minus the natural log of the E-value.
gi.id: a character vector containing the gi database identifier of each hit.
pdb.id: a character vector containing the PDB database identifier of each hit.
hit.tbl: a character matrix summarizing BLAST results for each reported hit, see below.
raw: a data frame summarizing BLAST results, note multiple hits may appear in the same row.
url: a single element character vector with the NCBI result URL and RID code. This can be passed to the get.blast function.

The function plot.blast produces a plot and returns a list with the following components:

hits: an ordered matrix detailing the subset of hits with a normalized score above the chosen cutoff. Database identifiers are listed along with their cluster group number.
pdb.id: a character vector containing the PDB database identifier of each hit above the chosen threshold.
gi.id: a character vector containing the gi database identifier of each hit above the chosen threshold.

Details

The blast.pdb function employs direct HTTP-encoded requests to the NCBI web server to run BLASTP, the protein search algorithm of the BLAST software package.

BLAST, currently the most popular pairwise sequence comparison algorithm for database searching, performs gapped local alignments via a heuristic strategy: it identifies short nearly exact matches or hits, bidirectionally extends non-overlapping hits resulting in ungapped extended hits or high-scoring segment pairs (HSPs), and finally extends the highest scoring HSP in both directions via a gapped alignment (Altschul et al., 1997).

For each pairwise alignment, BLAST reports the raw score, the bitscore and an E-value that assesses the statistical significance of the raw score. Note that, unlike the raw score, E-values are normalized with respect to both the substitution matrix and the query and database lengths.

Here we also return a corrected normalized score (mlog.evalue) that in our experience is easier to handle and store than conventional E-values. In practice, this score is equivalent to minus the natural log of the E-value. Note that, unlike the raw score, this score is independent of the substitution matrix and the query and database lengths, and thus is comparable between BLASTP searches.
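
As a minimal sketch of the conversion described above, the normalized score can be computed in base R. The E-values below are made up for illustration and are not real BLAST output; the vector names simply mirror the ‘evalue’ and ‘mlog.evalue’ components listed under Value.

```r
# Hypothetical E-values for five hits (illustrative, not real BLAST output)
evalue <- c(1e-120, 3e-85, 2e-40, 5e-10, 0.5)

# Normalized score: minus the natural log of the E-value
mlog.evalue <- -log(evalue)

# Larger scores correspond to more significant hits
print(round(mlog.evalue, 2))
```

In this sketch an E-value of exactly 0 would give Inf, so such values may need separate handling before plotting or clustering.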

Examining plots of BLAST alignment lengths, scores, E-values and normalized scores (-log(E-value)) from the blast.pdb function can aid in the identification of sensible hit similarity thresholds. This is facilitated by the plot.blast function.

If a ‘cutoff’ value is not supplied then a basic hierarchical clustering of normalized scores is performed, with initial group partitioning implemented at a hopefully sensible point in the vicinity of ‘h=cut.seed’. Inspection of the resultant plot can then be used to refine the value of ‘cut.seed’ or indeed ‘cutoff’. As the ‘cutoff’ value can vary depending on the desired application and the properties of the system under study, it is envisaged that ‘plot.blast’ will be called multiple times to aid selection of a suitable ‘cutoff’ value. See the examples below for further details.
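
The seed and clustering steps described above can be sketched with base R functions only. This is an illustrative approximation rather than the package's own implementation: the scores are hypothetical, the seed is assumed to be the midpoint of the largest drop between consecutive scores, and the selected hits are assumed to be those sharing a cluster with the best hit.

```r
# Hypothetical normalized scores (-log E-value), sorted from best to worst hit
score <- c(280, 275, 270, 190, 185, 60, 55, 20, 5)

# Seed: midpoint of the largest drop between consecutive scores (assumed rule)
drop <- abs(diff(score))
i <- which.max(drop)
cut.seed <- mean(score[c(i, i + 1)])

# Basic hierarchical clustering of the one-dimensional scores,
# partitioned by cutting the tree at height h = cut.seed
hc <- hclust(dist(score))
grps <- cutree(hc, h = cut.seed)

# Assumed selection rule: keep the hits that share a cluster with the best hit
keep <- grps == grps[1]

print(cut.seed)
print(data.frame(score, group = grps, keep))
```

Without clustering, the same partition point would simply be `score > cut.seed`, which corresponds to splitting at the largest drop-off in normalized scores.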

References

Grant, B.J. et al. (2006) Bioinformatics 22, 2695--2696.

‘BLAST’ is the work of Altschul et al.: Altschul, S.F. et al. (1990) J. Mol. Biol. 215, 403--410. Full details of the ‘BLAST’ algorithm, along with download and installation instructions can be obtained from: http://www.ncbi.nlm.nih.gov/BLAST/.

Examples

## Not run: 
# pdb <- read.pdb("4q21")
# blast <- blast.pdb( pdbseq(pdb) )
# 
# head(blast$hit.tbl)
# top.hits <- plot(blast)
# head(top.hits$hits)
# 
# ## Use 'get.blast()' to retrieve results at a later time.
# #x <- get.blast(blast$url)
# #head(x$hit.tbl)
# 
# # Examine and download 'best' hits
# top.hits <- plot.blast(blast, cutoff=188)
# head(top.hits$hits)
# #get.pdb(top.hits)
# ## End(Not run)