ide.filter: Percent Identity Filter

Description

Identify and filter subsets of sequences at a given sequence identity cutoff.

Usage

ide.filter(aln = NULL, ide = NULL, cutoff = 0.6, verbose = TRUE, ncore=1, nseg.scale=1)

Arguments

aln

sequence alignment list, obtained from seqaln or read.fasta, or an alignment character matrix. Not used if ide is given.

ide

an optional identity matrix obtained from seqidentity.

cutoff

a numeric identity cutoff value ranging between 0 and 1.

verbose

logical, if TRUE print details of the clustering process.

ncore

number of CPU cores used for the calculation. ncore>1 requires the package parallel to be installed.

nseg.scale

split the input data into the specified number of segments before running a multi-core calculation. See fit.xyz.

Value

Returns a list object with components:

ind

indices of the sequences below the cutoff value.

tree

an object of class "hclust", which describes the tree produced by the clustering process.

ide

a numeric matrix with all pairwise identity values.

Details

This function performs hierarchical cluster analysis of a given sequence identity matrix ide, or the identity matrix calculated from a given alignment aln, to identify sequences that fall below a given identity cutoff value cutoff.
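
The clustering step described above can be illustrated with base R alone. The following is a minimal sketch under stated assumptions, not the package's actual implementation: the toy identity matrix, the choice of cutting the tree at height 1 - cutoff, and the rule for picking one representative per cluster (the member with the highest summed identity to its cluster mates, here the hypothetical helper pick_rep) are all illustrative.

```r
# Toy pairwise identity matrix (hypothetical values, not from a real alignment).
# A and B are 90% identical, C and D are 80% identical, all other pairs 30%.
ide <- matrix(c(1.0, 0.9, 0.3, 0.3,
                0.9, 1.0, 0.3, 0.3,
                0.3, 0.3, 1.0, 0.8,
                0.3, 0.3, 0.8, 1.0),
              nrow = 4, byrow = TRUE,
              dimnames = list(c("A", "B", "C", "D"), c("A", "B", "C", "D")))

cutoff <- 0.6

# Convert identity to a distance and cluster hierarchically (complete linkage).
tree <- hclust(as.dist(1 - ide))

# Assumption: sequences joined below height 1 - cutoff are "too similar"
# and end up in the same cluster.
clusters <- cutree(tree, h = 1 - cutoff)

# Assumption: keep one representative per cluster, namely the member with the
# highest summed identity to the other members (ties go to the first member).
pick_rep <- function(ix) ix[which.max(rowSums(ide[ix, ix, drop = FALSE]))]
ind <- unname(sapply(split(seq_along(clusters), clusters), pick_rep))

# Pairwise identities among the retained sequences are all below the cutoff.
ide[ind, ind]
```

With these toy values, one sequence is kept from each of the two similar pairs, so the only remaining pairwise identity is the 0.3 cross-pair value.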

References

Grant, B.J. et al. (2006) Bioinformatics 22, 2695-2696.

Examples

data(kinesin)
attach(kinesin, warn.conflicts=FALSE)

ide.mat <- seqidentity(pdbs)

# Histogram of pairwise identity values
op <- par(no.readonly=TRUE)
par(mfrow=c(2,1))
hist(ide.mat[upper.tri(ide.mat)], breaks=30,xlim=c(0,1),
     main="Sequence Identity", xlab="Identity")

# Filter at a 60% identity cutoff, then recompute identities
# for the retained sequences only
k <- ide.filter(ide=ide.mat, cutoff=0.6)
ide.cut <- seqidentity(pdbs$ali[k$ind,])
hist(ide.cut[upper.tri(ide.cut)], breaks=10, xlim=c(0,1),
     main="Sequence Identity", xlab="Identity")

#plot(k$tree, axes = FALSE, ylab="Sequence Identity")
#print(k$ind) # selected
par(op)
detach(kinesin)
