fpSim: Fingerprint Search

Description

Search function for fingerprints, such as PubChem or atom pair fingerprints. Enables structure similarity comparisons, searching and clustering.

Usage

fpSim(x, y, sorted=TRUE, method="Tanimoto", 
		addone=1, cutoff=0, top="all", alpha=1, beta=1,
		parameters=NULL,scoreType="similarity")

Arguments

Query molecule of class numeric, FP or FPset (of length one) containing binary fingerprint data. Both x and y need to have the same number of bits and should contain the same type of fingerprints.

Subject molecule(s) of class numeric, matrix, FP or FPset containing binary fingerprint data.

sorted

return results sorted or unsorted

method

Similarity coefficient to return. One can choose here from several predefined similarity measures: "Tanimoto" (default), "Euclidean", "Tversky" or "Dice". Alternatively, one can pass on any custom similarity function containing the arguments a, b, c and d. For instance, one can define "myfct <- function(a, b, c, d) c/(alpha*a + beta*b + c)" and then pass on method=myfct. The variable 'c' is the number of "on-bits" common in both compounds, 'd' is the number of "off-bits" common in both compounds, and 'a' and 'b' are the number of "on-bits" that are unique in one or the other compound, respectively.

The predefined methods will run a C++ version of this function which is about twice as fast as the R version. When a custom similarity function is given however, it will fall back to using the R version.

addone

Value to add to numerator and denominator of similarity coefficient to avoid devision by zero when fingerprint(s) contain only "off-bits" (zeros). Note: if addone > 0 then fingerprints with no "on-bits" will receive the highest similarity score. Typically, this occurs only with extremely small molecules.

cutoff

allows to restrict results to hits above a similarity cutoff value; default cutoff=0 returns results for all compounds in y.

top

allows to restrict number of subject molecules to return; default top="all" returns results for all compounds in y above cutoff value.

alpha

Only used when method="Tversky". Allows to specify the weighting variable 'alpha' of the Tversky index: c/(alpha*a + beta*b + c)

beta

Only used when method="Tversky". Allows to specify the weighting variable 'beta' of the Tversky index.

parameters

Parameters for computing Z-scores, E-values, and p-values. Pass this data if you want these scores returned. This data can be generated with the genParameters function.

scoreType

If using the parameters argument, this argument specified which type of score the cutoff and sorted arguments should be applied to. It should be one of "similarity" (default), "zscore", "evalue", or "pvalue".

Value

Returns numeric vector with similarity coefficients as values and compound identifiers as names.

References

Tanimoto similarity coefficient: Tanimoto TT (1957) IBM Internal Report 17th Nov see also Jaccard P (1901) Bulletin del la Societe Vaudoisedes Sciences Naturelles 37, 241-272.

PubChem fingerprint specification: ftp://ftp.ncbi.nih.gov/pubchem/specifications/pubchem_fingerprints.txt

Examples

Run this code

## Load PubChem SDFset sample
data(sdfsample); sdfset <- sdfsample
cid(sdfset) <- sdfid(sdfset)

## Convert base 64 encoded fingerprints to character vector or binary matrix
fpset <- fp2bit(sdfset)

## Alternatively, one can use atom pair fingerprints 
## Not run: 
# fpset <- desc2fp(sdf2ap(sdfset))
# ## End(Not run)

## Pairwise compound structure comparisons
fpSim(x=fpset[1], y=fpset[2], method="Tanimoto")

## Structure similarity searching: x is query and y is fingerprint database  
fpSim(x=fpset[1], y=fpset) 

## Controlling the output
fpSim(x=fpset[1], y=fpset, method="Tversky", cutoff=0.4, top=4, alpha=0.5, beta=1) 

## Use custom distance function
myfct <- function(a, b, c, d) c/(a+b+c+d)
fpSim(x=fpset[1], y=fpset, method=myfct) 

## Compute fingerprint-based Tanimoto similarity matrix 
simMA <- sapply(cid(fpset), function(x) fpSim(x=fpset[x], fpset, sorted=FALSE)) 

## Hierarchical clustering with simMA as input
hc <- hclust(as.dist(1-simMA), method="single")

## Plot hierarchical clustering tree
plot(as.dendrogram(hc), edgePar=list(col=4, lwd=2), horiz=TRUE)