Functions for computing similarity between two vectors or sets. See "Details" for exact formulas.
- Cosine similarity is a measure of similarity between two vectors of an inner product space that measures the cosine of the angle between them.
- Tversky index is an asymmetric similarity measure on sets that compares a variant to a prototype.
- Overlap coefficient is a similarity measure related to the Jaccard index that measures the overlap between two sets. It is defined as the size of the intersection divided by the size of the smaller of the two sets.
- Jaccard index is a statistic used for comparing the similarity and diversity of sample sets.
- Morisita's overlap index is a statistical measure of dispersion of individuals in a population. It is used to compare overlap among samples (Morisita 1959). This formula is based on the assumption that increasing the size of the samples will increase the diversity because it will include different habitats (i.e. different faunas).
- Horn's overlap index is an overlap measure based on Shannon's entropy.
Use the repOverlap function for computing similarities of clonesets.
cosine.similarity(.alpha, .beta, .do.norm = NA, .laplace = 0)
tversky.index(x, y, .a = 0.5, .b = 0.5)
overlap.coef(.alpha, .beta)
jaccard.index(.alpha, .beta, .intersection.number = NA)
morisitas.index(.alpha, .beta, .do.unique = T)
horn.index(.alpha, .beta, .do.unique = T)
x, y, .alpha, .beta: Vector of numeric values for cosine.similarity; vector of any values (like characters) for tversky.index and overlap.coef; matrix or data.frame with 2 columns for morisitas.index and horn.index; either two sets or two numbers of elements in sets for jaccard.index.
.do.norm: One of three values: NA, T or F. If NA, then check whether the input is a distribution (sum(.data) == 1) and normalise if needed with the given Laplace correction value. If T, then do normalisation and Laplace correction. If F, then don't do normalisation and Laplace correction.
.laplace: Value for Laplace correction.
.a, .b: Alpha and beta parameters for the Tversky index. With the formula given in "Details", .a = .b = 1 gives the Jaccard index; the default values (.a = .b = 0.5) give the Sorensen-Dice coefficient.
.do.unique: If T, then call unique on the first columns of the given data.frame or matrix.
.intersection.number: Number of intersected elements between two sets. See "Details" for more information.
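The three .do.norm modes can be sketched in base R as follows. This is only an illustration: normalise_sketch is a hypothetical helper, not a function of the package, and it assumes that "Laplace correction" means additive smoothing (adding a constant to every value before dividing by the total).

```r
# Hypothetical helper (not part of the package) illustrating the NA / T / F logic.
# Assumption: Laplace correction = add `laplace` to every value, then rescale.
normalise_sketch <- function(x, do.norm = NA, laplace = 0) {
  if (is.na(do.norm)) {
    # NA: normalise only if the input does not already sum to 1
    do.norm <- !isTRUE(all.equal(sum(x), 1))
  }
  if (do.norm) {
    x <- (x + laplace) / sum(x + laplace)
  }
  x
}

normalise_sketch(c(1, 3))               # counts are rescaled to sum to 1
normalise_sketch(c(0.5, 0.5))           # already a distribution, left as is
normalise_sketch(c(0, 4), laplace = 1)  # the zero gets a small non-zero weight
normalise_sketch(c(1, 3), do.norm = FALSE)  # returned untouched
```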
Value of similarity between the given sets or vectors.
For morisitas.index, input data are matrices or data.frames with two columns: the first column holds the elements (species or individuals), the second holds the number of those elements in a population.
Formulas:
Cosine similarity: cos(a, b) = a * b / (||a|| * ||b||)
Tversky index: S(X, Y) = |X and Y| / (|X and Y| + a*|X - Y| + b*|Y - X|)
Overlap coefficient: overlap(X, Y) = |X and Y| / min(|X|, |Y|)
Jaccard index: J(A, B) = |A and B| / |A U B|
For the Jaccard index, the user can provide |A and B| in .intersection.number; otherwise it will be computed using the base::intersect function. In that case .alpha and .beta are expected to be vectors of elements. If .intersection.number is provided, then .alpha and .beta are expected to be the numbers of elements in each set.
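As a rough illustration of the four formulas above, here is a base R sketch. The *_sketch names are hypothetical helpers written for this example; they are not the package's functions and may differ from them in edge cases such as duplicated elements or normalisation.

```r
# Hypothetical helpers mirroring the formulas; not the package's implementation.
cosine_sketch <- function(a, b) {
  # dot product divided by the product of the Euclidean norms
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

tversky_sketch <- function(x, y, a = 0.5, b = 0.5) {
  x <- unique(x); y <- unique(y)
  xy <- length(intersect(x, y))
  xy / (xy + a * length(setdiff(x, y)) + b * length(setdiff(y, x)))
}

overlap_sketch <- function(x, y) {
  x <- unique(x); y <- unique(y)
  length(intersect(x, y)) / min(length(x), length(y))
}

jaccard_sketch <- function(x, y) {
  length(intersect(x, y)) / length(union(x, y))
}

x <- 1:10
y <- 2:20
jaccard_sketch(x, y)                # 9 shared elements out of 20 distinct ones
tversky_sketch(x, y, a = 1, b = 1)  # same value as the Jaccard index
tversky_sketch(x, y)                # a = b = 0.5: 2 * 9 / (10 + 19)
overlap_sketch(x, y)                # 9 / min(10, 19)
cosine_sketch(c(1, 0), c(1, 1))     # cosine of a 45 degree angle, 1 / sqrt(2)
```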
The formula for Morisita's overlap index is too long to show conveniently here; see this webpage: http://en.wikipedia.org/wiki/Morisita
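For orientation, the textbook definition of Morisita's overlap index is C = 2 * sum(x_i * y_i) / ((Dx + Dy) * X * Y), where x_i and y_i are the counts of species i in the two samples, X and Y are the sample totals, and Dx = sum(x_i * (x_i - 1)) / (X * (X - 1)) is Simpson's index (Dy likewise). The sketch below implements that definition in base R. morisita_sketch is a hypothetical helper, and whether morisitas.index follows this exact form is an assumption, not something stated on this page.

```r
# Hypothetical helper based on the textbook definition; not the package's code.
# x and y are count vectors aligned over the same list of species.
morisita_sketch <- function(x, y) {
  X <- sum(x)
  Y <- sum(y)
  # Simpson's index for each sample
  Dx <- sum(x * (x - 1)) / (X * (X - 1))
  Dy <- sum(y * (y - 1)) / (Y * (Y - 1))
  2 * sum(x * y) / ((Dx + Dy) * X * Y)
}

# Two-column inputs (element, count), aligned by element before the computation.
a <- data.frame(species = c("s1", "s2"), count = c(10, 5))
b <- data.frame(species = c("s2", "s3"), count = c(5, 10))
m <- merge(a, b, by = "species", all = TRUE)  # columns: species, count.x, count.y
m[is.na(m)] <- 0                              # a species absent from a sample has count 0

morisita_sketch(m$count.x, m$count.y)
morisita_sketch(c(10, 0), c(0, 10))  # no shared species gives 0
```

Note that for small samples this index is not bounded by 1, so identical samples do not necessarily score exactly 1.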
# NOT RUN {
jaccard.index(1:10, 2:20)
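# Number of unique (CDR3 amino acid sequence, V gene) pairs in each sample.
# nrow() is used because length() of a data.frame returns its number of columns.
a <- nrow(unique(immdata[[1]][, c('CDR3.amino.acid.sequence', 'V.gene')]))
b <- nrow(unique(immdata[[2]][, c('CDR3.amino.acid.sequence', 'V.gene')]))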
# Next
jaccard.index(a, b, repOverlap(immdata[1:2], .seq = 'aa', .vgene = T))
# is equal to
repOverlap(immdata[1:2], 'jaccard', .seq = 'aa', .vgene = T)
# }