The input clustering is typically produced by bClust. The input dist.tbl is typically produced by bDist.
The concept of orthologs is difficult for prokaryotes, and this function finds orthologs in a
simplistic way. For a given cluster, with members from many genomes, there is exactly one ortholog from
each genome represented in that cluster. In cases where a genome has two or more members in the same
cluster, only one of these is an ortholog; the rest are paralogs.
Consider all sequences from the same genome belonging to the same cluster. The ortholog is defined as
the one having the smallest sum of distances to all other members of the same cluster, i.e. the one
closest to the ‘center’ of the cluster.
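
The rule above can be illustrated with a minimal sketch in Python. This is not the package's implementation. The function name flag_orthologs, the dictionary-based inputs and the toy distances are all hypothetical and chosen only for illustration. How ties are broken is not stated here, so the sketch simply assumes the first sequence encountered wins.

```python
from collections import defaultdict


def flag_orthologs(cluster, genome, dist):
    """Flag one ortholog per genome per cluster (illustrative sketch only).

    cluster: dict mapping sequence id -> cluster id
    genome:  dict mapping sequence id -> genome id
    dist:    dict mapping frozenset({seq_a, seq_b}) -> distance
    Returns a dict mapping sequence id -> True (ortholog) or False (paralog).
    """
    # Group sequences by cluster.
    members = defaultdict(list)
    for seq, cl in cluster.items():
        members[cl].append(seq)

    is_ortholog = {}
    for seqs in members.values():
        # Sum of distances from each sequence to all other cluster members.
        total = {
            s: sum(dist[frozenset((s, t))] for t in seqs if t != s)
            for s in seqs
        }
        # Within the cluster, group sequences by genome.
        by_genome = defaultdict(list)
        for s in seqs:
            by_genome[genome[s]].append(s)
        # The sequence with the smallest sum is the ortholog for its genome.
        # Assumption: on ties, min() keeps the first sequence encountered.
        for group in by_genome.values():
            best = min(group, key=lambda s: total[s])
            for s in group:
                is_ortholog[s] = (s == best)
    return is_ortholog


# Toy data: genome A has two members in cluster 1, so one becomes a paralog.
cluster = {"A_1": 1, "A_2": 1, "B_1": 1, "C_1": 1, "B_2": 2}
genome = {"A_1": "A", "A_2": "A", "B_1": "B", "C_1": "C", "B_2": "B"}
dist = {
    frozenset(("A_1", "A_2")): 0.3,
    frozenset(("A_1", "B_1")): 0.1,
    frozenset(("A_1", "C_1")): 0.2,
    frozenset(("A_2", "B_1")): 0.5,
    frozenset(("A_2", "C_1")): 0.6,
    frozenset(("B_1", "C_1")): 0.2,
}
result = flag_orthologs(cluster, genome, dist)
```

In this toy cluster, A_1 has a distance sum of 0.6 and A_2 a sum of 1.4, so A_1 is flagged as the ortholog for genome A and A_2 as a paralog. A sequence that is the only member from its genome in a cluster is always an ortholog.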
Note that the status as ortholog or paralog depends greatly on how clusters are defined in the first
place. If you allow large and diverse (and few) clusters, many sequences will be paralogs. If you define
tight and homogeneous (and many) clusters, almost all sequences will be orthologs.