
FindMyFriends

Fast alignment-free pangenome creation and exploration


FindMyFriends is an R package for doing pangenomic analyses on microbial genomes. It is released as part of the Bioconductor project and can be installed with the biocLite() function:

source("https://bioconductor.org/biocLite.R")
biocLite("FindMyFriends")

For the absolute latest version, install directly from GitHub:

if(!require(devtools)) {
  install.packages('devtools')
  library(devtools)
}
install_github('thomasp85/FindMyFriends')

But what is this really about?

In comparative microbial genomics, a pangenome is defined as a grouping of genes across genomes based on some sort of similarity. This similarity measure is not set in stone, but it is often derived from BLASTing each pair of genes against each other. This is a bad idea for several reasons: comparing all against all makes computational time scale horribly as the number of genes increases, BLAST is in general really slow, and sequence similarity alone cannot distinguish orthologous genes from paralogous ones. The last point has been addressed by recent tools such as PanOCT and Roary, but the first two still stand (though Roary does something clever to make it less of an issue).
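To make the scaling problem concrete, here is a small sketch that counts the unordered gene pairs an all-against-all comparison has to evaluate. The gene counts are made up for illustration and are not taken from any benchmark.

```r
# Number of unordered gene pairs in an all-against-all comparison
n_pairs <- function(n_genes) n_genes * (n_genes - 1) / 2

# Hypothetical dataset sizes, chosen only to show the growth rate
n_genes <- c(1e3, 1e4, 1e5, 1e6)

# A tenfold increase in genes gives roughly a hundredfold increase in pairs
data.frame(genes = n_genes, comparisons = n_pairs(n_genes))
```

The quadratic growth is why any approach that avoids comparing every pair directly pays off quickly as more genomes are added.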

Enter FindMyFriends...

So this is just another algorithm?

It is also another algorithm. But more importantly, it is a framework for conducting pangenome analysis that is completely agnostic to how you've derived your pangenome in the first place. FindMyFriends defines an extensible set of classes for handling pangenome data in a transparent way, and plugs directly into the vast array of genomic tools offered by Bioconductor.

Okay, back to the algorithms. FindMyFriends works by using CD-Hit to create a very coarse grouping of the genes in your dataset, and then refining this grouping in a second pass using additional similarity measures. This is in contrast to Roary, which uses CD-Hit only to group the most similar genes together prior to running BLAST. The second pass in FindMyFriends is where all the magic happens. The genes in each large group are compared by sequence similarity (using kmer cosine similarity), sequence length, genome membership and neighborhood similarity. Based on these comparisons a graph is created for each group, with edges defining similarity above a certain threshold between genes. From this graph, cliques are gradually extracted in a way that ensures the highest quality cliques are extracted first. These cliques define the final grouping of genes. Because they are cliques, the user can be sure that all members of the resulting gene groups share a defined similarity with each other, and that no gene can be grouped with others solely based on a high similarity to one member.
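The two ideas in the second pass, kmer cosine similarity and the clique requirement, can be sketched in a few lines of base R. This is an illustration only and not the package's implementation: the toy sequences, the kmer size of 3, and the 0.5 threshold are all assumptions, and sequence length, genome membership and neighborhood similarity are left out.

```r
# Split a sequence into its overlapping k-mers
kmers_of <- function(seq, k) {
  starts <- seq_len(nchar(seq) - k + 1)
  substring(seq, starts, starts + k - 1)
}

# Cosine similarity between the k-mer count vectors of two sequences
kmer_cosine <- function(a, b, k = 3) {
  ka <- kmers_of(a, k)
  kb <- kmers_of(b, k)
  vocab <- union(ka, kb)
  va <- tabulate(match(ka, vocab), nbins = length(vocab))
  vb <- tabulate(match(kb, vocab), nbins = length(vocab))
  sum(va * vb) / sqrt(sum(va^2) * sum(vb^2))
}

# Invented toy protein sequences: g1 and g2 differ by one residue
genes <- c(g1 = "MKTAYIAKQRQISFVK",
           g2 = "MKTAYIAKQRQLSFVK",
           g3 = "MSLNPEDVVRAWWGHT")

# Pairwise similarity matrix
idx <- seq_along(genes)
sim <- outer(idx, idx,
             Vectorize(function(i, j) kmer_cosine(genes[i], genes[j])))
dimnames(sim) <- list(names(genes), names(genes))

# Graph as an adjacency matrix: edge when similarity passes the threshold
# (0.5 is an arbitrary value for this sketch), no self-loops
adj <- sim >= 0.5 & !diag(length(genes))

# A set of genes is a clique when every pair within it is connected
is_clique <- function(adj, members) {
  sub <- adj[members, members, drop = FALSE]
  all(sub[upper.tri(sub)])
}

is_clique(adj, c("g1", "g2"))        # g1 and g2 are similar to each other
is_clique(adj, c("g1", "g2", "g3"))  # g3 is not connected to the others
```

The clique check is what rules out chaining: a gene joins a group only if it is connected to every existing member, not just to one of them.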

That sounds like a lot of work

Well, high-quality results are more important than speed! But this is one of the rare cases where you can have your cake and eat it too. FindMyFriends is, by a large margin, the fastest algorithm out there.

FindMyFriends scales to thousands of genomes and can handle large diversity (i.e. it is not restricted to the species level). As an example, a pangenome based on ~1200 strains from the order Lactobacillales (lactic acid bacteria) was created in around 8 hours on a c3.8xlarge AWS instance using a single core.

How do I use it then?

Because FindMyFriends is a framework, there are many things you can do and many different ways to do them. The following is the recommended approach to calculating a pangenome:

library(FindMyFriends)

# We expect here that your genomes are stored in amino acid fasta files in the
# working directory.
genomes <- list.files(pattern = '\\.fasta$')

# First we create our pangenome object
pg <- pangenome(genomes, translated = TRUE, geneLocation = 'prodigal')

# Then we make the initial grouping
pg <- cdhitGrouping(pg)

# And lastly we refine the groups
pg <- neighborhoodSplit(pg)

Please see the vignette for more information on the different steps, as well as examples of what you can do with your data once you're done grouping your genes.
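One detail of the file listing step is worth checking before building the pangenome object: list.files() interprets its pattern argument as a regular expression rather than a shell glob, so an unescaped dot matches any character and the pattern may match anywhere in the file name. The sketch below uses hypothetical file names and grepl() to compare a loose and a strict pattern without touching the file system.

```r
# Hypothetical file names, used only to show how the two patterns differ
files <- c("strainA.fasta", "strainB.fasta",
           "notes_fasta.txt", "strainC.fasta.bak")

# Loose pattern: '.' matches any character, and the match can occur anywhere
loose <- grepl(".fasta", files)

# Strict pattern: literal dot, anchored to the end of the file name
strict <- grepl("\\.fasta$", files)

data.frame(file = files, loose = loose, strict = strict)
```

With the strict pattern only files whose names end in ".fasta" are picked up, which avoids passing stray files to the pangenome constructor.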

What will happen with this in the future?

Following are some of the features that are being worked on/considered:

  • GFF3 and GBK file support
  • Improved plotStat and plotEvolution
  • Even more panchromosomal analysis tools
    1. plotPC
    2. Automatic frameshift detection
    3. Support for storing pc-derived grouping in object
  • Better parallelization
  • Switch to using data.table internally for better performance
  • Exporting functions
  • sqlite based class for very low memory interface

Version: 1.2.2

License: GPL (>=2)

Maintainer: Thomas Pedersen

Last Published: February 15th, 2017

Functions in FindMyFriends (1.2.2)

orgNames: Get and set the names of organisms in the pangenome
defaults: Access default values for a pgVirtual subclass object
cdhitGrouping: Gene grouping by preclustering with CD-HIT
pgSlimLoc-class: Class for pangenome data with no reference to genes
getNeighborhood: Extract a graph representation of a gene group neighborhood
addGenomes: Add new organisms to an existing pangenome
nGenes: Get the total number of genes in a pangenome
groupNames: Get and set the names of gene groups in the pangenome
hasGeneInfo: Checks for existence of gene location information
genes: Extract gene sequences from a pangenome
hasParalogueLinks: Checks whether linking of paralogues has been done
plotTree: Plot a dendrogram of the organisms in a pangenome
kmerSimilarity: Calculate a similarity matrix based on kmers
gpcGrouping: Guided Pairwise Comparison grouping of genes
.loadPgExample: Load an example pangenome
nGeneGroups: Get the number of gene groups in a pangenome
plotEvolution: Plot the evolution in gene groups
nOrganisms: Get the number of organisms represented in a pangenome
pgInMem-class: FindMyFriends standard base class for pangenomic data
pgFull-class: Class for in memory pangenome data
translated: Check the sequence type of the pangenome
readAnnot: Import annotation from an .annot file
orgInfo: Get and set information about organisms
pgLM-class: Class for reference based pangenome data
plotSimilarity: Create a heatplot with similarities between all organisms
pgFullLoc-class: Class for in memory pangenome data with location information
variableRegions: Detect regions of high variability in the panchromosome
pgVirtualLoc-class: Superclass for gene location aware pangenome
internal-mergePangenomes: Merge information from two pangenomes
pgInMemLoc-class: Superclass for gene location aware pangenome
orgStat: Calculate statistics about each organism
groupInfo: Get and set information about gene group
reportGroupChanges: Reports the change in grouping
plotStat: Plot (very) basic statistics on the pangenome
seqToGeneGroup: Get gene-to-genegroup relationship
getRep: Get a representative sequence for each gene group
FindMyFriends-package: FindMyFriends: Comparative microbial genomics in R
removeGene: Remove genes from a pangenome
pgVirtual-class: Base class for pangenomic data
geneNames: Get and set the names of the genes in the pangenome
.fillDefaults: Assign object defaults to missing values
hasGeneGroups: Check whether gene groups are defined
internal-groupGenes: Add gene grouping to pangenome
neighborhoodSplit: Split gene groups by neighborhood synteny
geneLocation: Get gene location for all genes
plotGroup: Plot the similarities of genes within a group
collapseParalogues: Merge paralogue gene groups into new gene groups
addGroupInfo: Safely add group info
groupStat: Calculate statistics about each gene group
graphGrouping: Use igraph to create gene grouping from a similarity matrix
internal-metadata: Add metadata to the pangenome
geneWidth: Get the sequence length of each gene
manualGrouping: Define gene grouping manually
pgLMLoc-class: Class for reference based pangenome data with location information
pgSlim-class: Class for pangenome data with no reference to genes
pcGraph: Calculate the panchromosome graph
plotNeighborhood: Plot the neighborhood of a gene group
addOrgInfo: Safely add organisms info
kmerLink: Link gene groups by homology
kmerSplit: Split gene groups based on similarity
pangenome: Construct a pangenome from fasta files
pgMatrix: Get the pangenome matrix
seqToOrg: Get gene-to-organism relationship