gggenomes

A grammar of graphics for comparative genomics

gggenomes is a versatile graphics package for comparative genomics. It extends the popular R visualization package ggplot2 by adding dedicated plot functions for genes, syntenic regions, etc., and verbs to manipulate the plot, for example to quickly zoom in on gene neighborhoods.
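
As a rough illustration of such verbs, the sketch below applies `pick()` and `flip()` (both listed in the function index of this page) to the example data shipped with the package. The choice of bins (the first three, with the second one flipped) is arbitrary and only serves the illustration.

```r
library(gggenomes)

# Base plot from the bundled example data: genes drawn on their sequences.
p <- gggenomes(genes = emale_genes, seqs = emale_seqs) +
  geom_seq() +
  geom_bin_label() +
  geom_gene()

# pick() keeps (and orders) bins by position or name;
# flip() reverses the orientation of the selected bins.
# The selection below is arbitrary and chosen only for illustration.
p_sub <- p |>
  pick(1:3) |>
  flip(2)

p_sub
```

Because the verbs operate on the plot object itself, they can be chained with the pipe before or after layers are added.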

A realistic use case comparing six viral genomes

gggenomes makes it easy to combine data and annotations from different sources into one comprehensive and elegant plot. Here we compare the genomic architecture of 6 viral genomes initially described in Hackl et al.: Endogenous virophages populate the genomes of a marine heterotrophic flagellate.

library(gggenomes)

# to inspect the example data shipped with gggenomes
data(package="gggenomes")

gggenomes(
  genes = emale_genes, seqs = emale_seqs, links = emale_ava,
  feats = list(emale_tirs, ngaros=emale_ngaros, gc=emale_gc)) |> 
  add_sublinks(emale_prot_ava) |>
  sync() + # synchronize genome directions based on links
  geom_feat(position="identity", size=6) +
  geom_seq() +
  geom_link(data=links(2)) +
  geom_bin_label() +
  geom_gene(aes(fill=name)) +
  geom_gene_tag(aes(label=name), nudge_y=0.1, check_overlap = TRUE) +
  geom_feat(data=feats(ngaros), alpha=.3, size=10, position="identity") +
  geom_feat_note(aes(label="Ngaro-transposon"), data=feats(ngaros),
      nudge_y=.1, vjust=0) +
  geom_wiggle(aes(z=score, linetype="GC-content"), feats(gc),
      fill="lavenderblush4", position=position_nudge(y=-.2), height = .2) +
  scale_fill_brewer("Genes", palette="Dark2", na.value="cornsilk3")
  
ggsave("emales.png", width=8, height=4)

For a reproducible recipe that traces the full evolution of an earlier version of this plot, built with an older version of gggenomes, starting from a mere set of contigs and including the bioinformatics analysis workflow, have a look at "From a few sequences to a complex map in minutes".

Motivation & concept

Visualization is a cornerstone of both exploratory analysis and science communication. Bioinformatics workflows, unfortunately, tend to generate a plethora of data products, often in adventurous formats, making it quite difficult to integrate and co-visualize the results. Instead of trying to cater to all these different formats explicitly, gggenomes embraces the simple tidyverse-inspired credo:

  • Any data set can be transformed into one (or a few) tidy data tables
  • Any data set in a tidy data table can be easily and elegantly visualized

As a result gggenomes helps bridge the gap between data generation, visual exploration, interpretation and communication, thereby accelerating biological research.
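
To make the credo concrete, here is a minimal sketch with two hand-written tidy tables: one for sequences (`seq_id`, `length`) and one for genes (`seq_id`, `start`, `end`). The values are made up for illustration, and the object names `s0`, `g0` and `p` are arbitrary.

```r
library(gggenomes)

# A minimal seq table: one row per sequence (toy values).
s0 <- tibble::tibble(
  seq_id = c("a", "b"),
  length = c(600, 550)
)

# A minimal gene table: one row per gene, tied to a sequence by seq_id.
g0 <- tibble::tibble(
  seq_id = c("a", "a", "b"),
  start  = c(50, 350, 80),
  end    = c(250, 500, 450)
)

# The two tables are all that is needed for a basic plot.
p <- gggenomes(genes = g0, seqs = s0) +
  geom_seq() +
  geom_bin_label() +
  geom_gene()

p
```

Any upstream tool output that can be reshaped into tables like these can be plotted the same way.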

Under the hood, gggenomes uses a light-weight track system to accommodate a mix of related data sets, essentially implementing ggplot2 with multiple tidy tables instead of just one. The data in the different tables are tied together through a global genome layout that is automatically computed from the input and defines the positions of genomic sequences (chromosomes/contigs) and their associated features in the plot.
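
The track system can be explored with the bundled example data. The sketch below registers three tables as tracks and lists them with `track_info()`. The track name `tirs` is chosen here for illustration, and the exact columns printed by `track_info()` may differ between package versions.

```r
library(gggenomes)

# Three tidy tables become three tracks:
# a seq track, a gene track and one named feat track.
p <- gggenomes(
  genes = emale_genes,
  seqs  = emale_seqs,
  feats = list(tirs = emale_tirs)  # "tirs" is a name picked for this sketch
)

# List the tracks stored in the plot object.
track_info(p)

# Each geom reads from its own track; feats(tirs) selects
# the feat track by the name given above.
p_tracks <- p +
  geom_seq() +
  geom_gene() +
  geom_feat(data = feats(tirs))

p_tracks
```

All three layers share the same genome layout, so features from different tables line up on the same sequences without manual coordinate handling.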

Inspiration

gggenomes draws inspiration from some brilliant packages, in particular:

Installation

gggenomes is at this point in an alpha release state and only available as a development package from GitHub.

# if you don't have it
install.packages("devtools") 

# install gggenomes
devtools::install_github("thackl/gggenomes")

# optionally install ggtree to plot genomes next to trees
# https://bioconductor.org/packages/release/bioc/html/ggtree.html
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("ggtree")

Install

install.packages('gggenomes')

Monthly Downloads

421

Version

1.0.1

License

MIT + file LICENSE

Maintainer

Thomas Hackl

Last Published

August 30th, 2024

Functions in gggenomes (1.0.1)

drop_link_layout

Drop a link layout
drop_seq_layout

Drop a seq layout
emale_tirs

Terminal inverted repeats of 6 EMALE genomes
focus

Show features and regions of interest
flip

Flip bins and sequences
ex

Get path to gggenomes example files
geom_bin_label

Draw bin labels
emale_prot_ava

All-versus-all alignments of 6 EMALE proteomes
emale_genes

Gene annotations of 6 EMALE genomes (endogenous virophages)
flip_strand

Flip strand
emale_seqs

Sequence index of 6 EMALE genomes (endogenous virophages)
emale_ngaros

Integrated Ngaro retrotransposons of 6 EMALE genomes
geom_link

Draw links between genomes
geom_variant

Draw place of mutation
geom_gene

Draw gene models
geom_feat

Draw feats
geom_seq_label

Draw seq labels
geom_seq_break

Decorate truncated sequences
geom_gene_label

Draw feat/link labels
geom_feat_text

Add text to genes, features, etc.
geom_coverage

Draw wiggle ribbons or lines
introduce

Introduce non-existing columns
if_reverse

Vectorised if_else based on strandedness
gggenomes

Plot genomes, features and synteny maps
has_vars

Check if variables exist in object
in_range

Do numeric values fall into specified ranges?
get_seqs

Get/set the seqs track
is_reverse

Check whether strand is reverse
ggplot.gggenomes_layout

ggplot.default tries to fortify(data) and we don't want that here
geom_seq

Draw seqs
pick

Pick bins and seqs by name or position
layout_seqs

Layout sequences
layout

Re-layout a genome layout
layout_genomes

Layout genomes
read_alitv

Read AliTV .json file
read_bed

Read a BED file
feats

Use tracks inside and outside geom_* calls
position_variant

Plot types of mutations with different offsets
read_context

Read files in different contexts
qw

Create a vector from unquoted words.
position_strand

Stack features
read_blast

Read BLAST tab-separated output
require_vars

Require variables in an object
read_paf

Read a .paf file (minimap/minimap2).
scale_color_variant

Default colors and shapes for mutation types.
read_tracks

Read files in various standard formats (FASTA, GFF3, GBK, BED, BLAST, ...) into track tables
reexports

Objects exported from other packages
read_seq_len

Read sequence index
read_gff3

Read features from GFF3 (and with some limitations GFF2/GTF) files
scale_x_bp

X-scale for genomic data
read_vcf

Read a VCF file
read_gbk

Read genbank files
strand_chr

Convert strand to character
swap_query

Swap query and subject in blast-like feature tables
swap_if

Swap values of two columns based on a condition
split_by

Split by key preserving order
track_info

Basic info on tracks in a gggenomes object
shift

Shift bins left/right
set_class

Modify object class attributes
width

The width of a range
vars_track

Tidyselect track variables
unnest_exons

Unnest exons
track_ids

Named vector of track ids and types
write_gff3

Write a gff3 file from a tidy table
theme_gggenomes_clean

gggenomes default theme
strand_lgl

Convert strand to logical
strand_int

Convert strand to integer
as_sublinks

Compute a layout for links linking feats
combine_strands

Combine strands
add_feats

Add different types of tracks
check_strand

Check strand
add_seqs

Add seqs
drop_feat_layout

Drop feature layout
def_names

Default column names and types for defined formats
def_formats

Defined file formats and extensions
dim.gggenomes_layout

ggplot2::facet_null checks data with empty(df) using dim. This causes an error because dim(gggenome_layout) is undefined. Return dim of primary table instead
as_feats

Compute a layout for feat data
as_subfeats

Compute a layout for subfeat data
as_links

Compute a layout for link data
as_seqs

Compute a layout for sequence data
GeomFeatText

Geom for feature text
emale_cogs

Clusters of orthologs of 6 EMALE proteomes
drop_layout

Drop a genome layout
emale_ava

All-versus-all whole genome alignments of 6 EMALE genomes
emale_gc

Relative GC-content along 6 EMALE genomes