
LOLA: Genomic Locus Overlap Enrichment Analysis


LOLA is an R package providing functions for testing overlap of sets of genomic regions with public and custom databases. You can think of it as testing your bed file (genomic regions of interest) against a database of other bed files (regions from various previous studies) to look for enrichment of overlaps. This enables you to draw connections between newly generated data and the growing public databases, leading to new hypotheses and annotation sharing.

This README provides a package overview, motivation, and installation instructions. For detailed documentation of functions and additional examples, please see the R documentation.


Installing LOLA

The release version of LOLA can be installed directly from Bioconductor:

source("http://bioconductor.org/biocLite.R")
biocLite("LOLA")

To install the development version directly from github, make sure you have GenomicRanges (bioconductor package) installed, then install LOLA with devtools:

source("http://bioconductor.org/biocLite.R")
biocLite("GenomicRanges")
devtools::install_github("sheffien/LOLA")

Or, clone the repo and install from there:

install.packages("path/to/LOLA", repos=NULL)

Running LOLA

For examples and workflows, please check out the R vignettes included with the package to get you started:
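As a quick sketch of a typical analysis (the database path and example dataset names below follow the package vignette; adjust them if they differ in your installation):

```r
library("LOLA")

# Load the small example database shipped with the package.
dbPath <- system.file("extdata", "hg19", package = "LOLA")
regionDB <- loadRegionDB(dbPath)

# Example query regions and universe included with the package.
data("sample_input", package = "LOLA")     # provides userSets
data("sample_universe", package = "LOLA")  # provides userUniverse

# Test each user set against every region set in the database.
locResults <- runLOLA(userSets, userUniverse, regionDB, cores = 1)
head(locResults)
```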


LOLA Core Database

You can download a core region set database (regionDB). There are two download options: pre-cached .RData files, which LOLA can load in about 30 seconds (this requires the simpleCache R package); or the complete database, which additionally includes the raw text region files and which LOLA can load and cache in about 30 minutes. LOLA Core currently contains region sets for hg19/hg38 and mm10.

In addition to the LOLA Core database, we also maintain a second database, LOLA Extended, which contains additional region sets that are not as well curated as those in the Core database (detailed contents are listed below).

The latest LOLA Core and Extended databases can be downloaded here:

I recommend using the cached version, unless you need the raw files for something else. To do this, you'll need to grab simpleCache (which you may find useful for other projects, too), also installable with devtools:

devtools::install_github("sheffien/simpleCache")

Current contents of LOLA core:

Current contents of LOLA Extended:

  • hg19/hg38
    1. Roadmap epigenomics regions
    2. JASPAR motif matches

We're actively adding new collections, so stay tuned. Please contribute! LOLA Core is just the beginning: you can add your own region sets to test enrichment with whatever you like. Here's how to build a custom database:


Building a custom database

LOLA can read your custom region sets the same way it reads LOLA Core. Check out the raw LOLA Core database for an example of how to organize your own custom database. Your custom database is a bunch of genomic regions, organized first into region sets, and then into collections of region sets.

A bit of terminology:

  • Region set: several regions with some shared biological annotation, like a ChIP-seq experiment, represented by a bed file.
  • Collection: a named group of bed files

Start by creating collections of bed files: a collection is just a folder with a bunch of bed files in it. For example, you may find a paper that produced 100 data sets you're interested in testing for overlap with your data. To make a collection for the paper, create a folder, name it after the paper, and then put the 100 bed files into a subfolder called regions. Drop this collection into a parent database folder (perhaps hg19) that holds all your collections, and you're good to go!
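As a sketch, that folder layout can be created from R (all folder and file names here are illustrative, not part of the LOLA specification):

```r
# Scaffold a database folder with one collection for the paper.
dbDir <- file.path(tempdir(), "regionDB", "hg19")  # parent database folder
collDir <- file.path(dbDir, "ziller2014")          # one collection per paper
dir.create(file.path(collDir, "regions"), recursive = TRUE)

# Copy the paper's bed files into the regions/ subfolder, e.g.:
# file.copy(list.files("path/to/downloaded/beds", full.names = TRUE),
#           file.path(collDir, "regions"))
```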

If you find yourself creating lots of custom collections, you should consider sharing them to improve the LOLA Core database! I'm always looking for additional datasets to add.

Basic minimal requirements for a collection

A collection is a folder that contains the following items:

  1. regions/ subfolder with bed-like (chr,start,stop) files inside (REQUIRED)
  2. collection.txt file describing the collection (RECOMMENDED)
  3. index.txt file describing the regions (RECOMMENDED)
  4. Scripts or descriptions on how to reproduce the collection (OPTIONAL)

Guidelines for collections

  • All region sets within the collection folder should be in a subfolder named regions.

  • For convenience and efficiency, aim for collections between 50 and 2000 region sets; around 250 is ideal. The software can handle fewer if you have to, but try lumping small collections together logically, if possible. It will make it easier to organize things in the future. If you lump different sources together, make sure to annotate them with an appropriate column (see below).

  • Name your collection folder something short and informative. The name of the first author of the paper is good, if the collection is completely or mostly derived from a single paper. Otherwise, something general describing all the files.

  • Name your bed files something short and informative. If you provide no additional annotation information (see below), this (along with the collection name) will be the only way to identify the region set. No need to put the collection name into the bed name.

Annotating collections

You should annotate your collections by putting a file named collection.txt into each collection folder. This file should be a 2-line TSV file with a header line including these columns:

  • collector (your name)
  • date (time you produced the collection)
  • source (paper or website where you got the data)
  • description (free form field for details)

Example file:

collector	date	source	description
John Doe	2015-01-03	Ziller et al. (2014)	Methylation data downloaded from the Ziller paper, files renamed and curated manually.

Annotating region sets

You should annotate your region sets by putting a file named index.txt into each collection folder. This is not required, but suggested. This file should be a TSV file with a header line including at least a column called filename, which points to files in that collection folder. You can then add additional annotation columns; LOLA will recognize and use these column names:

  • filename (must match files in the collection exactly)
  • description
  • cellType
  • tissue
  • antibody (for ChIP experiments)
  • treatment
  • dataSource (for publication author, database, etc.)

Any other column names will be ignored, so add whatever else you like. You can also feel free to annotate as many or as few columns, and rows, as you wish. LOLA will simply use as much annotation information as you give it, defaulting to identifying a sample with only the file name if you provide nothing else. So, for example, an index file may look like this:

filename	cellType	antibody
regionset1.bed	K562	GATA2
regionset2.bed	K562	CTCF

These collection.txt and index.txt annotation files are put inside the collection folder so that a collection is a self-contained entity that can be easily moved.
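As a sketch, both annotation files can be written from R with write.table, using the example values above (the target folder here is a temporary stand-in for a real collection folder):

```r
collDir <- tempdir()  # stand-in for your collection folder

# collection.txt: a header line plus one line describing the collection.
collectionInfo <- data.frame(
  collector   = "John Doe",
  date        = "2015-01-03",
  source      = "Ziller et al. (2014)",
  description = "Methylation data downloaded from the Ziller paper.")
write.table(collectionInfo, file.path(collDir, "collection.txt"),
            sep = "\t", quote = FALSE, row.names = FALSE)

# index.txt: one row per region set; the filename column is required.
indexInfo <- data.frame(
  filename = c("regionset1.bed", "regionset2.bed"),
  cellType = "K562",
  antibody = c("GATA2", "CTCF"))
write.table(indexInfo, file.path(collDir, "index.txt"),
            sep = "\t", quote = FALSE, row.names = FALSE)
```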

Example custom database

Your folder hierarchy looks something like this:

  • regionDB
    • hg19
      • collection1
        • collection.txt
        • index.txt
        • regions/
          • regionset1.bed
          • regionset2.bed
          • regionset3.bed
      • collection2
        • collection.txt
        • index.txt
        • regions/
          • regionset1.bed
          • regionset2.bed
          • regionset3.bed
      • collection3
        • collection.txt
        • regions/
          • bed files...

Then simply pass the regionDB/hg19 folder (the parent folder containing your collections) to loadRegionDB() and it will automatically read and annotate your region collections.
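A minimal call might look like the following sketch (the useCache argument is described in the package documentation; caching requires simpleCache, and the path is illustrative):

```r
library("LOLA")

# Load all collections under the hg19 database folder.
# useCache = TRUE stores parsed .RData caches so subsequent loads are
# much faster; set it to FALSE to force re-reading the text files.
regionDB <- loadRegionDB("regionDB/hg19", useCache = TRUE)
```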

Tips
  • Your region files really just need the first 3 columns to be chr, start, and end -- no need to follow exact bed specifications.

  • Your files don't have to end with .bed -- just make sure they are text files. Right now there's no gzip file reading, but this may change in the future.

  • You don't have to annotate each file in a collection in the same way, but it's helpful. Just put in whatever you have and LOLA will default to file name for files you don't annotate better.

  • You could create your initial index.txt file by executing ls > index.txt in a collection folder. Now, add a first line containing filename, open in spreadsheet software, and start annotating!

  • You could stick other annotation files in the parent collection folder if you want. LOLA will ignore them.

  • On first load of a collection, LOLA will automatically produce a file called sizes.txt containing the size of each set.

  • Make sure all files in a collection, and all collections in parent folder, use the same reference genome.

  • If you have a single file containing different collections (like a segmentation), you can use the splitFileIntoCollection() function to divide it into separate bed files so LOLA can understand it.
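For example (a hedged sketch: the file name and the argument name splitCol are assumptions; check ?splitFileIntoCollection for the exact interface):

```r
library("LOLA")

# Split a segmentation bed file, where column 4 labels the segment
# class, into one bed file per class for use as a LOLA collection.
splitFileIntoCollection("segmentation.bed", splitCol = 4)
```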


Version

1.2.2

License

GPL-3

Last Published

February 15th, 2017

Functions in LOLA (1.2.2)

readCollectionFiles

Given a database and a collection, this will create the region annotation data.table; either giving a generic table based on file names, or by reading in the annotation data.
readBed

Imports bed files and creates GRanges objects, using the fread() function from data.table.
readRegionSetAnnotation

Given a folder containing region collections in subfolders, this function will either read the annotation file if one exists, or create a generic annotation file.
readCollectionAnnotation

Read collection annotation
readRegionGRL

This function takes a region annotation object and reads in the regions, returning a GRangesList object of the regions.
getRegionSet

Grab a single region set from a database, specified by filename.
writeDataTableSplitByColumn

Given a data table and a factor variable to split on, efficiently divides the table and then writes the different splits to separate files, named with filePrepend and numbered according to split.
sampleGRL

Function to sample regions from a GRangesList object, in specified proportion
mergeRegionDBs

Given two regionDBs (lists returned from loadRegionDB()), this function will combine them into a single regionDB. This enables you to combine, for example, a LOLA Core database with a custom database in a single analysis.
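A hedged sketch of that combination (the database paths are illustrative):

```r
library("LOLA")

coreDB   <- loadRegionDB("LOLACore/hg19")     # downloaded LOLA Core
customDB <- loadRegionDB("regionDB/hg19")     # your custom collections
combined <- mergeRegionDBs(coreDB, customDB)  # pass combined to runLOLA()
```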
readCollection

Given a bunch of region set files, read in all those flat (bed) files and create a GRangesList object holding all the region sets. This function is used by readRegionGRL to process annotation objects.
redefineUserSets

This function will take the user sets, overlap them with the universe, and redefine the user sets as the set of regions in the user universe that overlap at least one region in the user sets. This makes for a more appropriate statistical enrichment comparison, as the user sets are then exactly the same regions found in the universe; otherwise, you can get some weird artifacts from the many-to-many relationship between user set regions and universe regions.
listToGRangesList

Converts a list of GRanges objects into a GRangesList; strips all metadata.
countOverlapsAnyRev

Just a reverser. Reverses the order of arguments and passes them untouched to countOverlapsAny -- so you can use it with lapply.
checkUniverseAppropriateness

Check universe appropriateness
cleanws

cleanws takes multi-line, code-formatted strings and reformats them as simple strings.
setSharedDataDir

Sets a global variable specifying the default data directory.
buildRestrictedUniverse

If you want to test for differential enrichment within your user sets, you can restrict the universe to only those regions that are covered in at least one of your sets. This function helps you build just such a restricted universe.
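A hedged sketch, using the example data shipped with the package (the dataset name follows the package vignette):

```r
library("LOLA")

# Restrict the universe to regions covered by at least one user set.
data("sample_input", package = "LOLA")  # provides userSets
restrictedUniverse <- buildRestrictedUniverse(userSets)
```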
userUniverse

A reduced GRanges object from the example regionDB database
userSets

An example set of regions, sampled from the example database.
write.tsv

Wrapper of write.table that provides defaults to write a simple .tsv file. Passes additional arguments to write.table
replaceFileExtension

This will change the string in filename to have a new extension
extractEnrichmentOverlaps

Given a single row from an enrichment table calculation, finds the set of overlaps between the user set and the test set. You can then use these, for example, to get sequences for those regions.
splitDataTable

Efficiently split a data.table by a column in the table
runLOLA

Enrichment Calculation
splitFileIntoCollection

This function will take a single large bed file that is annotated with a column grouping different sets of similar regions, and split it into separate files for use with the LOLA collection format.
LOLA

Provides functions for genome location overlap analysis.
nlist

Named list function.
lapplyAlias

Function to run lapply or mclapply, depending on the option set in getOption("mc.cores"), which can be set with setLapplyAlias().
setLapplyAlias

To make parallel processing possible but not required, LOLA uses an lapply alias that can point either at the base lapply (for no multicore) or at mclapply, setting the number of cores (which mclapply uses). With no argument given, it instead returns the number of CPUs currently selected.
listRegionSets

Lists the region sets for given collection(s) in a region database on disk.
loadRegionDB

Helper function to annotate and load a regionDB, a folder with subfolder collections of regions.
writeCombinedEnrichment

Function for writing output all at once: combinedResults is a table generated by locationEnrichment() or by rbinding category/location results. Writes all enrichments to a single file, and also writes the same data divided into groups based on userSets and databases, for convenience. Disable this with an option.