The 'pipeline' covers most of the required steps to convert chromatin aligned
reads information to piled-up enrichments scores (ie. WIG files).It makes use of the package functions to :
- Read and store in memory dataset(s) provided by the user
- Generate eventual subgroups of interest based on the insert size (in
case of paired-end experiment) or on the reads size, providing statistics
about the regions covered and the number of reads concerned
- Help to identify and remove artefactual enrichments
- Estimate in-silico the size of original DNA fragments (inserts, in
case of single-end experiments), or plotting the insert size distribution
otherwise
- Extension and piling of reads with several options to obtain a score
per genomic coordinates
- Writing results as WIG variable, WIG fixed-step, or GFF files
The flexible implementation allows users to specify a bunch of parameters values
for each experiment (using an advanced options system, see details). The
pipeline will then loop over the different values and provide results in
separate folders.
It also provides mechanisms to handle orphan reads (incomplete pairs) all along
the process and deal with reads that aligned in multiple locations (see specific
functions below).
The work can also be spread automatically on several processors/cores (at the
expense of memory) by taking advantage of the parallelization options offered by
'parallel'.
The pipeline (started by processPipeline) defines a workflow using the main
functions provided in the package with an additional layer dedicated to
automatization, graphics/QC, summary of experiments, and organization of
resulting files.
How to specify experiment files parameters 'INPUTFilesList'
It is possible to define a list of experiment that will be processed
sequentially.
This argument is a list of list. Each element of this list is freely named by
the user in order to recognize the experiment. These names will be used to name
the experiment in logs and results.
Each element contains a nested list with fixed names that must match the
followings : 'folderName', 'fileName', 'fileType', 'chrPrefix', 'chrSuffix',
'pairedEnds', 'midPoint'
- folderName
- is the full path to the folder containing the file
containing aligned reads to be loaded.
fileNameis the filename containing the aligned data, typically the
output from a program of read alignment.
fileTypedefines the format in which the data is stored in the file.
The pipelinehandle preferentially standard BAM files ('BAM' value), but can
also read several proprietary format by using the ShortRead library (see
readAligned function from ShortRead library for a complete list of
supported formats).
chrPrefixis a regular expression string defining the prefix
concatenated to each chromosome name. Typically this value is 'chr' but can
be different depending on how the reference genjome for alignment has been
created and named.
chrSuffixsee 'chrPrefix' argument. Typically, this is defined as an
empty string ''.
pairedEndsa logical value specifying is the experiment should be
considered as single or paired -end. In the latter case, only BAM files are
supported and the format has to be cin conformity with BAM policy, such as
bowtie output.
midPointa logical value. If TRUE a specific piling method will be
applied to the reads and the output files will have '_midpoint' suffix. See
below for details on midpoint method.
Graphics
Depending on the type of experiment, several graphics can be produced in the
output log folder.
In case of paired-end experiments, a plot will be produced to give information
about the distribution of inserts size.
For single-end experiments, if no manual elongation size is provided
(recommended), the elongation estimation module will produce a plot summarizing
the estimation for each chromosome.
If several subgroups (based on insert size or reads size) are asked, a global
distribution plot will be produced on which the concerned subpopulation is
highlighted.
The 'artefacts' module will also produce a summary sheet for each chromosome
with several graphics (for each strand separately). First, a distribution of
the piles (reads sharing the exact same coordinates) heights on the chromosome.
Second, a cumulative plot describing the proportion of reads in the total
population that are lost for several choices of threshold (see
getArtefactIndexes function). Finally, a summary of the number of reads removed
for different thresholds and the one choosen by the user (plus more information
about orphan reads in case of paired-ends experiments).
Finally if arguments 'annotationFilesGFF' and 'annotationGenomeFiles' are
specified and valid, a report will be generated to summarize the distribution
of the reads in the annotation files (for each range in case rangeSelection
argument is provided).
Experiments summary
During the processing, the pipeline provide a lot of information about the
experiment and the current operations.
All these informations are saved and stored in the log files. Among these
information, one will find the number of reads in the experiment (aligned and
unaligned if provided in the original file), a summary of the reads size,
alignment scores, and all parameters that were asked.
Results File Organization
Because the pipeline allow to specify a lot of different parameters, a
hierarchical organization of resulting files is produced.
First, the 'resultSubFolder' will be produced in the folder where the input
file is after appending a suffix (_PE for paired-ends experiments and _SE for
single-end ones).
In the latter one, another subfolder is created for each subgroup (based on
insert size or reads size). all final results (merged WIGs and GFFs) will be
generated in these folders. A subfolder named as specified in
'reportFilesSubFolder' will also be created and filled with figures and log
files. Another subfolder will be created to store temporary output files (see
keepTemp parameter).
Complex parameters
The pipeline automatization lies on a specific mechanism for some parameters
('incrArtefactThrEvery', 'binSize', 'elongationSize', 'rangeSelection',
'annotationFilesGFF').
First, these arguments can be used as single values. In this case, the same
value will be used for all experiments described in 'INPUTFilesList' argument.
Second, one can specify a different value for each experiment defined in
'INPUTFilesList' argument. For this concern, the user has to create a named
list with as many elements as number of experiments, and list element names
must correspond to names in 'INPUTFilesList' argument.
Finally, it is possible to give several values for each or all experiments. In
this case, the pipeline will loop over these values, recompute what has to be
recomputed, and write resulting files with differentiated filenames.
A specific function is dedicated for checking the consistency of parameters
provided for one or all experiments, and raise an error in case of conflictual
or invalid values.
Rehabilitation steps
For paired-ends experiments, the pipeline tries to save the 'half-pairs'
(called 'orphans'). In brief, reads from incomplete pairs (mate missing because
of misalignment) are separated from the other ones.
Later, when the average insert size has been estimated, it is used for
rehabilitation of incomplete pairs by a step of elongation prior to piling of
these reads. These pileups subcategories are then merged with the other reads
to produce the final result files.
Because pairs can also be broken when eliminating PCR artefacts, a second
category called 'orphansFromArtefacts' are treated the same way, producing
another temporary output that is merged with the other reads.
The user is asked to consider or ignore these reads with the argument
'rehabilitationStep'. Finally the user can also look at the separate categories
result files (WIGs and GFFs) prior merging by using the keepTemp parameter.
Midpoint
the midpoint piling strategy has mainly been thought for MNase experiments
which allows to track for nucleosome positionning. These experiments assume
that each sequenced DNA fragment originate from DNA wrapped around nucleosome
and that non-protected DNA is digested. Thus extremities of fragments are
supposed to represent the boudaries of nucleosomes.
Instead of a classic elongation and piling (which represents the nucleosome
density), the midpoint strategy will only consider the extremities or the exact
center of each DNA fragment (depending on 'elongation parameter', see
package-attached pdf document for a summary), giving a more accurate view of
nucleosome positionning as opposed to the nucleosome density typically
observed.
Range selection
When using the argument 'rangeSelection' with either a numeric value or an
IRanges object, the pipeline will consider separately the reads (or pair of
reads) with different size.
If a single number 'n' is specified, the program will attempt to cut the size
distribution in 'n' intervals with an equal number of reads/pairs in each
(quantile function is used), and process each group separately. When there is
more than one group or if the group does not include all the reads/pairs, the
pipeline will automatically add another group containing all reads (referred as
'allReads').
When the user defines IRanges objects for 'rangeSelection' argument, the
pipeline will process separately each group of size defined by the intervals in
the object. NOTE : the lower value of each interval is EXCLUDED from the group
selection, whereas the upper one is INCLUDED. In this case, only the groups
defines by user are processed (no 'allReads' is added).
Results from each group will be written in separate folders identified by the
concerned range of size.
Compatibility mode
Prior to version 0.99.21, WIG files generated by the pipeline included a track
line repeated before each chromosome description. This is considered as a bug
since it does not fit with the standard WIG file description (UCSC). It has been
corrected in newer versions but users can revert to previous behaviour using
'compatibilityOutputWIG' (in case they are using tools relying on this
non-standard format). A warning will be reported and the use of this option is
strongly discouraged as it will prevent other users from using your data or any
submission to public databases (Gene Expression Omnibus for instance).
------------------------------------------
Multireads
The functions dedicated to multiread produce output that can then be injected
in the main pipeline. This information will be integrated in the signal to
produce the final results (merged WIG/GFF files) taking in account the
multireads.
Step 1 - Align the reads
------------------------
To align the reads, you will run Bowtie command with the options requested to
get read aligned in several location reported.
In order to get an output file not too big (since multiread alignement can
bring very large files), the "--concise" option of Bowtie must be used. Other
options are standard.
For instance, if you want to keep the read aligning on the mm9 genomes 100 times
or less, authorizing 2 mismatches and run on 8 processors
bowtie -q -v 2 -a -m 100 -p 8 --concise mm9 input_file.fastq > output_file.bow
Step 2 - Prepare the genome reference file
------------------------------------------
The scripts used in the next steps requires a reference file containing
information on the chosen genome.
This reference file is a simple text file with ".ref" extension that must
contain:
- first line the number of chromosomes in the considered genome
- second line, the list of chromosomes names (separated by spaces)
- third line, the list of chromosomes sizes (separated by spaces)
You can use the 'bowtie-inspect' command on the reference genome to get the
information requested to build this reference file.
Step 3 - Dispatching signal along multiread positions
-----------------------------------------------------
You have now to decide how each multiread read score will be dispatched along
its mutiple positions. Two options are offered here:
* Uniform method: The signal is equally dispatched along the read position i.e.
for instance, if a read has 10 aligned positions, each position will receive a
score of 1/10. To apply this method, execute the R command:
multiread_UniformDispatch(alignedFile, outputFolder, referenceFile,
incrArtefactThrEvery, verbosity)
where :
- alignedFile
- the .bow file outputed by bowtie in step 3 (or in step 1
if you decided not to remove artifacts)
- outputFolder
- the folder where you want to see the output file fall
in.
referenceFilethe .ref file created in step 2
incrArtefactThrEverythe ratio used to detect artifact. Default value
is NA.
verbositythe verbose level (0 = no message, 1 = trace level).
Default value is 0.
This command will output a file named with the alignedFile name suffixed by
"_uniformDispatch.txt".
* Chung et al. method: The score on read position is allocated according to an
algorithm developped by Chung et al. (see "Discovering Transcription Factor
Binding Sites of Genomes with Multi-Read Analysis of ChIP-seq Data" (2011) PLoS
Computational Biology). To apply this method, execute the R command:
multiread_CSEMDispatch( alignedFile, outputFolder, referenceFile, window_size,
iteration_number, incrArtefactThrEvery, verbosity)
where :
- alignedFile
- the .bow file outputed by bowtie in step 3 (or in step 1
if you decided not to remove artifacts)
outputFolderthe folder where you want to see the output file fall
in.
referenceFilethe .ref file created in step 2
window_sizethe size of the window used by the algorithm (see
algorithm details). Suggested value is 101.
iteration_numberthe number of iteration executed by the algorithm.
Suggested value is 200.
incrArtefactThrEverythe ratio used to detect artifact. Default value
is NA.
verbositythe verbose level (0 = no message, 1 = trace level).
Default value is 0.
This command will output a file named with the alignedFile name suffixed by
"_csemDispatch.txt".
Important note:
For both commands, if the parameter 'incrArtefactThrEvery' is set to a strictly
positive value, the input file is first passed to a script removing the
'artifacts'. This script looks at the reads aligned at the exact same position.
If the number of such reads exceed a limit, these reads are considered as due to
experiment artifact.
In that case, only one read is kept and the rest is removed.
The limit defining artefact is controled by the parameter
'incrArtefactThrEvery' such as:
limit = Total number of reads / incrArtefactThrEvery
Step 4 - Launch Pasha for multiread
----------------------------------
One you have obtained a file containing the information of the dispatched
scores over the read positions, you can use it in Pasha by passing the obtained
file in the variable 'multiLocFilesList'. See details above on
'multiLocFilesList' parameter.
Step 5 - Analyze repeat distribution
------------------------------------
Once Pasha has been launched, you have obtained WIG files. Running the
following command will permit you to obtain information and statistics about
the dispertion of your signal on annotations identified as repeated regions in
the genome.
Use the following command to launch the script:
WigRepeatAnalyzer(filename, inputFolder, outputFolder, repeatMaskerFilePath,
isRegex)
where:
- filename
- the file name of the wig file (fixed step WIG).
inputFolderthe path to the wig file.
outputFolderthe path to the folder where analysis results must be
stored.
repeatMaskerFilePaththe path to the file containing the repeat
annotations (Repeat Masker file).
isRegexIf TRUE, the filename parameter is interpreted as a regular
expression (LC_SYNTAX) and the script will search for a unique file
corresponding to the provided regular expression. If no or several file are
found, the scripts ends with error.
The script compute the coverage of each repeat class and family (i.e. the
percentage of positions falling into each annotations) and the weight of each
class and family (i.e. the percentage of score falling into each annotations).
All results are provided as barplots figures and text files.