kebabsData: KeBABS Sequence Data

Description

The package contains two small sequence datasets that demonstrate the package functionality. TFBS is a subset of the EP300/CREBBP binding data provided with the publication Lee et al., 2011. The data is based on binding sites identified with ChIP-seq by Visel et al., 2009. Please note that, due to package size restrictions, only a small subset of the data used in Lee et al., 2011 is included in the package. The following variables are defined:

enhancerFB contains 259 DNA sequences of tissue-specific enhancers from embryonic day 11.5 mouse embryos and 241 negative sequences sampled from the mm9 genome.
yFB contains the associated labels.
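
A minimal sketch of loading and inspecting the TFBS data, assuming the Bioconductor package kebabs is installed. The counts mentioned in the comments are the ones stated in this help page.

```r
# Assumes the kebabs package (Bioconductor) is installed.
library(kebabs)

# Load the TFBS dataset; this defines enhancerFB and yFB in the workspace.
data(TFBS)

# Number of sequences (259 positive plus 241 negative according to this page)
# and the class of the sequence container.
length(enhancerFB)
class(enhancerFB)

# Tabulate the labels associated with the sequences.
table(yFB)
```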

CCoil is a set of heptad-annotated amino acid sequences of coiled coil proteins forming dimers or trimers, taken from the web site of the package PrOCoil by Mahrenholz et al., 2011. The data contains the sequences with heptad annotation, the oligomerization state and the group assignment for each sequence. The grouping was performed through single linkage clustering of sequence similarities based on pairwise ungapped alignment. The following variables are defined:

ccseq contains 477 heptad-annotated amino acid sequences with a minimum length of 8 and a maximum length of 123 AAs.
yCC contains the associated oligomerization state, "DIMER" or "TRIMER".
ccannot is a character vector with the heptad annotations for the sequences. Characters 'a' to 'f' represent specific positions within the coiled coil structure. The AA string set already contains the annotation as metadata, but for demonstration purposes it is also available as a separate data item.
ccgroups is a numeric vector containing the group numbers of the sequences.
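
The CCoil variables can be inspected in the same way. This is a sketch that assumes kebabs is installed; `annotationMetadata()` is used here on the assumption that it is the kebabs accessor for the annotation stored as metadata on the string set.

```r
# Assumes the kebabs package (Bioconductor) is installed.
library(kebabs)

# Load the CCoil dataset; this defines ccseq, yCC, ccannot and ccgroups.
data(CCoil)

# Number of sequences and their shortest and longest lengths.
length(ccseq)
range(width(ccseq))

# Oligomerization states of the sequences.
table(yCC)

# Heptad annotation as a separate character vector ...
head(ccannot, 3)

# ... and as metadata attached to the sequence set
# (accessor name assumed from the kebabs documentation).
head(annotationMetadata(ccseq), 3)

# Number of distinct groups produced by the clustering.
length(unique(ccgroups))
```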

Arguments

format

TFBS contains the 259 positive and 241 negative sequences as a DNAStringSet and the corresponding labels as a numeric vector with a value of 1 for positive and -1 for negative samples.

CCoil contains the 477 AA sequences as an AAStringSet and the corresponding labels as a factor. The heptad annotation is stored as a character vector and the group assignment as a numeric vector.

source

TFBS: http://www.beerlab.org/p300enhancer

CCoil: http://www.bioinf.jku.at/software/procoil/data.html

References

(Lee, 2011) -- D. Lee, R. Karchin and M. A. Beer. Discriminative prediction of mammalian enhancers from DNA sequence. Genome Research, 21(12):2167-2180, 2011.

(Visel, 2009) -- A. Visel, M. J. Blow, Z. Li, T. Zhang, J. A. Akiyama, A. Holt, I. Plajzer-Frick, M. Shoukry, C. Wright, F. Chen, V. Afzal, B. Ren, E. M. Rubin and L. A. Pennacchio. ChIP-seq accurately predicts tissue-specific activity of enhancers. Nature, 457(7231):854-858, 2009.

(Mahrenholz, 2011) -- C. Mahrenholz, I. Abfalter, U. Bodenhofer, R. Volkmer and S. Hochreiter. Complex networks govern coiled-coil oligomerizations - predicting and profiling by means of a machine learning approach.