extdata: Extra Data

Description

The files in the subdirectories of extdata support the examples in the package documentation and vignettes.

Arguments

Files in thermo contain supplementary thermodynamic data and group contribution matrices:

OBIGT-2.csv contains supplementary thermodynamic data in the same format as the primary database in data/OBIGT.csv. For some entries that also appear in the primary database, this file gives data taken from different literature sources. The default action of add.obigt is to add the contents of this file to CHNOSZ's working database in thermo$obigt. See diagram and the code of anim.TCA for examples that use this file.
obigt_check.csv contains the results of running check.obigt to check the internal consistency of entries in the primary and supplementary databases.
groups_big.csv is a group contribution matrix with five structural groups on the columns ([-CH3], [-CH2-], [-CH2OH], [-CO-], [-COOH]) and 24 compounds on the rows (alkanes, alcohols, ketones, acids, multiply substituted compounds).
groups_small.csv is a group contribution matrix with twelve bond-specific groups on the columns and 25 compounds on the rows (as above, plus isocitrate). Group identity and naming conventions are adapted from Benson and Buss (1958) and Domalski and Hearing (1993). See the xadditivity vignette for examples that use this file and groups_big.csv.

Details

Files in abundance contain protein abundance data:

AA03.csv has reference abundances for 71 proteins taken from Fig. 3 of Anderson and Anderson, 2002 (as corrected in Anderson and Anderson, 2003). The columns with data taken from these sources are type (hemoglobin, plasma, tissue, or interleukin), description (name used in the original figure), and log10(pg/ml) (upper limit of the abundance interval shown in Anderson and Anderson, 2003, as log10 of the concentration in pg/ml). The additional columns are derived from a search of the SWISS-PROT/UniProtKB database based on the descriptions of the proteins: name (nominal UniProtKB name for this protein), name2 (other UniProtKB name(s) that could apply to the protein), and note (notes based on searching for a protein of this description). The amino acid compositions of all proteins whose names are not NA are included in thermo$protein. The abbrv column for the proteins contains the description given by Anderson and Anderson, 2003, followed by the UniProtKB accession number in parentheses. Annotated initiator methionines (e.g. for ferritin, myoglobin, ENOG), signal peptides and propeptides were removed from the proteins (except where they are not annotated in UniProtKB: IGHG1, IGHA1, IGHD, MBP). In cases where multiple isoforms are present in UniProtKB (e.g. Albumin), only the first isoform was taken. In the case of C4 Complement (CO4A) and C5 Complement (CO5), only the amino acid compositions of the alpha chains are listed. In the case of the protein described as iC3b, the amino acid sequence is taken to be that of Complement C3c alpha' chain fragment 1 from CO3, and is given the name CO3.C3c. The non-membrane (soluble) chains of TNF-binding protein (TNR1A) and TNF-alpha (TNFA) were used. Rantes, MIP-1 beta and MIP-1 alpha were taken from C-C motif chemokines (CCL5, CCL4 and CCL3, respectively). C-peptide was taken from the corresponding annotation for insulin and is named INS.C here. See the protactiv vignette for an example that uses this file.
ISR+08.csv has columns excerpted from Additional File 2 of Ishihama et al. (2008) for protein abundances in E. coli cytosol. The columns in this file are ID (Swiss-Prot ID), accession (Swiss-Prot accession), emPAI (exponentially modified protein abundance index), copynumber (emPAI-derived copy number/cell), GRAVY (Kyte-Doolittle), FunCat (FunCat class description), PSORT (PSORT localisation), and ribosomal (yes/no). See get.expr and the protactiv vignette for examples that use this file.
yeastgfp.csv.xz has 28 columns; the names of the first five are yORF, gene name, GFP tagged?, GFP visualized?, and abundance. The remaining columns correspond to the 23 subcellular localizations considered in the YeastGFP project (Huh et al., 2003 and Ghaemmaghami et al., 2003) and hold values of either T or F for each protein. yeastgfp.csv was downloaded on 2007-02-01 from http://yeastgfp.ucsf.edu using the Advanced Search, setting options to download the entire dataset and to include the localization table and abundance, sorted by ORF number. See yeastgfp for examples that use this file.

Files in bison contain BLAST results and taxonomic information for a metagenome:

bisonN_vs_refseq47.blast.xz, bisonS_vs_refseq47.blast.xz, bisonR_vs_refseq47.blast.xz, bisonQ_vs_refseq47.blast.xz and bisonP_vs_refseq47.blast.xz are partial tabular BLAST results for proteins in the Bison Pool Environmental Genome. Predicted protein sequences were downloaded from the Joint Genome Institute's IMG/M system on 2009-05-13. The target database for the searches was constructed from microbial protein sequences in the National Center for Biotechnology Information (NCBI) RefSeq database version 47, representing 3266 microbial genomes. The blastall command was used with the default setting for the E value cutoff (10.0) and options to make a tabular output file consisting of the top 20 hits for each query sequence. The function read.blast was used to extract only those hits with E values less than or equal to 1e-5 and with similarity greater than 30 percent, and to keep only the first hit for each query sequence. The function write.blast was used to save partial BLAST files (only selected columns). The files provided with CHNOSZ contain the first 5,000 hits for each sampling site at Bison Pool, representing between about 7 and 15 percent of the first BLAST hits after similarity and E value filtering.
gi.taxid.txt.xz is a table that lists the sequence identifiers (gi numbers) that appear in the example BLAST files (see above), together with the corresponding taxon ids used in the NCBI databases. This file was extracted from the complete gi_taxid_prot.dmp.gz downloaded from ftp://ftp.ncbi.nih.gov/pub/taxonomy/ on 2011-06-16. A small number (about 0.2 percent) of the gi numbers appearing in the BLAST results were not found in gi_taxid_prot.dmp.gz and are therefore also excluded from gi.taxid.txt. See id.blast for an example that uses this file and the BLAST files described above.
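
The filtering steps described for the BLAST files (E value at most 1e-5, similarity above 30 percent, first remaining hit per query) can be sketched in Python. This is an illustration only, not the code of read.blast. It assumes the standard 12-column tabular BLAST layout and treats the third column (percent identity) as the similarity value; the partial files shipped with CHNOSZ hold only selected columns, so the column positions would differ there. The sample rows and the function name filter_hits are hypothetical.

```python
import csv
import io

# Hypothetical rows in the standard 12-column tabular BLAST layout:
# query, subject, % identity, alignment length, mismatches, gap opens,
# q.start, q.end, s.start, s.end, E value, bit score.
SAMPLE = """\
q1\tgi|111|ref|A|\t25.0\t100\t75\t0\t1\t100\t1\t100\t1e-20\t90.0
q1\tgi|222|ref|B|\t45.0\t100\t55\t0\t1\t100\t1\t100\t1e-10\t70.0
q2\tgi|333|ref|C|\t80.0\t50\t10\t0\t1\t50\t1\t50\t0.5\t20.0
q3\tgi|444|ref|D|\t60.0\t80\t32\t0\t1\t80\t1\t80\t1e-5\t50.0
q3\tgi|555|ref|E|\t90.0\t80\t8\t0\t1\t80\t1\t80\t1e-30\t120.0
"""


def filter_hits(handle, max_evalue=1e-5, min_similarity=30.0):
    """Return the first hit per query that passes both thresholds."""
    kept = {}
    for row in csv.reader(handle, delimiter="\t"):
        query, subject = row[0], row[1]
        similarity, evalue = float(row[2]), float(row[10])
        # Drop hits with E value above the cutoff or similarity not above 30
        if evalue > max_evalue or similarity <= min_similarity:
            continue
        # setdefault keeps only the first passing hit for each query
        kept.setdefault(query, (subject, similarity, evalue))
    return kept


hits = filter_hits(io.StringIO(SAMPLE))
for query, (subject, similarity, evalue) in hits.items():
    print(query, subject, similarity, evalue)
```

In this sketch the thresholds are applied before the first hit is chosen, so a query whose top hit fails the similarity test can still be represented by a later hit. Whether read.blast orders the two steps the same way is not stated above.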

Files in cpetc contain heat capacity data and other thermodynamic properties:

PM90.csv contains heat capacities of four unfolded aqueous proteins taken from Privalov and Makhatadze, 1990. Names of proteins are in the first column, temperature in $^{\circ}$C in the second, and heat capacities in J mol$^{-1}$ K$^{-1}$ in the third. See ionize for an example that uses this file.
RH95.csv contains heat capacity data for iron taken from Robie and Hemingway, 1995. Temperature in Kelvin is in the first column, heat capacity in J K$^{-1}$ mol$^{-1}$ in the second. See subcrt for an example that uses this file.
RT71.csv contains pH titration measurements for unfolded lysozyme (LYSC_CHICK) taken from Roxby and Tanford, 1971. pH is in the first column, net charge in the second. See ionize for an example that uses this file.
SOJSH.csv contains experimental equilibrium constants for the reaction NaCl(aq) = Na+ + Cl- as a function of temperature and pressure taken from Fig. 1 of Shock et al., 1992. Data were extracted from the figure using g3data (http://www.frantz.fi/software/g3data.php). See water for an example that uses this file.
Cp.CH4.HW97.csv and V.CH4.HWM96.csv contain apparent molar heat capacities and volumes of CH4 in dilute aqueous solutions reported by Hnedkovsky and Wood, 1997 and Hnedkovsky et al., 1996, respectively. See EOSregress for examples that use these files.

Files in fasta contain protein sequences:

HTCC1062.faa.xz is a FASTA file of 1354 protein sequences in the organism Pelagibacter ubique HTCC1062, downloaded from the NCBI RefSeq collection on 2009-04-12. The search term was Protein: txid335992[Organism:noexp] AND "refseq"[Filter]. See util.fasta and revisit for examples that use this file.
EF-Tu.aln consists of aligned sequences (394 amino acids) of elongation factor Tu (EF-Tu). The sequences are those taken from UniProtKB for ECOLI (Escherichia coli), THETH (Thermus thermophilus) and THEMA (Thermotoga maritima), together with reconstructed ancestral sequences taken from Gaucher et al., 2003 (maximum likelihood bacterial stem, mesophilic bacterial stem, and alternative bacterial stem). See the formation vignette for an example that uses this file.
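
For readers working outside R, the FASTA layout used by files such as HTCC1062.faa.xz can be read with a few lines of Python. The following is a minimal sketch, not a replacement for the functions described under util.fasta. The two sample records and the function name read_fasta are hypothetical.

```python
import io

# Hypothetical two-record FASTA text; real headers and sequences will differ.
SAMPLE = """\
>protA hypothetical protein one
MKT
AYI
>protB hypothetical protein two
MSDE
"""


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there is one
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    # Emit the final record
    if header is not None:
        yield header, "".join(chunks)


records = list(read_fasta(io.StringIO(SAMPLE)))
for header, sequence in records:
    print(header, len(sequence))
```

For an .xz-compressed file, the text stream returned by lzma.open(path, "rt") from the Python standard library could be passed to read_fasta in place of the in-memory sample.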

Files in protein contain protein composition data:

SGD.csv.xz is a data frame of amino acid compositions of proteins from the Saccharomyces Genome Database. It contains twenty-two columns. Values in the first column are the row numbers, the second column (OLN) has the ordered locus names of proteins, and the remaining twenty columns (Ala..Val) contain the numbers of the respective amino acids in each protein; these columns are arranged in alphabetical order based on the three-letter abbreviations for the amino acids. The source of data for SGD.csv is the file protein_properties.tab found on the FTP site of the SGD project on 2008-08-04. Blank entries were replaced with "NA" and column headings were added. See get.protein for examples that use this file.
ECO.csv.xz contains 24 columns. Values in the first column correspond to row numbers, the second column (AC) holds the accession numbers of the proteins, the third column (Name) has the names of the corresponding genes, and the fourth column (OLN) lists the ordered locus names of the proteins. The remaining twenty columns (A..Y) give the numbers of the respective amino acids in each protein and are ordered alphabetically by the one-letter abbreviations of the amino acids. The sources of data for ECO.csv are the files ECOLI.dat (ftp://ftp.expasy.org/databases/hamap/complete_proteomes/entries/bacteria) and ECOLI.fas (ftp://ftp.expasy.org/databases/hamap/complete_proteomes/fasta/bacteria) downloaded from the HAMAP (High-quality Automated and Manual Annotation of microbial Proteomes system) FTP site (Gattiker et al., 2003) on 2007-12-20. The proteins can be included in calculations using get.protein as well as get.expr; see the protactiv vignette for an example that uses the latter function.
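
The twenty amino acid count columns described above can be illustrated with a short Python sketch. It counts residues in a sequence and returns them in one-letter alphabetical order (A..Y), the ordering described for ECO.csv. The sequence and the function name composition are hypothetical, and this is not the procedure used to build the files.

```python
from collections import Counter

# One-letter codes of the twenty standard amino acids in alphabetical order
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(sequence):
    """Return counts of the twenty standard amino acids, ordered A..Y."""
    counts = Counter(sequence.upper())
    # Counter returns 0 for residues that do not occur in the sequence
    return [counts[aa] for aa in AMINO_ACIDS]


# Hypothetical ten-residue sequence
row = composition("MKTAYIAKQR")
print(dict(zip(AMINO_ACIDS, row)))
```

The columns of SGD.csv follow the three-letter abbreviations instead (Ala, Arg, Asn, Asp, Cys, ..., Val), so the same counts would appear in a different column order there.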

References

Anderson, N. L. and Anderson, N. G. (2002) The human plasma proteome: History, character and diagnostic prospects. Molecular and Cellular Proteomics 1, 845--867. http://dx.doi.org/10.1074/mcp.R200007-MCP200

Anderson, N. L. and Anderson, N. G. (2003) The human plasma proteome: History, character and diagnostic prospects (Vol. 1 (2002) 845-867). Molecular and Cellular Proteomics 2, 50. http://dx.doi.org/10.1074/mcp.A300001-MCP200

Benson, S. W. and Buss, J. H. (1958) Additivity rules for the estimation of molecular properties. Thermodynamic properties. J. Chem. Phys. 29, 546--572. http://dx.doi.org/10.1063/1.1744539

Domalski, E. S. and Hearing, E. D. (1993) Estimation of the thermodynamic properties of C-H-N-O-S-Halogen compounds at 298.15 K. J. Phys. Chem. Ref. Data 22, 805--1159. http://dx.doi.org/10.1063/1.555927

Gattiker, A., Michoud, K., Rivoire, C., Auchincloss, A. H., Coudert, E., Lima, T., Kersey, P., Pagni, M., Sigrist, C. J. A., Lachaize, C., Veuthey, A.-L., Gasteiger, E. and Bairoch, A. (2003) Automatic annotation of microbial proteomes in Swiss-Prot. Comput. Biol. Chem. 27, 49--58. http://dx.doi.org/10.1016/S1476-9271(02)00094-4

Gaucher, E. A., Thomson, J. M., Burgan, M. F. and Benner, S. A. (2003) Inferring the palaeoenvironment of ancient bacteria on the basis of resurrected proteins. Nature 425(6955), 285--288. http://dx.doi.org/10.1038/nature01977

Ghaemmaghami, S., Huh, W., Bower, K., Howson, R. W., Belle, A., Dephoure, N., O'Shea, E. K. and Weissman, J. S. (2003) Global analysis of protein expression in yeast. Nature 425(6959), 737--741. http://dx.doi.org/10.1038/nature02046

Huh, W. K., Falvo, J. V., Gerke, L. C., Carroll, A. S., Howson, R. W., Weissman, J. S. and O'Shea, E. K. (2003) Global analysis of protein localization in budding yeast. Nature 425(6959), 686--691. http://dx.doi.org/10.1038/nature02026

Ishihama, Y., Schmidt, T., Rappsilber, J., Mann, M., Hartl, F. U., Kerner, M. J. and Frishman, D. (2008) Protein abundance profiling of the Escherichia coli cytosol. BMC Genomics 9:102. http://dx.doi.org/10.1186/1471-2164-9-102

HAMAP system. HAMAP FTP directory, ftp://ftp.expasy.org/databases/hamap/

Hnedkovsky, L., Wood, R. H. and Majer, V. (1996) Volumes of aqueous solutions of CH4, CO2, H2S, and NH3 at temperatures from 298.15 K to 705 K and pressures to 35 MPa. J. Chem. Thermodyn. 28, 125--142. http://dx.doi.org/10.1006/jcht.1996.0011

Hnedkovsky, L. and Wood, R. H. (1997) Apparent molar heat capacities of aqueous solutions of CH4, CO2, H2S, and NH3 at temperatures from 304 K to 704 K at a pressure of 28 MPa. J. Chem. Thermodyn. 29, 731--747. http://dx.doi.org/10.1006/jcht.1997.0192

Joint Genome Institute (2007) Bison Pool Environmental Genome. Protein sequence files downloaded from IMG/M (http://img.jgi.doe.gov/cgi-bin/m/main.cgi?section=FindGenomes&page=findGenomes)

Privalov, P. L. and Makhatadze, G. I. (1990) Heat capacity of proteins. II. Partial molar heat capacity of the unfolded polypeptide chain of proteins: Protein unfolding effects. J. Mol. Biol. 213, 385--391. http://dx.doi.org/10.1016/S0022-2836(05)80198-6

Robie, R. A. and Hemingway, B. S. (1995) Thermodynamic Properties of Minerals and Related Substances at 298.15 K and 1 Bar ($10^5$ Pascals) Pressure and at Higher Temperatures. U. S. Geol. Surv., Bull. 2131, 461 p. http://www.worldcat.org/oclc/32590140

Roxby, R. and Tanford, C. (1971) Hydrogen ion titration curve of lysozyme in 6 M guanidine hydrochloride. Biochemistry 10, 3348--3352. http://dx.doi.org/10.1021/bi00794a005

SGD project. Saccharomyces Genome Database, http://www.yeastgenome.org

Shock, E. L., Oelkers, E. H., Johnson, J. W., Sverjensky, D. A. and Helgeson, H. C. (1992) Calculation of the thermodynamic properties of aqueous species at high pressures and temperatures: Effective electrostatic radii, dissociation constants and standard partial molal properties to 1000 $^{\circ}$C and 5 kbar. J. Chem. Soc. Faraday Trans. 88, 803--826. http://dx.doi.org/10.1039/FT9928800803

YeastGFP project. Yeast GFP Fusion Localization Database, http://yeastgfp.ucsf.edu; Current location: http://yeastgfp.yeastgenome.org