get.expr: Protein Expression Data

Description

Get abundance data from a protein expression experiment and add the proteins to the working instance of CHNOSZ.

Usage

get.expr(file, idcol, abundcol, seqfile, filter=NULL, 
    is.log=FALSE, loga.total = 0)

Arguments

file

character, name of file with sequence IDs and abundance data.

idcol

character, name of the column with sequence IDs.

abundcol

character, name of the column with abundances.

seqfile

character, name of the FASTA file with protein sequences.

filter

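list, optional filter to apply: a list with one named element, where the name is the column to filter on and the value is the search term (see Details).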

is.log

logical, are the abundances in the file in logarithmic (base 10) units?

loga.total

numeric, logarithm of the total activity of amino acid residues to which the protein abundances are scaled.

Value

Returns a list with objects iprotein (the indices of the proteins in thermo$protein) and loga.ref (the logarithms of activities of the proteins).

Details

This function reads a CSV file that contains protein sequence IDs and protein abundance data. The header (first line) of this file contains the column names; the names of the columns holding the sequence IDs and protein abundances are indicated by idcol and abundcol, respectively. The sequence IDs are searched for in the accession lines of the FASTA file indicated by seqfile (using grep); a match can occur in any part of an accession line, and the first such match is used. Any IDs that are NA or cannot be found in seqfile are excluded from further consideration. The amino acid compositions of the matched proteins are computed (using read.fasta) and are added to the inventory of proteins in CHNOSZ (thermo$protein).
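The ID lookup described above can be illustrated with a minimal base R sketch. It is not the code used by get.expr: the header lines, the IDs and the helper first_match are hypothetical, and fixed=TRUE (literal rather than regular-expression matching) is an assumption made to keep the example predictable.

```r
# Hypothetical FASTA accession lines (not taken from any real file)
headers <- c(">sp|P0A6F5|CH60_ECOLI chaperonin",
             ">sp|P0A9B2|G3P1_ECOLI dehydrogenase",
             ">sp|P0A6F5|duplicate entry")
# Hypothetical sequence IDs as they might appear in the abundance file
ids <- c("P0A9B2", NA, "P0A6F5", "Q99999")

# Hypothetical helper: index of the first header containing the ID,
# or NA if the ID is missing or has no match
first_match <- function(id, headers) {
  if (is.na(id)) return(NA_integer_)
  hits <- grep(id, headers, fixed = TRUE)  # assumption: literal matching
  if (length(hits) == 0) NA_integer_ else hits[1]
}
imatch <- vapply(ids, first_match, integer(1), headers = headers,
                 USE.NAMES = FALSE)

# Exclude IDs that are NA or were not found
keep <- !is.na(imatch)
matched_ids <- ids[keep]
matched_headers <- headers[imatch[keep]]
```

Here "P0A6F5" occurs in two header lines and only the first is kept, while the NA entry and "Q99999" are dropped.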

The function returns values of the logarithms of activities of the proteins. We associate molality with activity (i.e., activity coefficients are implicitly unity). If loga.total is not NULL, the abundances of the proteins from the data file are scaled to give a logarithm of total activity of amino acid residues equal to the value in loga.total, usually set to zero (see unitize). This operation preserves the relative abundances of the proteins. If the abundances of the proteins in the file are already in logarithmic units, set is.log to TRUE.
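The scaling step can be sketched in base R as follows. This is an illustration under stated assumptions, not the implementation in get.expr or unitize: the abundances and protein lengths are made-up values, and the variable names are hypothetical.

```r
# Hypothetical abundances (linear units) and protein lengths (residues)
abund <- c(4, 1, 0.5)
plength <- c(100, 250, 400)
loga_total <- 0  # target logarithm of total activity of residues

# If the abundances were already logarithmic, they would first be
# converted with abund <- 10^abund (the role of is.log)

# Scale so that sum(plength * activity) equals 10^loga_total
scale_factor <- 10^loga_total / sum(plength * abund)
loga <- log10(abund * scale_factor)

# Total activity of residues after scaling
total_residues <- sum(plength * 10^loga)
# Relative abundances before and after scaling
ratio_before <- abund / abund[1]
ratio_after <- 10^loga / 10^loga[1]
```

Because every abundance is multiplied by the same factor, the ratios between proteins are unchanged while the total activity of residues reaches the target value.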

If seqfile is one of SGD, ECO or HUM, it refers to the database of amino acid compositions of proteins packaged with CHNOSZ for Saccharomyces cerevisiae, Escherichia coli or Homo sapiens, respectively. In this case, the search for matching IDs is performed using get.protein.

The data file can be filtered by using filter. This argument should be a list with one element, the name of which indicates the column to apply the filter to, and the value of which is a search term.
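A minimal base R sketch of how such a one-element filter list could be applied to a data frame is shown below. The table contents are hypothetical, and the use of grepl (pattern matching rather than exact comparison) is an assumption; the source does not state how get.expr matches the search term.

```r
# Hypothetical abundance table
dat <- data.frame(
  ID = c("A1", "B2", "C3"),
  description = c("pyruvate kinase", "chaperonin", "adenylate kinase"),
  emPAI = c(1.2, 3.4, 0.7),
  stringsAsFactors = FALSE
)
# One-element list: the name is the column, the value is the search term
filter <- list(description = "kinase")

column <- names(filter)[1]
term <- filter[[1]]
# Keep rows whose value in that column contains the search term
keep <- grepl(term, dat[[column]])
filtered <- dat[keep, ]
```

With these values, only the two rows whose description contains "kinase" remain.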

Examples

data(thermo)
# let's use a sample data file
file <- system.file("extdata/abundance/ISR+08.csv", package="CHNOSZ")
# read the abundances and get the proteins from ECO.csv
expr <- get.expr(file, "ID", "emPAI", "ECO")
# what if we just wanted kinases?
expr <- get.expr(file, "ID", "emPAI", "ECO", filter=list(description="kinase"))
# the abundances were scaled so that the total activity of residues is unity
pl <- protein.length(-expr$iprotein)
stopifnot(all.equal(sum(pl*10^expr$loga.ref), 1))
# see the 'protactiv' vignette for comparison with equilibrium calculations

# if you want to read the protein sequences from a FASTA file...
# e <- get.expr(file, "ID", "emPAI", "ECOLI.fasta")