This function reads a CSV file that contains protein sequence IDs and protein abundance data. The header (first line) of this file contains the column names; the columns holding the sequence IDs and protein abundances are named by idcol and abundcol, respectively. The sequence IDs are searched for in the accession lines of the FASTA file indicated by seqfile (using grep); a match can occur in any part of an accession line, and the first such match is used. Any IDs that are NA or cannot be found in seqfile are excluded from further consideration. The amino acid compositions of the matched proteins are computed (using read.fasta) and added to the inventory of proteins in CHNOSZ (thermo$protein). The function returns the logarithms of activities of the proteins. Molality is equated with activity (i.e., activity coefficients are implicitly unity). If loga.total is not NULL, the abundances of the proteins from the data file are scaled so that the logarithm of the total activity of amino acid residues equals the value of loga.total, which is usually set to zero (see unitize). This operation preserves the relative abundances of the proteins. If the abundances of the proteins in the file are already in logarithmic units, set is.log to TRUE.
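The matching and scaling steps described above can be sketched in base R. This is only an illustration and not the code used by CHNOSZ: the helper names match_ids and scale_abundances, the example accession lines, and the protein lengths are all hypothetical. The sketch also assumes that IDs are matched as literal strings (fixed = TRUE) and that logarithms are base 10.

```r
# Hypothetical helper: for each ID, return the index of the first accession
# line that contains it anywhere, or NA if the ID is NA or has no match.
match_ids <- function(ids, headers) {
  vapply(ids, function(id) {
    if (is.na(id)) return(NA_integer_)
    hits <- grep(id, headers, fixed = TRUE)  # assumption: literal matching
    if (length(hits) == 0) NA_integer_ else hits[1]
  }, integer(1), USE.NAMES = FALSE)
}

# Hypothetical helper: shift the log activities by a constant so that the
# total activity of amino acid residues equals 10^loga.total.
scale_abundances <- function(abundance, length, loga.total = 0, is.log = FALSE) {
  logact <- if (is.log) abundance else log10(abundance)
  total <- sum(10^logact * length)  # residues contributed by all proteins
  logact + loga.total - log10(total)
}

# Made-up accession lines and chain lengths (number of residues per protein)
headers <- c(">sp|P00001|PROT_A first protein",
             ">sp|P00002|PROT_B second protein",
             ">sp|P00001|PROT_A duplicate entry")
lengths <- c(100, 200, 100)

# Made-up data file columns
ids <- c("P00002", NA, "P99999", "P00001")
abundance <- c(10, 5, 3, 1)

idx <- match_ids(ids, headers)
keep <- !is.na(idx)  # drop IDs that are NA or not found
loga <- scale_abundances(abundance[keep], lengths[idx[keep]], loga.total = 0)
```

Because the same constant is added to every log activity, the differences between the proteins (their relative abundances) are unchanged by the scaling.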
If seqfile is one of SGD, ECO or HUM, it refers to the database of amino acid compositions of proteins packaged with CHNOSZ for Saccharomyces cerevisiae, Escherichia coli or Homo sapiens, respectively. In this case, the search for matching IDs is performed using get.protein.
The data file can be filtered by using filter. This argument should be a list with one element, the name of which indicates the column to apply the filter to, and the value of which is a search term.
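As a rough illustration of the shape of the filter argument, the following base R sketch keeps only the rows whose value in the named column matches the search term. The helper apply_filter, the example data frame, and its column names are hypothetical, and matching the term with grep is an assumption rather than something stated here.

```r
# Hypothetical helper: keep the rows of `data` whose value in the column
# named by the single element of `filter` matches that element's search term.
apply_filter <- function(data, filter) {
  stopifnot(is.list(filter), length(filter) == 1, !is.null(names(filter)))
  column <- names(filter)[1]
  stopifnot(column %in% colnames(data))
  rows <- grep(filter[[1]], data[[column]])  # assumption: grep-style matching
  data[rows, , drop = FALSE]
}

# Made-up abundance table
expr <- data.frame(
  id = c("P1", "P2", "P3"),
  localization = c("cytoplasm", "nucleus", "cytoplasm,nucleus"),
  abundance = c(120, 45, 300),
  stringsAsFactors = FALSE
)

# One-element list: the name is the column, the value is the search term
nuclear <- apply_filter(expr, list(localization = "nucleus"))
```

In this sketch the row for P3 is retained as well, because the search term matches anywhere in the column value.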