Data for amino acid compositions of proteins and conversion from old to new UniProt IDs.
human_aa is a data frame with 25 columns in the format used for amino acid compositions in CHNOSZ (see thermo):
protein | character | Identification of protein
organism | character | Identification of organism
ref | character | Reference key for source of sequence data
abbrv | character | Abbreviation or other ID for protein (e.g. gene name)
chains | numeric | Number of polypeptide chains in the protein
The protein column contains UniProt IDs in the format database|accession-isoform, where database is most often sp (Swiss-Prot) or tr (TrEMBL), and isoform is an optional suffix indicating the isoform of the protein (particularly in the human_additional file).
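The ID format described above can be taken apart with base R string functions. The following is a minimal sketch, not part of the package: parse_id is a hypothetical helper, and the example ID strings are included only to illustrate the format.

```r
# Hypothetical helper (not a package function): split an ID of the form
# database|accession-isoform into its parts.
parse_id <- function(id) {
  # Separate the database prefix from the accession (and optional isoform)
  parts <- strsplit(id, "|", fixed = TRUE)[[1]]
  acc_iso <- strsplit(parts[2], "-", fixed = TRUE)[[1]]
  data.frame(
    database = parts[1],
    accession = acc_iso[1],
    # The isoform suffix is optional, so use NA when it is absent
    isoform = if (length(acc_iso) > 1) acc_iso[2] else NA_character_,
    stringsAsFactors = FALSE
  )
}

# Example IDs, for illustration only
ids <- c("sp|P04637", "sp|P04637-2", "tr|A0A024R161")
parsed <- do.call(rbind, lapply(ids, parse_id))
print(parsed)
```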
The amino acid compositions of human proteins are stored in three files under extdata/protein.
human_base.rds contains amino acid compositions of canonical isoforms of manually reviewed proteins in the UniProt reference human proteome (computed from sequences in UP000005640_9606.fasta.gz, dated 2016-04-03).
human_additional.rds contains amino acid compositions of additional proteins (UP000005640_9606_additional.fasta.gz) including isoforms and unreviewed sequences. In version 0.1.5, this file was trimmed to include only those proteins that are used in any of the datasets in the package.
human_extra.csv contains amino acid compositions of other (“extra”) proteins used in a dataset but not listed in one of the files above. These proteins may include obsolete, unreviewed, or newer additions to the UniProt database. Most, but not all, sequences here are HUMAN (see the organism column for the source organism and the ref column for the reference keys).
On loading the package, the individual data files are read and combined, and the result is assigned to the human_aa object in the human environment.
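The read-combine-assign pattern described above can be sketched in base R. This is an illustration under stated assumptions, not the package's actual loading code: the temporary files, the toy columns (only two amino acid columns instead of twenty), the placeholder IDs, and the demo_env environment are all stand-ins.

```r
# Build toy stand-ins for the three data files (hypothetical content)
make_aa <- function(protein) {
  data.frame(protein = protein, organism = "HUMAN", ref = "demo",
             abbrv = "demo", chains = 1, Ala = 10, Gly = 5,
             stringsAsFactors = FALSE)
}
tmp <- tempfile()
dir.create(tmp)
saveRDS(make_aa("sp|DEMO01"), file.path(tmp, "base.rds"))
saveRDS(make_aa("sp|DEMO01-2"), file.path(tmp, "additional.rds"))
write.csv(make_aa("tr|DEMO02"), file.path(tmp, "extra.csv"), row.names = FALSE)

# Read the individual files and combine them row-wise
combined <- rbind(
  readRDS(file.path(tmp, "base.rds")),
  readRDS(file.path(tmp, "additional.rds")),
  read.csv(file.path(tmp, "extra.csv"), stringsAsFactors = FALSE)
)

# Assign the combined data frame to an object in a dedicated environment
demo_env <- new.env()
assign("human_aa", combined, envir = demo_env)

# Retrieve it the same way the examples on this page do
n_proteins <- nrow(get("human_aa", demo_env))
print(n_proteins)
```

Keeping the combined table in its own environment avoids placing a large object in the user's workspace while still making it available through get().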
As an aid for processing datasets that list old (obsolete) UniProt IDs, the corresponding new (current) IDs are stored in uniprot_updates.
These ID mappings have been manually added as needed for individual datasets, and include proteins from humans as well as other organisms.
check_IDs performs the conversion of old to new IDs.
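The idea of an old-to-new ID lookup can be sketched with match() in base R. This is not the implementation of check_IDs: update_ids is a hypothetical helper, the column names old and new are assumptions about the layout of the mapping table, and the IDs are placeholders.

```r
# Toy mapping table; the column names "old" and "new" are assumed
updates <- data.frame(
  old = c("OLD001", "OLD002"),
  new = c("NEW001", "NEW002"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: replace any ID found in the "old" column
# with its "new" counterpart, and leave all other IDs unchanged
update_ids <- function(ids, updates) {
  i <- match(ids, updates$old)
  ifelse(is.na(i), ids, updates$new[i])
}

converted <- update_ids(c("OLD002", "KEEP01", "OLD001"), updates)
print(converted)
```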
Amino acid compositions of non-human proteins are stored under extdata/aa in directories archaea, bacteria, cow, dog, mouse, rat, and yeast.
These files can be loaded in protcomp via the aa_file argument, which is used e.g. in pdat_osmotic_bact.
# NOT RUN {
# The number of proteins
nrow(get("human_aa", human))
# The number of old to new ID mappings
nrow(get("uniprot_updates", human))
# }