canprot (version 1.1.2)

check_IDs: Check UniProt IDs

Description

Find the first ID for each protein that matches a known UniProt ID.

Usage

check_IDs(dat, IDcol, aa_file = NULL, updates_file = NULL)

Value

dat is returned, possibly with changed values in the column named by IDcol: old IDs are replaced with new ones, the first known ID for each protein is kept, and proteins with no known IDs are assigned NA.

Arguments

dat

data frame, protein expression data

IDcol

character, name of column that has the UniProt IDs

aa_file

character, name of file with additional amino acid compositions

updates_file

character, name of file with old to new ID mappings

Details

check_IDs checks for known UniProt IDs and updates obsolete ones. The source IDs should be provided in the IDcol column of dat; multiple IDs for one protein can be given in a single entry, separated by semicolons.

The function keeps the first “known” ID for each protein, which must be present in one of these groups:

  • The human_aa dataset of amino acid compositions.

  • Old UniProt IDs that are mapped to new UniProt IDs in uniprot_updates or in updates_file if specified.

  • IDs of proteins in aa_file, which lists amino acid compositions in the format described for human_aa (see extdata/protein/human_extra.csv for an example and thermo$protein for more details).
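The selection rule described above can be sketched in base R. This is only an illustration of the logic, not the canprot implementation: the function first_known_id, the vector known_ids, the mapping id_updates, and the IDs OLD001 and NEW001 are all hypothetical stand-ins for human_aa, aa_file, and uniprot_updates. The sketch assumes that an old ID counts as known once it has been mapped to its new ID.

```r
# Hypothetical lookup tables for illustration (not the canprot datasets)
known_ids <- c("P61247", "P46777", "P60174")
id_updates <- c(OLD001 = "NEW001")  # names are old IDs, values are new IDs

# Hypothetical helper: return the first known ID in one semicolon-separated entry
first_known_id <- function(id_string, known, updates) {
  # Split the entry into individual IDs
  ids <- strsplit(id_string, ";", fixed = TRUE)[[1]]
  # Replace each old ID with the new ID it maps to
  is_old <- ids %in% names(updates)
  ids[is_old] <- unname(updates[ids[is_old]])
  # An ID is treated as known if it was updated or is in the known set
  is_known <- is_old | ids %in% known
  hit <- ids[is_known]
  # Keep the first known ID, or NA if there is none
  if (length(hit) > 0) hit[1] else NA_character_
}

ID <- c("P61247;PXXXXX", "PYYYYY;P46777;P60174", "PZZZZZ", "OLD001")
result <- vapply(ID, first_known_id, character(1),
                 known = known_ids, updates = id_updates, USE.NAMES = FALSE)
result
```

Under these assumptions the first two entries resolve to their first matching ID, the third has no known ID and becomes NA, and the fourth is replaced by its mapped new ID.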

See Also

This function is used by the pdat_ functions, where it is called before cleanup.

Examples

# Make up some data for this example
ID <- c("P61247;PXXXXX", "PYYYYY;P46777;P60174", "PZZZZZ")
dat <- data.frame(ID = ID, stringsAsFactors = FALSE)
# Get the first known ID for each protein; the third one is NA
check_IDs(dat, "ID")

# Update an old ID
dat <- data.frame(Entry = "P50224", stringsAsFactors = FALSE)
check_IDs(dat, "Entry")
