canprot (version 1.1.0)

check_IDs: Check UniProt IDs

Description

Find the first ID for each protein that matches a known UniProt ID.

Usage

check_IDs(dat, IDcol, aa_file = NULL, updates_file = NULL)

Arguments

dat

data frame, protein expression data

IDcol

character, name of column that has the UniProt IDs

aa_file

character, name of file with additional amino acid compositions

updates_file

character, name of file with old to new ID mappings

Value

dat is returned with possibly changed values in the column named by IDcol: old IDs are replaced with new ones, only the first known ID for each protein is kept, and proteins with no known IDs are assigned NA.

Details

check_IDs is used to check for known UniProt IDs and to update obsolete IDs. The source IDs should be provided in the IDcol column of dat; multiple IDs for one protein can be given in a single entry, separated by semicolons.

The function keeps the first “known” ID for each protein, which must be present in one of these groups:

  • The human_aa dataset of amino acid compositions.

  • Old UniProt IDs that are mapped to new UniProt IDs in uniprot_updates or in updates_file if specified.

  • IDs of proteins in aa_file, which lists amino acid compositions in the format described for human_aa (see extdata/protein/human_extra.csv for an example and thermo$protein for more details).
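
The selection rule described above can be sketched in base R. This is only an illustration under stated assumptions, not the canprot implementation: first_known_ID is a hypothetical helper, known stands in for the IDs that have amino acid compositions (human_aa plus any aa_file), and updates is a hypothetical named vector whose names are old IDs and whose values are new IDs. The ID NEW001 and its predecessor OLD001 are made up for the example.

```r
# Hypothetical helper (not part of canprot): keep the first known ID
# from each semicolon-separated entry, then update old IDs
first_known_ID <- function(ids, known, updates = character(0)) {
  # Assumption: an ID counts as known if it has a composition
  # or if it appears as an old ID in the updates mapping
  all_known <- union(known, names(updates))
  first <- vapply(strsplit(ids, ";", fixed = TRUE), function(candidates) {
    hit <- candidates[candidates %in% all_known]
    # No known candidate: the protein gets NA
    if (length(hit) > 0) hit[1] else NA_character_
  }, character(1))
  # Replace old IDs with their new counterparts
  is_old <- first %in% names(updates)
  first[is_old] <- unname(updates[first[is_old]])
  first
}

# Made-up inputs for illustration
known <- c("P61247", "P46777", "P60174", "NEW001")
updates <- c(OLD001 = "NEW001")
ids <- c("P61247;PXXXXX", "PYYYYY;P46777;P60174", "PZZZZZ", "OLD001")
result <- first_known_ID(ids, known, updates)
print(result)
# [1] "P61247" "P46777" NA       "NEW001"
```

Whether the real function updates old IDs before or after picking the first known one is not stated here; the sketch picks first and updates afterwards.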

See Also

This function is used by the pdat_ functions, where it is called before cleanup.

Examples

# NOT RUN {
# Make up some data for this example
ID <- c("P61247;PXXXXX", "PYYYYY;P46777;P60174", "PZZZZZ")
dat <- data.frame(ID = ID, stringsAsFactors = FALSE)
# Get the first known ID for each protein; the third one is NA
check_IDs(dat, "ID")

# Update an old ID
dat <- data.frame(Entry = "P50224", stringsAsFactors = FALSE)
check_IDs(dat, "Entry")
# }