canprot (version 1.0.0)

cleanup: Clean Up Data

Description

Remove proteins with unavailable IDs, ambiguous expression ratios, and duplicated IDs.

Usage

cleanup(dat, IDcol, up2 = NULL)

Arguments

dat

data frame, protein expression data

IDcol

character, name of the column that has the UniProt IDs; or logical, selecting proteins to be removed (see Details)

up2

logical, one value for each protein in dat: TRUE for up-regulated proteins, FALSE for down-regulated proteins

Details

cleanup is used in the pdat_ functions to clean up the dataset given in dat. IDcol is the name of the column that has the UniProt IDs, and up2 indicates the expression change for each protein. The function removes proteins with unavailable (NA or "") or duplicated IDs. If up2 is provided, the function also removes unquantified proteins (those that have NA values of up2) and those with ambiguous expression ratios (both up and down for the same ID). For each operation, a message is printed giving the number of proteins that are unavailable, unquantified, ambiguous, or duplicated.

Alternatively, if IDcol is a logical value, it selects proteins to be unconditionally removed.
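
The steps described above can be illustrated with a minimal sketch in base R. This is not the canprot source: cleanup_sketch, drop_rows, and the toy data frame are hypothetical names made up for illustration, and the order of the operations, the message wording, and the choice to keep the first occurrence of a duplicated ID are assumptions.

```r
# Hypothetical re-implementation for illustration only; not the canprot source.
cleanup_sketch <- function(dat, IDcol, up2 = NULL) {
  # Helper: drop the flagged rows and report how many were removed
  drop_rows <- function(dat, flag, what) {
    if (any(flag)) message("cleanup_sketch: dropping ", sum(flag), " ", what, " proteins")
    dat[!flag, , drop = FALSE]
  }
  # A logical IDcol selects rows to be removed unconditionally
  if (is.logical(IDcol)) return(drop_rows(dat, IDcol, "selected"))
  # Unavailable IDs: NA or ""
  unav <- is.na(dat[[IDcol]]) | dat[[IDcol]] == ""
  dat <- drop_rows(dat, unav, "unavailable")
  if (!is.null(up2)) {
    # Keep up2 aligned with the remaining rows
    up2 <- up2[!unav]
    # Unquantified proteins: NA values of up2
    unquant <- is.na(up2)
    dat <- drop_rows(dat, unquant, "unquantified")
    up2 <- up2[!unquant]
    # Ambiguous proteins: the same ID appears as both up and down
    ids <- dat[[IDcol]]
    ambi <- ids %in% intersect(ids[up2], ids[!up2])
    dat <- drop_rows(dat, ambi, "ambiguous")
  }
  # Duplicated IDs: keep the first occurrence (an assumption)
  drop_rows(dat, duplicated(dat[[IDcol]]), "duplicated")
}

# Toy data: two unavailable IDs, one unquantified protein (P5),
# one ambiguous ID (P3), and one duplicated ID (P4)
toy <- data.frame(
  Entry = c("P1", "P2", "", NA, "P3", "P3", "P4", "P4", "P5"),
  ratio = c(2.0, 0.5, 1.5, 3.0, 2.0, 0.4, 1.8, 2.2, NA),
  stringsAsFactors = FALSE
)
up2 <- toy$ratio > 1
cleaned <- cleanup_sketch(toy, "Entry", up2)
cleaned$Entry  # P1, P2, P4 remain

# Logical IDcol: remove the rows that have no ratio, regardless of ID
no_ratio_removed <- cleanup_sketch(toy, is.na(toy$ratio))
```

In the sketch, up2 is subset alongside dat after each removal so that the two stay aligned; whether cleanup handles this the same way internally is not stated here.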

See Also

This function is used extensively by the pdat_ functions, where it is called after check_IDs (if needed).

Examples

# NOT RUN {
# Set up a simple workflow
extdatadir <- system.file("extdata", package="canprot")
datadir <- paste0(extdatadir, "/expression/pancreatic/")
dataset <- "CYD+05"
dat <- read.csv(paste0(datadir, dataset, ".csv.xz"), as.is = TRUE)
up2 <- dat$Ratio..cancer.normal. > 1
# Remove two unavailable proteins and one duplicated protein
dat <- cleanup(dat, "Entry", up2)
# Now we can calculate the chemical compositions
pcomp <- protcomp(dat$Entry)

# Read another data file
datadir <- paste0(system.file("extdata", package="canprot"), "/expression/colorectal/")
dataset <- "STK+15"
dat <- read.csv(paste0(datadir, dataset, ".csv.xz"), as.is = TRUE)
# Remove unavailable proteins
dat <- cleanup(dat, "uniprot")
# Remove proteins whose expression ratio is less than 2-fold
dat <- cleanup(dat, abs(log2(dat$invratio)) < 1)
# }
