xCrosstalk: Function to identify a pathway crosstalk

Description

xCrosstalk identifies the maximum-scoring pathway crosstalk from an input graph whose nodes carry significance information (measured as p-values or FDR). It returns an object of class "cPath".

Usage

xCrosstalk(data, entity = c("Gene", "GR"), significance.threshold = NULL,
  score.cap = NULL,
  build.conversion = c(NA, "hg38.to.hg19", "hg18.to.hg19"),
  crosslink = c("genehancer", "PCHiC_combined", "GTEx_V6p_combined",
    "nearby"),
  crosslink.customised = NULL,
  cdf.function = c("original", "empirical"),
  scoring.scheme = c("max", "sum", "sequential"),
  nearby.distance.max = 50000,
  nearby.decay.kernel = c("rapid", "slow", "linear", "constant"),
  nearby.decay.exponent = 2,
  networks = c("KEGG", "KEGG_metabolism", "KEGG_genetic",
    "KEGG_environmental", "KEGG_cellular", "KEGG_organismal",
    "KEGG_disease", "REACTOME", "PCommonsDN_Reactome"),
  seed.genes = T, subnet.significance = 0.01, subnet.size = NULL,
  ontologies = c("KEGGenvironmental", "KEGG", "KEGGmetabolism",
    "KEGGgenetic", "KEGGcellular", "KEGGorganismal", "KEGGdisease"),
  size.range = c(10, 2000), min.overlap = 10, fdr.cutoff = 0.05,
  crosstalk.top = NULL, glayout = layout_with_kk, verbose = T,
  RData.location = "http://galahad.well.ox.ac.uk/bigdata")

Arguments

data

a named input vector containing the significance level for genes (gene symbols) or genomic regions (GR). For this named vector, the element names are gene symbols or GR (in the format of 'chrN:start-end', where N is either 1-22 or X, start/end is genomic positional number; for example, 'chr1:13-20'), the element values for the significance level (measured as p-value or fdr). Alternatively, it can be a matrix or data frame with two columns: 1st column for gene symbols or GR, 2nd column for the significance level. Also supported is the input with GR only (without the significance level)
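As a minimal sketch of the accepted input shapes, the snippet below builds each form in base R. The gene symbols, regions and significance values are made up for illustration, and `gr_pattern` is a hypothetical helper pattern, not part of XGR.

```r
# Hypothetical significance values, for illustration only

# 1) named vector keyed by gene symbol
data_gene <- c(TP53 = 1e-6, STAT1 = 3e-4, JAK2 = 0.02)

# 2) equivalent two-column data frame: identifier first, significance second
df_gene <- data.frame(Gene = names(data_gene), Sig = unname(data_gene),
                      stringsAsFactors = FALSE)

# 3) genomic regions written as 'chrN:start-end'
data_gr <- c("chr1:13-20" = 1e-8, "chrX:1500-1600" = 5e-3)

# 4) regions only, without significance values
gr_only <- names(data_gr)

# A simple format check for region names (N is 1-22 or X)
gr_pattern <- "^chr([1-9]|1[0-9]|2[0-2]|X):[0-9]+-[0-9]+$"
is_gr <- grepl(gr_pattern, names(data_gr))
```

The first two forms would be passed with entity="Gene" and the last two with entity="GR".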

entity

the entity. It can be either "Gene" or "GR"

significance.threshold

the given significance threshold. By default, it is set to NULL, meaning there is no constraint on the significance level when transforming the significance level into scores. If given, those below this threshold are considered significant and thus scored positively, whereas those above it are considered insignificant and thus receive no score

score.cap

the maximum score being capped. By default, it is set to NULL, meaning that no capping is applied

build.conversion

the conversion from one genome build to another. The conversions supported are "hg38.to.hg19" and "hg18.to.hg19". By default, it is NA (no conversion is performed)

crosslink

the built-in crosslink info with a score quantifying the link of a GR to a gene. See xGR2xGenes for details

crosslink.customised

the crosslink info with a score quantifying the link of a GR to a gene. A user-input matrix or data frame with 4 columns: 1st column for genomic regions (formatted as "chr:start-end", genome build 19), 2nd column for Genes, 3rd for crosslink score (crosslinking a genomic region to a gene, such as -log10 significance level), and 4th for contexts (optional; if not provided, it will be added as 'C'). Alternatively, it can be a file containing these 4 columns. Required, otherwise it will return NULL

cdf.function

a character specifying how to transform the input crosslink score. It can be one of 'original' (no such transformation), and 'empirical' for looking at empirical Cumulative Distribution Function (cdf; as such it is converted into pvalue-like values [0,1])
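The difference between the two options can be sketched with base R. This is only an illustration using `stats::ecdf` on made-up scores; whether XGR computes the empirical CDF in exactly this way is an assumption.

```r
# Hypothetical raw crosslink scores (illustrative values)
score <- c(0.5, 2, 2, 10, 40)

# 'original': leave the scores untouched
score_original <- score

# 'empirical': map each score to its empirical CDF value,
# i.e. the fraction of scores less than or equal to it
score_empirical <- stats::ecdf(score)(score)
score_empirical
```

After the transformation every value lies in [0,1], and ties receive the same value.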

scoring.scheme

the method used to calculate seed gene scores under a set of GR (also over Contexts if many). It can be one of "sum" for adding up, "max" for the maximum, and "sequential" for the sequential weighting. The sequential weighting is done via: \(\sum_{i=1}{\frac{R_{i}}{i}}\), where \(R_{i}\) is the \(i^{th}\) rank (in decreasing order)
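The three schemes can be sketched for a single gene as follows. `combine_scores` is a hypothetical helper written for this illustration, not an XGR function, and the input scores are made up.

```r
# Combine a gene's per-region scores under each scoring scheme.
# `combine_scores` is a hypothetical helper, not an XGR function.
combine_scores <- function(x, scheme = c("max", "sum", "sequential")) {
  scheme <- match.arg(scheme)
  switch(scheme,
    max = max(x),
    sum = sum(x),
    sequential = {
      r <- sort(x, decreasing = TRUE)  # R_i: the i-th largest score
      sum(r / seq_along(r))            # sum over i of R_i / i
    }
  )
}

x <- c(3, 1, 2)                  # made-up scores linking three GR to one gene
combine_scores(x, "max")         # 3
combine_scores(x, "sum")         # 6
combine_scores(x, "sequential")  # 3/1 + 2/2 + 1/3
```

The sequential scheme sits between the other two: the top score counts in full, while each further score is down-weighted by its rank.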

nearby.distance.max

the maximum distance between genes and GR. Only those genes no farther away than this distance will be considered as seed genes. This parameter will influence the distance-component weights calculated for nearby GR per gene

nearby.decay.kernel

a character specifying a decay kernel function. It can be one of 'slow' for slow decay, 'linear' for linear decay, and 'rapid' for rapid decay. If no distance weight is used, please select 'constant'

nearby.decay.exponent

a numeric specifying a decay exponent. By default, it is set to 2
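This page does not give the kernel formulas. The sketch below uses functional forms chosen only to match the verbal descriptions (weight 1 at distance 0, weight 0 at the maximum distance); both these forms and the helper `decay_weight` are assumptions for illustration, not the package's definition.

```r
# `decay_weight` is a hypothetical helper; the functional forms are
# illustrative assumptions, not taken from this page.
decay_weight <- function(d, d_max = 50000, exponent = 2,
                         kernel = c("rapid", "slow", "linear", "constant")) {
  kernel <- match.arg(kernel)
  x <- pmin(d / d_max, 1)  # distance rescaled to [0, 1]
  switch(kernel,
    rapid    = (1 - x)^exponent,  # drops quickly close to the gene
    slow     = 1 - x^exponent,    # stays high, then drops near d_max
    linear   = 1 - x,
    constant = rep(1, length(x))  # no distance weighting
  )
}

# Weights at four distances (in bp) under each kernel
d <- c(0, 12500, 25000, 50000)
sapply(c("rapid", "slow", "linear", "constant"),
       function(k) decay_weight(d, kernel = k))
```

Under these assumed forms, a larger exponent makes the 'rapid' kernel fall off faster and the 'slow' kernel fall off later.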

networks

the built-in network. For direct (pathway-merged) interactions sourced from KEGG, it can be 'KEGG' for all, 'KEGG_metabolism' for pathways grouped into 'Metabolism', 'KEGG_genetic' for 'Genetic Information Processing' pathways, 'KEGG_environmental' for 'Environmental Information Processing' pathways, 'KEGG_cellular' for 'Cellular Processes' pathways, 'KEGG_organismal' for 'Organismal Systems' pathways, and 'KEGG_disease' for 'Human Diseases' pathways. It can also be 'REACTOME' for protein-protein interactions derived from Reactome pathways, or, for the Pathway Commons pathway-merged network from individual sources, "PCommonsDN_Reactome" for those from Reactome

seed.genes

logical to indicate whether the identified network is restricted to seed genes (i.e. input genes with the significance level). By default, it is set to TRUE

subnet.significance

the given significance threshold. By default, it is set to 0.01 (see Usage). If set to NULL, there is no constraint on nodes/genes. If given, those nodes/genes with p-values below this are considered significant and thus scored positively, whereas those with p-values above this threshold are considered insignificant and thus scored negatively

subnet.size

the desired number of nodes constrained to the resulting subnet. If it is not NULL, a wide range of significance thresholds will be scanned to find the optimal significance threshold leading to the desired number of nodes in the resulting subnet. Notably, the given significance threshold will be overwritten by this option

ontologies

the ontologies supported currently. It can be 'AA' for AA-curated pathways, KEGG pathways (including 'KEGG' for all, 'KEGGmetabolism' for 'Metabolism' pathways, 'KEGGgenetic' for 'Genetic Information Processing' pathways, 'KEGGenvironmental' for 'Environmental Information Processing' pathways, 'KEGGcellular' for 'Cellular Processes' pathways, 'KEGGorganismal' for 'Organismal Systems' pathways, and 'KEGGdisease' for 'Human Diseases' pathways), 'REACTOME' for REACTOME pathways or 'REACTOME_x' for its sub-ontologies (where x can be 'CellCellCommunication', 'CellCycle', 'CellularResponsesToExternalStimuli', 'ChromatinOrganization', 'CircadianClock', 'DevelopmentalBiology', 'DigestionAndAbsorption', 'Disease', 'DNARepair', 'DNAReplication', 'ExtracellularMatrixOrganization', 'GeneExpression(Transcription)', 'Hemostasis', 'ImmuneSystem', 'Metabolism', 'MetabolismOfProteins', 'MetabolismOfRNA', 'Mitophagy', 'MuscleContraction', 'NeuronalSystem', 'OrganelleBiogenesisAndMaintenance', 'ProgrammedCellDeath', 'Reproduction', 'SignalTransduction', 'TransportOfSmallMolecules', 'VesicleMediatedTransport')

size.range

the minimum and maximum size of members of each term in consideration. By default, it is set to a minimum of 10 and a maximum of 2000

min.overlap

the minimum number of overlaps. Only those terms with members that overlap with input data at least min.overlap (10 by default) will be processed
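How size.range and min.overlap jointly restrict the terms under consideration can be sketched as below. The term names, gene identifiers and the filtering code are hypothetical and only mirror the descriptions above; they are not the package's internal implementation.

```r
# Hypothetical term-to-gene annotations and input genes
terms <- list(
  termA = paste0("G", 1:15),   # 15 members, 12 of them in the input
  termB = paste0("G", 1:5),    # 5 members: below the minimum size
  termC = paste0("G", 10:40)   # 31 members, only 3 of them in the input
)
input_genes <- paste0("G", 1:12)

size.range  <- c(10, 2000)
min.overlap <- 10

# Number of members per term, and number shared with the input
term_size <- lengths(terms)
n_overlap <- vapply(terms, function(g) length(intersect(g, input_genes)),
                    integer(1))

# Keep terms within the size range that share enough genes with the input
keep <- term_size >= size.range[1] & term_size <= size.range[2] &
  n_overlap >= min.overlap
names(terms)[keep]
```

Here only termA passes both filters: termB is too small and termC overlaps too little with the input.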

fdr.cutoff

fdr cutoff used to declare the significant terms. By default, it is set to 0.05
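A minimal sketch of applying such a cutoff is given below. The p-values are made up, and the Benjamini-Hochberg adjustment is an assumption, since this page does not name the adjustment method.

```r
# Hypothetical enrichment p-values for five terms
pvalue <- c(termA = 0.001, termB = 0.012, termC = 0.04,
            termD = 0.20, termE = 0.75)

# Benjamini-Hochberg adjustment (assumed; the method is not named on this page)
fdr <- stats::p.adjust(pvalue, method = "BH")

# Terms declared significant at the chosen cutoff
fdr.cutoff <- 0.05
significant <- names(fdr)[fdr < fdr.cutoff]
significant
```

Note that termC has a raw p-value below 0.05 but no longer passes once adjusted for multiple testing.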

crosstalk.top

the number of top paths to be returned. By default, it is NULL, meaning no such restriction

glayout

either a function or a numeric matrix configuring how the vertices will be placed on the plot. If glayout is a function, this function will be called with the graph as the single parameter to determine the actual coordinates. This function can be one of "layout_nicely" (previously "layout.auto"), "layout_randomly" (previously "layout.random"), "layout_in_circle" (previously "layout.circle"), "layout_on_sphere" (previously "layout.sphere"), "layout_with_fr" (previously "layout.fruchterman.reingold"), "layout_with_kk" (previously "layout.kamada.kawai"), "layout_as_tree" (previously "layout.reingold.tilford"), "layout_with_lgl" (previously "layout.lgl"), "layout_with_graphopt" (previously "layout.graphopt"), "layout_with_sugiyama" (previously "layout.sugiyama"), "layout_with_dh" (previously "layout.davidson.harel"), "layout_with_drl" (previously "layout.drl"), "layout_with_gem" (previously "layout.gem"), "layout_with_mds", and "layout_as_bipartite". A full explanation of these layouts can be found in http://igraph.org/r/doc/layout_nicely.html

verbose

logical to indicate whether the messages will be displayed on the screen. By default, it is set to TRUE for display

RData.location

the characters to tell the location of built-in RData files. See xRDataLoader for details

Value

an object of class "cPath", a list with the following components:

ig_paths: an object of class "igraph". It has graph attributes (enrichment, and/or evidence, gp_evidence and membership if entity is 'GR') and node attributes (crosstalk)
gp_paths: a 'ggplot' object for pathway crosstalk visualisation
gp_heatmap: a 'ggplot' object for pathway member gene visualisation
ig_subg: an object of class "igraph".

Examples

# NOT RUN {
# Load the XGR package and specify the location of built-in data
library(XGR)
RData.location <- "http://galahad.well.ox.ac.uk/bigdata/"

# 1) at the gene level
data(Haploid_regulators)
## only PD-L1 regulators and their significance info (FDR)
data <- subset(Haploid_regulators, Phenotype=='PDL1')[,c('Gene','FDR')]
## pathway crosstalk
cPath <- xCrosstalk(data, entity="Gene", networks="KEGG",
subnet.significance=0.05, subnet.size=NULL,
ontologies="KEGGenvironmental", RData.location=RData.location)
cPath
## visualisation
pdf("xCrosstalk_Gene.pdf", width=7, height=8)
gp_both <-
gridExtra::grid.arrange(grobs=list(cPath$gp_paths,cPath$gp_heatmap),
layout_matrix=cbind(c(1,1,1,1,2)))
dev.off()

# 2) at the genomic region (SNP) level
data(ImmunoBase)
## all ImmunoBase GWAS SNPs and their significance info (p-values)
ls_df <- lapply(ImmunoBase, function(x) as.data.frame(x$variant))
df <- do.call(rbind, ls_df)
data <- unique(cbind(GR=paste0(df$seqnames,':',df$start,'-',df$end),
Sig=df$Pvalue))
## pathway crosstalk
df_xGenes <- xGR2xGenes(data[as.numeric(data[,2])<5e-8,1],
format="chr:start-end", crosslink="PCHiC_combined", scoring=T,
RData.location=RData.location)
mSeed <- xGR2xGeneScores(data, significance.threshold=5e-8,
crosslink="PCHiC_combined", RData.location=RData.location)
subg <- xGR2xNet(data, significance.threshold=5e-8,
crosslink="PCHiC_combined", network="KEGG", subnet.significance=0.1,
RData.location=RData.location)
cPath <- xCrosstalk(data, entity="GR", significance.threshold=5e-8,
crosslink="PCHiC_combined", networks="KEGG", subnet.significance=0.1,
ontologies="KEGGenvironmental", RData.location=RData.location)
cPath
## visualisation
pdf("xCrosstalk_SNP.pdf", width=7, height=8)
gp_both <-
gridExtra::grid.arrange(grobs=list(cPath$gp_paths,cPath$gp_heatmap),
layout_matrix=cbind(c(1,1,1,1,2)))
dev.off()

# 3) at the genomic region (without the significance info) level
Age_CpG <- xRDataLoader(RData.customised='Age_CpG',
RData.location=RData.location)[-1,1]
CgProbes <- xRDataLoader(RData.customised='CgProbes',
RData.location=RData.location)
ind <- match(Age_CpG, names(CgProbes))
gr_CpG <- CgProbes[ind[!is.na(ind)]]
data <- xGRcse(gr_CpG, format='GRanges')
## pathway crosstalk
df_xGenes <- xGR2xGenes(data, format="chr:start-end",
crosslink="PCHiC_combined", scoring=T, RData.location=RData.location)
subg <- xGR2xNet(data, crosslink="PCHiC_combined", network="KEGG",
subnet.significance=0.1, RData.location=RData.location)
cPath <- xCrosstalk(data, entity="GR", crosslink="PCHiC_combined",
networks="KEGG", subnet.significance=0.1,
ontologies="KEGGenvironmental", RData.location=RData.location)
cPath
# }
