panMatrix: Computing the pan-matrix for a set of gene clusters

Description

A pan-matrix has one row for each genome and one column for each gene cluster, and cell [i,j] indicates how many members genome i has in gene cluster j.

Usage

panMatrix(clustering)

Arguments

clustering

A named vector of integers.

Value

An integer matrix with a row for each genome and a column for each sequence cluster. The input vector clustering is attached as the attribute clustering.

Details

The pan-matrix is a central data structure for pan-genomic analysis. It is a matrix with one row for each genome in the study, and one column for each gene cluster. Cell [i,j] contains an integer indicating how many members genome i has in cluster j.

The input clustering must be a named integer vector with one element for each sequence in the study, typically produced by either bClust or dClust. The name of each element is a text string identifying the sequence. The value of each element indicates the cluster, i.e. sequences with identical values are in the same cluster. IMPORTANT: The name of each sequence must contain the genome_id of the genome it belongs to, i.e. the names must be of the form GID111_seq1, GID111_seq2,... where the GIDxxx part indicates which genome the sequence belongs to. See panPrep for details.
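As an illustration of the naming requirement, here is a minimal sketch of what such an input could look like. The toy vector and the regular expression used to pull out the GIDxxx tag are assumptions made for this example; they are not necessarily what panMatrix, bClust or dClust do internally.

```r
# Hypothetical toy input: 5 sequences from 2 genomes, assigned to 3 clusters.
# Names follow the GIDxxx_seqN pattern, values are cluster labels.
clustering <- c(GID1_seq1 = 1L, GID1_seq2 = 1L, GID1_seq3 = 2L,
                GID2_seq1 = 1L, GID2_seq2 = 3L)

# Extract the genome_id tag (the text "GID" followed by digits) from each name.
# The pattern "GID[0-9]+" is an assumption about the tag format.
gid <- regmatches(names(clustering), regexpr("GID[0-9]+", names(clustering)))

# Number of sequences contributed by each genome
table(gid)
```

A name without a GIDxxx tag would simply be dropped by regmatches here, which is one reason to check that length(gid) equals length(clustering) before going further.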

The rows of the pan-matrix are named by the genome_id of each genome. The columns are named Cluster_x, where x is the integer cluster label copied from clustering.
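The structure described above can be sketched in base R on a small hypothetical input. This is only an illustration of the resulting data structure, assuming the GIDxxx tag can be matched by the pattern "GID[0-9]+"; it is not the package's own implementation, and toy_panmat is a name made up for this example.

```r
# Hypothetical toy clustering: 5 sequences, 2 genomes, 3 clusters
clustering <- c(GID1_seq1 = 1L, GID1_seq2 = 1L, GID1_seq3 = 2L,
                GID2_seq1 = 1L, GID2_seq2 = 3L)

# Genome tag for each sequence (assumed pattern, see lead-in)
gid <- regmatches(names(clustering), regexpr("GID[0-9]+", names(clustering)))

# Cross-tabulate genomes against cluster labels: one row per genome,
# one column per cluster, each cell counting the members
tab <- table(gid, clustering)

# Convert to a plain integer matrix with Cluster_x column names
toy_panmat <- matrix(as.integer(tab), nrow = nrow(tab),
                     dimnames = list(rownames(tab),
                                     paste0("Cluster_", colnames(tab))))

# Keep the input vector as an attribute, mirroring the description under Value
attr(toy_panmat, "clustering") <- clustering

toy_panmat
```

In this toy case genome GID1 has two members in Cluster_1 and one in Cluster_2, while GID2 has one member in each of Cluster_1 and Cluster_3.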

Examples

# Loading the package and the clustering data it contains
library(micropan)
data(xmpl.bclst)

# Pan-matrix based on the clustering
panmat <- panMatrix(xmpl.bclst)

# Plotting cluster distribution:
# the number of clusters found in exactly 1, 2, ..., G genomes
library(dplyr)    # provides tibble() and %>%
library(ggplot2)
tibble(Clusters = as.integer(table(factor(colSums(panmat > 0), levels = 1:nrow(panmat)))),
       Genomes = 1:nrow(panmat)) %>%
  ggplot(aes(x = Genomes, y = Clusters)) +
  geom_col()
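To make the tabulation behind the cluster distribution plot easier to follow, here is a minimal base R sketch on a small hypothetical pan-matrix. The matrix m and the variable names are made up for this illustration and need neither micropan nor ggplot2.

```r
# Hypothetical 2-genome, 3-cluster pan-matrix (filled column by column)
m <- matrix(c(2L, 1L,   # Cluster_1: present in both genomes
              1L, 0L,   # Cluster_2: present in GID1 only
              0L, 1L),  # Cluster_3: present in GID2 only
            nrow = 2,
            dimnames = list(c("GID1", "GID2"),
                            c("Cluster_1", "Cluster_2", "Cluster_3")))

# For each cluster, the number of genomes in which it occurs at least once
genomes_per_cluster <- colSums(m > 0)

# Count the clusters found in exactly 1, 2, ..., nrow(m) genomes;
# the factor levels keep genome counts with zero clusters in the result
n_clusters <- as.integer(table(factor(genomes_per_cluster, levels = 1:nrow(m))))

data.frame(Genomes = 1:nrow(m), Clusters = n_clusters)
```

Here two clusters occur in exactly one genome and one cluster occurs in both, which is the kind of table the bar chart displays.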
