ClusterMatch

<code>ClusterMatch()</code> is a tool for evaluating whether kmeans()
clustering models with similar estimated values of k identify similar
clusters. ClusterMatch() also summarizes model statistics as means for
each estimated value of k. It is designed to take files produced
by the BootKmeans() function as input, but other data can be analyzed
if the data format descriptions given below are followed
carefully.
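To make "identify similar clusters" concrete, here is a minimal sketch in base R that cross-tabulates the cluster assignments of two kmeans() runs. This is a generic comparison and is not claimed to be the method ClusterMatch() uses internally; the toy data and the names `model_a`, `model_b`, `overlap`, and `agreement` are illustrative assumptions.

```r
# Toy data: two well-separated groups of 2-D points (illustrative only).
set.seed(1)
x <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2))
rownames(x) <- paste0("seq_", seq_len(nrow(x)))

# Two independent kmeans() runs with the same k.
model_a <- kmeans(x, centers = 2, nstart = 5)
model_b <- kmeans(x, centers = 2, nstart = 5)

# Cross-tabulate the assignments. Cluster labels are arbitrary, so similar
# models show one dominant cell per row rather than matching label numbers.
overlap <- table(model_a$cluster, model_b$cluster)

# Share of items whose model_a cluster maps onto a single model_b cluster.
agreement <- sum(apply(overlap, 1, max)) / sum(overlap)
print(overlap)
print(agreement)
```

An agreement close to 1 suggests that the two models partition the data in the same way, even if the cluster numbers differ.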

Fifteen tools for bioinformatics processing and analysis of major
histocompatibility complex (MHC) data. The functions are tailored for amplicon data
sets that have been filtered using the dada2 method (for more information on
dada2, visit <https://benjjneb.github.io/dada2/>), but other types of data
sets can also be analyzed.
The ReplMatch() function matches replicates in data sets in order to evaluate
genotyping success.
The GetReplTable() and GetReplStats() functions perform such an evaluation.
The CreateFas() function creates a fasta file with all the sequences in the data
set.
The CreateSamplesFas() function creates individual fasta files for each sample in
the data set.
The DistCalc() function calculates Grantham, Sandberg, or p-distances from pairwise
comparisons of all sequences in a data set, and mean distances of all pairwise
comparisons within each sample in a data set. The function additionally outputs five
tables with physico-chemical z-descriptor values (based on Sandberg et al. 1998) for
each amino acid position in all sequences in the data set. These tables may be useful
for further downstream analyses, such as estimation of MHC supertypes.
The BootKmeans() function is a wrapper for the kmeans() function of the 'stats'
package that adds bootstrapping. Bootstrapping k-estimates may be
desirable in data sets where, e.g., plots of BIC vs. k do not produce clear
inflection points ("elbows"). BootKmeans() performs multiple runs of kmeans() and
estimates optimal k-values based on a user-defined threshold of BIC reduction. The
method is an automated and bootstrapped version of visually inspecting elbow plots
of BIC vs. k.
The ClusterMatch() function is a tool for evaluating whether different kmeans()
clustering models identify similar clusters, and it summarizes bootstrap model stats as
means for different estimated values of k. It is designed to take files produced by
the BootKmeans() function as input, but other data can be analyzed if the
descriptions of the required data formats are followed carefully.
The PapaDiv() function compares parent pairs in the data set and calculates their
joint MHC diversity, taking into account sequence variants that occur in both
parents.
The HpltFind() function infers putative haplotypes from families in the data
set.
The GetHpltTable() and GetHpltStats() functions evaluate the accuracy of
the haplotype inference.
The CreateHpltOccTable() function creates a binary (logical) haplotype-sequence
occurrence matrix from the output of HpltFind(), for easy overview of which
sequences are present in which haplotypes.
The HpltMatch() function compares haplotypes to help identify overlapping and
potentially identical types.
The NestTablesXL() function translates the output from HpltFind() to an Excel
workbook that provides a convenient overview for evaluation and curation of the
inferred putative haplotypes.
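The idea of picking k from a threshold of BIC reduction, as described for BootKmeans() above, can be sketched in base R as follows. This is only an illustration under stated assumptions: the BIC approximation (total within-cluster sum of squares plus log(n) * m * k) and the selection rule are common choices assumed here, and the actual BootKmeans() implementation may differ. The names `kmeans_bic`, `threshold`, and `k_est` are hypothetical.

```r
# Toy data: three well-separated groups in two dimensions (illustrative only).
set.seed(42)
x <- rbind(matrix(rnorm(60, mean = 0,  sd = 0.5), ncol = 2),
           matrix(rnorm(60, mean = 6,  sd = 0.5), ncol = 2),
           matrix(rnorm(60, mean = 12, sd = 0.5), ncol = 2))

# Assumption: a commonly used BIC approximation for k-means,
# BIC = total within-cluster SS + log(n) * m * k (n points, m dimensions).
kmeans_bic <- function(fit, n, m) {
  fit$tot.withinss + log(n) * m * nrow(fit$centers)
}

k_values <- 1:6
bic <- vapply(k_values, function(k) {
  kmeans_bic(kmeans(x, centers = k, nstart = 50), n = nrow(x), m = ncol(x))
}, numeric(1))

# Reduction in BIC gained by each step from k to k + 1.
bic_drop <- -diff(bic)

# Hypothetical rule: stop at the first k where adding one more cluster
# reduces BIC by less than `threshold` times the largest observed reduction.
threshold <- 0.05
k_est <- k_values[which(bic_drop < threshold * max(bic_drop))[1]]
print(k_est)
```

Repeating such a procedure over many runs and tabulating the resulting `k_est` values is the kind of bootstrapped elbow inspection the description refers to.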

Jacob Roved

MHCtools

Analysis of MHC Data in Non-Model Species

ClusterMatch function

<dl><dt>filepath</dt>
<dd>a user-defined path to a folder that contains the set of
K-cluster files to be matched against each other. The algorithm will attempt
to load all files in the folder, so it should contain only the relevant
K-cluster files. If the clusters were generated using the BootKmeans()
function, such a folder (named Clusters) was created by the algorithm in the
output path given by the user.
Each K-cluster file should correspond to the cluster component of a kmeans()
model (model$cluster) saved as a .Rdata file. Such files are generated as part
of the output from BootKmeans(). ClusterMatch() assumes that the file names
contain the string "model_" followed by a model number, which must match the
corresponding row number in k_summary_table. If the data were generated with the
BootKmeans() function, the formats and numbers will match by default.</dd>
<dt>path_out</dt>
<dd>a user-defined path to the folder where the output files will
be saved.</dd>
<dt>k_summary_table</dt>
<dd>a data frame summarizing the stats of the kmeans()
models that produced the clusters in the K-cluster files. If the data
were generated with the BootKmeans() function, a compatible
k_summary_table was produced in the output path with the file name
"k_means_bootstrap_summary_stats_&lt;date&gt;.csv".
If other data are analyzed, please observe these formatting requirements:
The k_summary_table must contain one row per kmeans() model
and, at a minimum, the following columns:
- k-value (colname: k.est)
- residual total within sums-of-squares (colname: Tot.withinss.resid)
- residual AIC (colname: AIC.resid)
- residual BIC (colname: BIC.resid)
- delta BIC/max BIC (colname: prop.delta.BIC)
- delta BIC/k.est (colname: delta.BIC.over.k)
It is crucial that the models have the same numbers in the K-cluster file
names and in the k_summary_table, and that the rows of the table are ordered
by model number.</dd></dl>
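When the input was not produced by BootKmeans(), it can help to check the format requirements above before calling ClusterMatch(). The following base R sketch does that for a toy folder. The helper `check_clustermatch_input()` is hypothetical and not part of MHCtools, the file names (`model_1.Rdata`, ...) and the saved object name `clusters` are assumptions, and all numbers in the example table are made up.

```r
# Hypothetical helper: check that a k_summary_table and a folder of
# K-cluster files follow the format described above.
check_clustermatch_input <- function(filepath, k_summary_table) {
  required <- c("k.est", "Tot.withinss.resid", "AIC.resid", "BIC.resid",
                "prop.delta.BIC", "delta.BIC.over.k")
  missing_cols <- setdiff(required, colnames(k_summary_table))

  # Extract the model number that follows "model_" in each file name.
  files <- list.files(filepath)
  has_tag <- grepl("model_[0-9]+", files)
  model_no <- as.integer(sub(".*model_([0-9]+).*", "\\1", files[has_tag]))

  list(
    missing_columns = missing_cols,
    untagged_files = files[!has_tag],
    numbers_match_rows = setequal(model_no, seq_len(nrow(k_summary_table)))
  )
}

# Toy demonstration in a temporary folder (all values are made up).
demo_dir <- file.path(tempdir(), "Clusters_demo")
dir.create(demo_dir, showWarnings = FALSE)
set.seed(1)
x <- matrix(rnorm(60), ncol = 2)
for (i in 1:3) {
  clusters <- kmeans(x, centers = 2)$cluster  # the model$cluster object
  save(clusters, file = file.path(demo_dir, paste0("model_", i, ".Rdata")))
}
k_summary_table <- data.frame(
  k.est = c(2, 2, 2), Tot.withinss.resid = c(0.1, 0.2, 0.1),
  AIC.resid = c(1, 2, 1), BIC.resid = c(1, 2, 1),
  prop.delta.BIC = c(0.01, 0.02, 0.01), delta.BIC.over.k = c(0.5, 1, 0.5)
)
result <- check_clustermatch_input(demo_dir, k_summary_table)
str(result)
```

If `missing_columns` and `untagged_files` are empty and `numbers_match_rows` is TRUE, the folder and table satisfy the naming and column requirements listed above. The check does not verify that the table rows are ordered by model number.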

