resolveLightChains: Define subgroups within clones based on light chain rearrangements

Description

resolveLightChains resolve light chain V and J subgroups within a clone

Usage

resolveLightChains(
  data,
  nproc = 1,
  minseq = 1,
  locus = "locus",
  heavy = "IGH",
  id = "sequence_id",
  seq = "sequence_alignment",
  clone = "clone_id",
  cell = "cell_id",
  v_call = "v_call",
  j_call = "j_call",
  junc_len = "junction_length",
  nolight = "missing"
)

Value

a tibble containing the same data as inputting, but with the column clone_subgroup added. This column contains subgroups within clones that contain distinct light chain V and J genes, with at most one light chain per cell.

Arguments

data: a tibble containing heavy and light chain sequences with clone_id
nproc: number of cores for parallelization
minseq: minimum number of sequences per clone
locus: name of column containing locus values
heavy: value of heavy chains in locus column. All other values will be treated as light chains
id: name of the column containing sequence identifiers.
seq: name of the column containing observed DNA sequences. All sequences in this column must be multiple aligned.
clone: name of the column containing the identifier for the clone. All entries in this column should be identical.
cell: name of the column containing identifier for cells.
v_call: name of the column containing V-segment allele assignments. All entries in this column should be identical to the gene level.
j_call: name of the column containing J-segment allele assignments. All entries in this column should be identical to the gene level.
junc_len: name of the column containing the length of the junction as a numeric value. All entries in this column should be identical for any given clone.
nolight: string to use to indicate a missing light chain

Details

1. Make temporary array containing light chain clones 2. Enumerate all possible V, J, and junction length combinations 3. Determine which combination is the most frequent 4. Assign sequences with that combination to clone t 5. Copy those sequences to return array 6. Remove all cells with that combination from temp array 7. Repeat 1-6 until temporary array zero. If there is more than rearrangement with the same V/J in the same cell, pick the one with the highest non-ambiguous characters. Cells with missing light chains are grouped with their subgroup with the closest matching heavy chain (Hamming distance) then the largest and lowest index subgroup if ties are present.

Outputs of the function are 1. clone_subgroup which identifies the light chain VJ rearrangement that sequence belongs to within it's clone 2. clone_subgroup_id which combines the clone_id variable and the clone_subgroup variable by a "_". 3. vj_cell which combines the vj_gene and vj_alt_cell columns by a ",".