Decoding Immune Repertoires: A Guide to the ClonalTree MST Algorithm for B Cell Lineage Inference

Naomi Price Jan 09, 2026 129

This article provides a comprehensive guide to the ClonalTree minimum spanning tree (MST) algorithm for reconstructing B cell lineage trees from high-throughput sequencing data.

Decoding Immune Repertoires: A Guide to the ClonalTree MST Algorithm for B Cell Lineage Inference

Abstract

This article provides a comprehensive guide to the ClonalTree minimum spanning tree (MST) algorithm for reconstructing B cell lineage trees from high-throughput sequencing data. Targeting researchers and drug development professionals, we cover the foundational principles of B cell somatic hypermutation and lineage tracing, detail the methodological steps of the ClonalTree algorithm from data preprocessing to tree visualization, address common troubleshooting and parameter optimization challenges, and validate its performance against alternative methods like neighbor-joining and maximum parsimony. The article concludes by synthesizing key takeaways and discussing the algorithm's implications for vaccine design, monoclonal antibody discovery, and autoimmune disease research.

Understanding B Cell Evolution: The Need for Lineage Trees and the Role of Minimum Spanning Trees

Affinity maturation is the process by which B cells increase their antigen-binding affinity through iterative rounds of somatic hypermutation (SHM) and selection in germinal centers. Within the context of B cell lineages research, the ClonalTree minimum spanning tree (MST) algorithm provides a computational framework for reconstructing these evolutionary lineages from high-throughput B cell receptor (BCR) sequencing data. This allows researchers to trace the mutational trajectories and selection forces that underpin antibody optimization, a critical area for therapeutic antibody and vaccine development.

Key Concepts & Quantitative Data

Table 1: Key Metrics in Somatic Hypermutation and Affinity Maturation

Metric Typical Range/Value Significance in Lineage Analysis
SHM Rate (per bp per division) ~10⁻³ to 10⁻⁴ Drives diversity within clonal families; higher rates increase exploration of sequence space.
Antigen Affinity (KD) Improvement 10x to 10,000x fold Quantifies functional outcome of maturation; key parameter for therapeutic candidate selection.
Germinal Center Residence Time ~1-3 weeks Duration of iterative selection; influences depth of maturation.
Lineage Tree Size (ClonalTree MST) 10s to 1000s of nodes Reflects clonal expansion and diversification; larger trees suggest robust immune response.
Mutation Frequency in V-region 2-20% nucleotide change Used to infer phylogenetic relationships and selection pressure.
Key Transcription Factor (AID) Expression Variable (assay-dependent) Essential for initiating SHM; expression levels correlate with mutation activity.

Experimental Protocols

Protocol 1: Longitudinal BCR Repertoire Sequencing for Lineage Tracing Objective: To capture the evolving BCR repertoire from immunized subjects or in vitro cultures for phylogenetic lineage reconstruction using the ClonalTree MST algorithm.

  • Sample Collection: Collect B cells from germinal centers (GCs), peripheral blood, or in vitro culture at multiple time points (e.g., days 7, 14, 21 post-immunization).
  • RNA Extraction & cDNA Synthesis: Isolate total RNA. Synthesize cDNA using primers specific for IgG constant regions or multiplex primers for all BCR isotypes.
  • BCR Amplification & Sequencing: Perform nested PCR to amplify the variable heavy (VH) and light (VL) chain regions. Use unique molecular identifiers (UMIs) to correct for PCR and sequencing errors. Sequence on a high-throughput platform (e.g., Illumina MiSeq/Novaseq).
  • Bioinformatic Processing: a. Pre-processing: Demultiplex reads, cluster by UMI, and generate consensus sequences. b. Annotation: Align V, D, J genes and identify complementarity-determining regions (CDRs). c. Clonal Grouping: Group sequences into clonal families based on shared V/J genes and highly similar CDR3 sequences. d. Lineage Reconstruction: For each clonal family, input the aligned nucleotide sequences into the ClonalTree MST algorithm. This algorithm constructs a minimum spanning tree where nodes represent unique BCR sequences and edges represent mutational distance, inferring the most parsimonious evolutionary pathway.
  • Analysis: Map SHM locations, calculate replacement-to-silent (R/S) ratios in CDRs vs. framework regions (indicative of positive selection), and correlate tree topology with antigen affinity measurements.

Protocol 2: In Vitro Affinity Maturation and Selection Objective: To mimic germinal center selection for generating high-affinity antibodies.

  • Library Construction: Create a mutant library of the antibody gene of interest via error-prone PCR or site-saturation mutagenesis focused on the CDRs.
  • Display Technology: Clone the library into a display system (phage, yeast, or mammalian cell surface).
  • Panning/Selection: a. Incubate the display library with immobilized target antigen. b. Wash away unbound/low-affinity variants. c. Elute specifically bound high-affinity variants. d. Amplify eluted populations for the next round (typically 3-5 rounds with increasing stringency).
  • Characterization: Isolate single clones, express soluble antibodies, and determine binding affinity (KD) via surface plasmon resonance (SPR) or bio-layer interferometry (BLI).
  • Lineage Analysis: Sequence selected clones across rounds and reconstruct phylogenetic trees using ClonalTree MST to visualize convergent mutations and evolutionary paths leading to high affinity.

Visualizations

SHM_Pathway GC_B_Cell Activated B Cell Enters Germinal Center AID_Expression AID Expression Induced by Tfh Cell Signals GC_B_Cell->AID_Expression SHM_Event Somatic Hypermutation in V-region DNA AID_Expression->SHM_Event BCR_Variant Variant BCR on Cell Surface SHM_Event->BCR_Variant FDC_Selection Selection by FDC-presented Antigen BCR_Variant->FDC_Selection Tfh_Selection Positive Selection by Tfh Cells FDC_Selection->Tfh_Selection Outcomes Outcome Tfh_Selection->Outcomes Apoptosis Apoptosis (Low Affinity) Outcomes->Apoptosis Fail Recycle Recycle for Further SHM Outcomes->Recycle Pass Exit Exit as Plasma/Memory Cell Outcomes->Exit High Affinity Recycle->SHM_Event Next Cycle

Diagram 1: Germinal Center SHM and Selection Pathway

ClonalTree_Workflow Raw_BCR_Seqs Raw BCR Sequencing Data Preprocess Pre-processing & Clonal Grouping Raw_BCR_Seqs->Preprocess Aligned_Clones Aligned Sequences per Clonal Family Preprocess->Aligned_Clones Dist_Matrix Compute Mutational Distance Matrix Aligned_Clones->Dist_Matrix MST_Algo Apply Minimum Spanning Tree Algorithm Dist_Matrix->MST_Algo ClonalTree Inferred ClonalLineage Tree MST_Algo->ClonalTree Analysis Analysis: SHM Maps, R/S, Selection ClonalTree->Analysis

Diagram 2: BCR Lineage Analysis with ClonalTree MST

The Scientist's Toolkit

Table 2: Essential Research Reagent Solutions

Item Function/Application
Activation-Induced Cytidine Deaminase (AID) Inhibitor (e.g., HM13C) Chemically inhibits AID activity in in vitro or ex vivo cultures to establish SHM-negative controls and study AID's specific role.
Recombinant IL-4 & IL-21 Cytokines Key Tfh-derived cytokines used in in vitro GC cultures to promote B cell proliferation, AID expression, and plasma cell differentiation.
Anti-CD40 Agonist Antibody Mimics T cell help (CD40L signaling) in in vitro B cell culture systems, essential for survival and activation during affinity maturation assays.
Streptavidin-conjugated Magnetic Beads For panning and selection steps in display technologies (e.g., phage display) when using biotinylated antigen. Enables rapid separation of antigen-bound clones.
Unique Molecular Identifier (UMI) Kits for BCR Seq Allows accurate error correction and quantitation of initial BCR transcripts during library prep for high-resolution lineage tracing.
Polymerases for Error-Prone PCR (e.g., Mutazyme II) Used to generate diverse mutant antibody libraries for in vitro affinity maturation by introducing controlled random mutations.
Fluorescently-labeled Antigen (e.g., Antigen-FITC) Enables fluorescence-activated cell sorting (FACS) of high-affinity B cells or display clones based on antigen-binding signal intensity.
B Cell Isolation Kits (Negative Selection) For obtaining pure, untouched primary B cell populations from mouse/human tissues for functional studies and in vitro cultures.

Within the broader thesis on B cell receptor (BCR) repertoire analysis, the ClonalTree minimum spanning tree (MST) algorithm represents a critical methodology for inferring phylogenetic relationships among somatically hypermutated B cell sequences. This protocol details the application of ClonalTree for defining clonal families and reconstructing putative germline ancestral nodes, enabling researchers to trace lineage development in vaccine response, autoimmunity, and B-cell lymphoma.

Core Principles & Quantitative Benchmarks

Table 1: Key Algorithmic Parameters and Their Impact on Clonal Family Definition

Parameter Typical Range Functional Impact Recommended Starting Value
Distance Threshold (V/J gene & CDR3) 0.10 - 0.20 Lower values increase specificity, reducing false clonal assignments. 0.15
MST Construction Metric (e.g., Hamming, Jukes-Cantor) N/A Jukes-Cantor corrects for multiple substitutions; better for deep lineages. Jukes-Cantor
Support Threshold for Ancestral Node Calling 70% - 90% Bootstrap Higher thresholds increase confidence in inferred intermediates. 80%
Minimum Clone Size (Sequences) 3 - 10 Filters noisy, singlet sequences from analysis. 5

Table 2: Expected Output Metrics from a Typical Human BCR Repertoire Dataset (10⁶ reads)

Output Metric Average Yield Significance for Drug Development
Number of Clonal Families Identified 5,000 - 20,000 Identifies dominant lineages for therapeutic targeting.
Average Intra-clonal Diversity (Nucleotide) 2% - 15% Measures antigen-driven selection pressure.
Inferred Ancestral Nodes per Major Clone 3 - 20 Maps mutation pathways; reveals key intermediates.
Lineages with Evidence of Convergence 1% - 5% of clones Highlights public, potentially protective antibody responses.

Detailed Protocol: From Raw Sequences to Clonal Trees

Protocol 3.1: Pre-processing and Clonal Grouping

Objective: To cluster raw IgH sequences into initial clonal families based on V/J gene identity and CDR3 similarity.

  • Input: Paired-end FASTQ files from BCR repertoire sequencing (e.g., Illumina MiSeq).
  • Alignment & Assembly: Use toolkits (e.g., MiXCR or IMGT/HighV-QUEST) to align reads to germline V, D, J genes. Assemble complete V(D)J transcripts.
  • Error Correction: Apply a clustering-based correction (e.g., using UMIs) to eliminate PCR and sequencing errors.
  • Clonal Clustering:
    • Group sequences with identical V and J gene assignments.
    • Within each V-J group, perform single-linkage clustering based on normalized Hamming distance of CDR3 nucleotide sequences.
    • Critical Step: Apply the distance threshold (0.15) from Table 1. Sequences within this threshold are considered clonally related.
  • Output: A list of clonal families, each with a unique identifier and member sequences.

Protocol 3.2: Construction of Minimum Spanning Trees with ClonalTree

Objective: To infer the most parsimonious evolutionary relationships within each clonal family.

  • Input: The nucleotide FASTA file for a single clonal family from Protocol 3.1.
  • Multiple Sequence Alignment (MSA): Align all family members using a specialized Ig aligner (Clustal Omega with IGH domain parameters).
  • Distance Matrix Calculation: Compute a pairwise genetic distance matrix using the Jukes-Cantor model to account for multiple hits.
  • MST Construction via ClonalTree Algorithm:
    • Initialize the tree with the sequence showing the least total distance to all others (putative closest to germline).
    • Iteratively add the next sequence that has the minimum distance to any node already in the tree.
    • Do not allow cycles, enforcing a true tree structure.
  • Ancestral Node Inference:
    • For each internal node (branch point) in the MST, infer the putative ancestral sequence by taking the consensus of all descendant leaves.
    • Perform bootstrapping (1000x resampling of alignment columns) to assign confidence to each ancestral node.
  • Output: A minimum spanning tree file (Newick format) with annotated internal nodes representing inferred ancestors.

Protocol 3.3: Validation and Downstream Analysis

Objective: To validate the biologically plausibility of inferred trees and extract meaningful data.

  • Lineage Temporal Ordering: If longitudinal samples are available, map sampling time points onto tree leaves. Validate that earlier samples occupy positions closer to the inferred root (p < 0.05, Mann-Whitney U test).
  • Selection Pressure Analysis: Apply BASELINe or dN/dS models to branches of the MST to quantify positive/negative selection.
  • Convergence Detection: Compare CDR3 amino acid motifs across independent clonal families from different subjects to identify public antibody responses.

Visualization of Workflows and Relationships

G Start Raw BCR Sequencing Reads PP Pre-processing (Alignment, Error Correction) Start->PP Cluster Clonal Clustering (V/J Identity + CDR3 Distance) PP->Cluster FamilyList List of Clonal Families Cluster->FamilyList MST Per-Family MST Construction (ClonalTree Algorithm) FamilyList->MST Ancestral Ancestral Node Inference (Bootstrapped Consensus) MST->Ancestral TreeOut Annotated Lineage Trees & Ancestral Sequences Ancestral->TreeOut Analysis Downstream Analysis (Selection, Convergence) TreeOut->Analysis

Title: BCR Lineage Analysis with ClonalTree

G GERM Inferred Germline Ancestor Node ANC1 Ancestral Node A (80% Bootstrap) GERM->ANC1  Mutation α M4 Mature Sequence 4 (Sample T2) GERM->M4  Mutation β ANC2 Ancestral Node B (95% Bootstrap) ANC1->ANC2  Mutation γ M3 Mature Sequence 3 (Sample T2) ANC1->M3  Mutation δ M1 Mature Sequence 1 (Sample T1) ANC2->M1  Mutation ε M2 Mature Sequence 2 (Sample T1) ANC2->M2  Mutation ζ

Title: MST with Inferred Ancestral Nodes

The Scientist's Toolkit: Key Research Reagent Solutions

Table 3: Essential Materials and Reagents for BCR Lineage Analysis

Item & Supplier Function in Protocol Critical Parameters/Notes
MiSeq Reagent Kit v3 (600-cycle) (Illumina) Provides sequencing depth and read length sufficient for full IgH V(D)J amplification. Enables 2x300bp paired-end reads. Minimum 10⁵ reads/sample recommended.
NEXTflex BCR V(D)J Amplicon-Seq Kit (Bioo Scientific) Multiplex PCR primers for amplifying rearranged human or mouse IgH loci. Includes UMIs. Incorporates Unique Molecular Identifiers (UMIs) for absolute quantification and error correction.
IMGT/HighV-QUEST Web Service (IMGT) Gold-standard online tool for immunoglobulin sequence alignment and annotation. Critical for accurate V, D, J gene assignment. Batch submission possible.
Clustal Omega (IGH Profile) (EMBL-EBI) Multiple sequence alignment software configured for immunoglobulin domains. Maintains correct reading frame and codon boundaries for CDR analysis.
ClonalTree Software Package (GitHub Repository) Custom minimum spanning tree algorithm for BCR lineage reconstruction. Requires input of aligned FASTA. Outputs Newick trees and consensus ancestors.
IgBLAST (NCBI) Alternative local alignment and lineage analysis tool. Can be integrated into automated pipelines for high-throughput analysis.

Application Notes: The ClonalTree MST Framework in B Cell Lineage Research

The analysis of B cell receptor (BCR) repertoire sequencing data to infer clonal lineages is a central problem in immunology. Somatic hypermutation (SHM) and antigen-driven selection create a phylogenetic relationship among B cells originating from a common ancestor. The ClonalTree algorithm employs a Minimum Spanning Tree (MST) approach to reconstruct these lineages, providing a computationally efficient and biologically intuitive solution.

Why MST is a Natural Fit:

  • Sparse Mutation Networks: The genetic distance between BCR sequences within a clone is typically small, with pairwise differences (Hamming distance) representing observed mutations. An MST finds the simplest graph (no cycles) that connects all sequences with the minimum total edge weight (mutational distance), efficiently recovering the most parsimonious evolutionary history.
  • Handling Convexity: The set of sequences in a clonal lineage often forms a "convex" set in sequence space, where any node on the shortest path between two clone members is also a clone member. The MST of a convex set is a subset of its Delaunay triangulation, making it robust for lineage detection.
  • Computational Scalability: For large-scale repertoire sequencing datasets (10^5 - 10^6 sequences), traditional phylogenetic methods (e.g., maximum likelihood) are prohibitively slow. MST construction, with algorithms like Prim's or Kruskal's (O(E log V)), is highly scalable.
  • Foundation for Refinement: The MST serves as an excellent backbone for further refinement. Potential cycles caused by convergent mutations or hidden intermediates can be identified and resolved, moving towards a more accurate phylogenetic model.

Quantitative Performance Metrics: Recent benchmarking studies compare ClonalTree (MST-based) with other lineage inference tools. Key metrics are summarized below:

Table 1: Benchmarking of B Cell Lineage Inference Algorithms (Simulated Data)

Algorithm Core Method Average Precision Average Recall Time per Clone (s) Handles Large Clones (>100 seq)
ClonalTree (MST) Minimum Spanning Tree 0.92 0.88 0.05 Yes
PhyloTree Maximum Parsimony 0.95 0.85 12.7 No
LineageIG Network Inference 0.89 0.91 1.2 Marginal
GLIPH2 Motif Clustering 0.65 0.95 0.01 Yes

Table 2: Application to Real Repertoire Data (COVID-19 Convalescent Patients)

Patient Cohort Total Sequences Clones Identified (MST) Avg. Clone Size Max Mutations from Root Convergent Motifs Found
Severe (n=5) 452,117 18,542 24.4 18 12
Mild (n=5) 498,334 22,107 22.5 15 5

Experimental Protocols

Protocol 1: BCR Repertoire Sequencing and Preprocessing for ClonalTree Input

Objective: Generate high-quality BCR heavy-chain (IGH) sequence data from PBMCs suitable for clonal lineage inference.

Materials: See "Scientist's Toolkit" below. Workflow:

  • PBMC Isolation: Isolate peripheral blood mononuclear cells (PBMCs) from whole blood via density gradient centrifugation (Ficoll-Paque).
  • B Cell Enrichment: Enrich CD19+ or CD20+ B cells using magnetic-activated cell sorting (MACS) beads.
  • RNA Extraction & cDNA Synthesis: Extract total RNA. Perform reverse transcription using primers specific for the IGH constant region.
  • Multiplex PCR Amplification: Amplify rearranged IGH genes using a multiplex primer set covering the V and J gene segments. Include unique molecular identifiers (UMIs) during cDNA synthesis or initial PCR cycles to correct for PCR errors and duplicates.
  • High-Throughput Sequencing: Perform paired-end sequencing (2x300bp MiSeq or 2x150bp NovaSeq) on the amplified libraries.
  • Bioinformatic Preprocessing:
    • Demultiplex & Merge Reads: Use tools like pRESTO or MiGEC for UMI-aware read merging and error correction.
    • Gene Assignment: Align sequences to IMGT reference databases using IgBLAST or Change-O to assign V, D, J genes and identify CDR3 regions.
    • Clone Definition: Group sequences into initial clonal clusters based on identical V/J gene assignments and CDR3 nucleotide sequence (100% identity). This forms the initial node set for MST analysis.
    • Format Data: Create a TSV file with columns for: sequence_id, clone_id, v_gene, j_gene, cdr3_nt, consensus_sequence.

Protocol 2: Running the ClonalTree MST Algorithm

Objective: Construct minimum spanning trees for each pre-defined clonal cluster.

Software: ClonalTree (available on GitHub: github.com/immunogenomics/clonaltree). Dependencies: Python 3.8+, SciPy, NumPy, Biopython. Input: Preprocessed TSV file from Protocol 1, Step 6. Procedure:

  • Installation: pip install clonaltree
  • Distance Matrix Calculation: For each clone, ClonalTree computes a pairwise Hamming distance matrix between all unique consensus sequences.

  • MST Construction: For each clone, build the MST using Prim's algorithm on the distance matrix. The root is automatically inferred as the node with the minimum total distance to all others (the putative germline sequence).

  • Cycle Resolution (Optional): If the initial graph contains cycles due to homoplasy, apply the refine module to break cycles by removing the highest-weight edge in each cycle, prioritizing tree parsimony.

  • Output: The algorithm generates a GraphML or JSON file for each clonal tree, annotated with node sequences, mutation counts, and edge weights.

Protocol 3: Validating and Interpreting MST Lineages

Objective: Biologically validate inferred clonal lineages and extract meaningful features.

Procedure:

  • Lineage Visualization: Use ClonalTree's built-in plot module or Graphviz to render key large or interesting trees. Color nodes by sample timepoint, cell phenotype (if single-cell linked), or mutation load.
  • Convergent Motif Analysis: Extract CDR3 amino acid sequences from expanding terminal branches across multiple clones/patients. Use GLIPH2 or TcRdist to identify shared specificity motifs.
  • Selection Pressure Analysis: Apply selection models (e.g., BASELINe, dN/dS ratio) to branches of the MST to quantify antigen-driven selection in framework vs. CDR regions.
  • Experimental Validation:
    • Synthetic Biology: Clone representative BCRs from key nodes (root, intermediates, dominant leaves) into expression vectors.
    • Binding Assays: Express as monoclonal antibodies and test binding affinity (ELISA, SPR) to putative antigens (e.g., SARS-CoV-2 spike protein).
    • Lineage Confirmation: Linkage through single-cell BCR sequencing paired with transcriptomics from the same sample provides ground truth for validating computationally inferred trees.

Visualizations

G start PBMC Isolation & B Cell Enrichment A RNA Extraction & cDNA Synthesis (w/ UMIs) start->A B Multiplex PCR (IGH V-J Regions) A->B C High-Throughput Sequencing B->C D Bioinformatic Preprocessing C->D E Clone Definition by V/J + CDR3 Identity D->E F ClonalTree MST Construction E->F G Lineage Analysis & Validation F->G

Title: Experimental workflow for BCR lineage analysis

G Root Germline Root Int1 Root->Int1 2 Int2 Root->Int2 1 Int3 Int1->Int3 2 Leaf1 Seq A (5 mut) Int1->Leaf1 1 Leaf2 Seq B (7 mut) Int2->Leaf2 4 Leaf3 Seq C (4 mut) Int2->Leaf3 2 Leaf4 Seq D (8 mut) Int2->Leaf4 3 Int3->Leaf4 3 Leaf5 Seq E (6 mut) Int3->Leaf5 2

Title: MST construction and cycle resolution in a B cell clone

The Scientist's Toolkit

Table 3: Essential Research Reagents & Materials

Item Function/Application Example Product/Catalog
Ficoll-Paque PLUS Density gradient medium for PBMC isolation from whole blood. Cytiva, 17144002
CD19/CD20 MicroBeads Magnetic beads for positive selection of human B cells. Miltenyi Biotec, 130-050-301/130-091-104
UMI-linked RT Primers Primers containing Unique Molecular Identifiers for accurate sequence deduplication and error correction. Custom synthesized (e.g., IDT)
IGH Gene Primer Sets Multiplex primer pools for amplification of rearranged human IGH genes. ArcherDx, Illumina TCR/BCR kits
High-Fidelity DNA Polymerase PCR enzyme with low error rate for accurate amplification of BCR sequences. Q5 Hot-Start (NEB, M0493S)
MiSeq/NovaSeq Reagents Sequencing kits for high-throughput paired-end sequencing of amplicon libraries. Illumina, MS-102-2003/20012866
pRESTO/Change-O Suite Open-source software toolkit for processing raw BCR-seq reads. https://presto.readthedocs.io
ClonalTree Software Python package for MST-based B cell lineage inference. https://github.com/immunogenomics/clonaltree
Graphviz Software Open-source tool for visualizing graphs and trees from ClonalTree output. https://graphviz.org

1. Introduction Within B cell lineage reconstruction research, a core hypothesis posits that the true evolutionary tree connecting members of a clonal family is the one that requires the fewest somatic hypermutations (SHMs), given the observed immunoglobulin (Ig) sequences. This principle of maximum parsimony, operationalized through the measurement of Hamming or phylogenetic mutation distances, forms the foundation of algorithms like ClonalTree, which constructs a Minimum Spanning Tree (MST) to infer lineage relationships. This document details the application notes and experimental protocols for validating this hypothesis, framed within a thesis on MST algorithms for B cell immunology and therapeutic discovery.

2. Quantitative Data Summary: Lineage Tree Metrics The following table summarizes key quantitative metrics used to evaluate lineage trees reconstructed under the parsimony hypothesis.

Table 1: Comparative Metrics for Lineage Tree Reconstruction Algorithms

Metric Definition Typical Range (Optimal) Interpretation in ClonalTree Context
Total Tree Length Sum of mutation counts on all tree branches. Minimized (Parsimonious) Direct measure of the parsimony principle; ClonalTree's MST aims for the global minimum.
Pairwise Distance Correlation Correlation between patristic (tree path) distance and observed Hamming distance. R²: 0.85 - 1.0 (High) Validates that the tree accurately reflects pairwise sequence divergence.
Consistency Index (CI) (Minimum possible tree length) / (Observed tree length). 0.0 - 1.0 (High) Measures homoplasy (convergent mutations); a high CI supports the parsimony assumption.
Germline Recovery Accuracy % similarity of inferred root sequence to true/consensus germline. 95% - 100% (High) Tests the algorithm's ability to correctly identify the unmutated ancestor.
Runtime Complexity Computational time relative to input size (n sequences). ~O(n² log n) Practical feasibility for large-scale repertoire sequencing (Rep-Seq) data.

3. Core Experimental Protocol: Validating ClonalTree Parsimony This protocol outlines the steps to generate and analyze a B cell clonal lineage using the ClonalTree MST algorithm.

A. Input Data Preparation

  • Objective: Isolate a clonal family from bulk Rep-Seq data.
  • Procedure:
    • Sequence Alignment & Annotation: Process paired-end Ig heavy-chain (IGH) reads through a tool like IMGT/HighV-QUEST or pRESTO. Assign V, D, J genes and identify the Complementarity-Determining Region 3 (CDR3).
    • Clonal Grouping: Cluster sequences into clonal families based on identical V/J gene assignments and >85% CDR3 amino acid identity.
    • Multiple Sequence Alignment (MSA): For a selected clone, perform a nucleotide MSA of the V(D)J region using MUSCLE or MAFFT. Visually inspect and trim to a consistent region.

B. Lineage Inference with ClonalTree

  • Objective: Reconstruct the most parsimonious lineage tree.
  • Procedure:
    • Compute Pairwise Distance Matrix: Calculate the Hamming distance (mismatch count) for all sequence pairs in the MSA.
    • Construct Minimum Spanning Tree: Apply Prim's or Kruskal's algorithm to the distance matrix to find the MST (ClonalTree).
    • Root the Tree: Designate the sequence with the minimum total distance to all other nodes (or the germline sequence if known) as the tree root.
    • Ancestral State Reconstruction: For each internal node, infer the most likely ancestral sequence by using the Fitch algorithm to minimize mutations along branches.

C. Validation & Analysis

  • Objective: Assess the biological plausibility and parsimony of the reconstructed tree.
  • Procedure:
    • Calculate Tree Metrics: Compute the metrics in Table 1 for the ClonalTree.
    • Benchmarking: Compare against trees generated by maximum likelihood (e.g., IgPhyML) or neighbor-joining methods.
    • SHM Pattern Analysis: Map mutations onto the tree branches. Check for expected patterns (e.g., increased mutations in CDRs vs. framework regions).
    • Convergence Test: Search for homoplastic mutations (identical changes on independent branches) which may challenge strict parsimony.

workflow Input Bulk BCR Rep-Seq Data Align 1. Alignment & Clonal Clustering Input->Align MSA 2. Multiple Sequence Alignment (MSA) Align->MSA Matrix 3. Compute Pairwise Hamming Distance Matrix MSA->Matrix MST 4. Construct Minimum Spanning Tree Matrix->MST Root 5. Root Tree & Infer Ancestors MST->Root Output Parsimonious Lineage Tree with Ancestral States Root->Output

Diagram Title: ClonalTree MST Reconstruction Workflow

4. The Scientist's Toolkit: Research Reagent Solutions

Table 2: Essential Reagents & Tools for B Cell Lineage Analysis

Item / Reagent Provider / Example Primary Function in Protocol
5' RACE or V(D)J Primers SMARTer Human BCR Kit (Takara), Lymphotrack (Invivoscribe) Amplification of full-length Ig transcripts from B cell RNA for Rep-Seq.
High-Fidelity Polymerase Kapa HiFi, Q5 (NEB) Accurate PCR amplification to minimize introduced sequencing errors.
Next-Generation Sequencer Illumina MiSeq/NextSeq, PacBio Sequel High-throughput generation of Ig sequence reads.
BCR Analysis Pipeline pRESTO, Change-O, Immcantation End-to-end computational processing of raw reads to annotated clones.
MSA & Phylogenetic Tool MUSCLE, MAFFT, PhyloPhlAn Creation of sequence alignments and tree building.
Lineage Tree Visualization ggtree (R), ETE3 (Python), Graphviz Rendering and annotation of inferred phylogenetic trees.
Synthetic B Cell Clone Standards Spike-in control plasmids with known lineages Validation of reconstruction accuracy and algorithm benchmarking.

5. Advanced Protocol: Integrating Selection Pressure Analysis To test if parsimony-based trees reflect functional selection, integrate positive selection analysis.

  • Procedure:
    • Using the ClonalTree topology and ancestral sequences, run the BASELINe algorithm on the branches.
    • Calculate the differential selection between CDR and FWR for each branch.
    • Visualize selection strength (sigma) mapped onto the ClonalTree topology.

selection Tree ClonalTree Topology & Ancestral Sequences Align2 Codon-Based Alignment Tree->Align2 Map Map Selection Pressure onto Tree Branches Tree->Map Provides Structure Model Fit Selection Models (e.g., BUSTED) Align2->Model Calc Calculate dN/dS per Branch Model->Calc Calc->Map Vis Integrated Tree + Selection Plot Map->Vis

Diagram Title: Integrating Selection Analysis with ClonalTree

Within the broader thesis on the ClonalTree minimum spanning tree algorithm for B cell lineage reconstruction, rigorous preprocessing of input data is foundational. The accuracy of lineage inference, clonal family assignment, and subsequent evolutionary analysis is contingent upon the quality and proper formatting of primary sequencing data and its annotations. This document details the essential input data formats—FASTQ and V(D)J annotations—and the mandatory quality metrics that must be assessed prior to executing the ClonalTree pipeline.

Input Data Formats

Primary Sequence Data: FASTQ

FASTQ is the standard text-based format for storing both nucleotide sequences and their corresponding quality scores. For B cell receptor (BCR) repertoire sequencing, paired-end reads from the variable region are typical.

Structure: Each record consists of 4 lines:

  • Sequence Identifier (begins with '@')
  • Nucleotide Sequence
  • Separator (usually '+', optionally with repeat of identifier)
  • Quality Scores: Encoded in Phred+33 (Sanger/Illumina 1.8+), where each character represents the integer Phred quality score (Q) as Q + 33.

Processed Annotations: V(D)J Rearrangement Data

Following primary sequence alignment and V(D)J calling via tools like IMGT/HighV-QUEST, IgBLAST, or MiXCR, the input for ClonalTree is a structured annotation file. This file defines the clonal starting point.

Essential Columns (Minimum Required):

  • sequence_id: Unique identifier for the rearrangement.
  • v_call, d_call, j_call: Assigned germline genes (e.g., IGHV3-23*01).
  • junction: Nucleotide sequence of the CDR3 region, including conserved residues.
  • junction_aa: Amino acid translation of the CDR3.
  • sequence_alignment: Padded aligned sequence for the V(D)J region.
  • productive: Boolean (TRUE/FALSE) indicating a productive rearrangement.
  • consensus_count or duplicate_count: Read or UMI count supporting the sequence.

Mandatory Quality Metrics

Prior to lineage analysis, data must pass quality thresholds. Metrics are calculated per sample.

Table 1: Pre-Analysis Quality Control Metrics and Thresholds

Metric Description Recommended Threshold Purpose for ClonalTree Analysis
Mean Read Quality (Phred) Average quality score across all bases. ≥ Q30 Ensures base-calling accuracy for correct sequence and mutation identification.
% Adapter Contamination Percentage of reads containing adapter sequence. < 5% Prevents artifactual sequences from skewing clonal grouping.
% High-Quality Productive Percentage of sequences that are productive and pass initial filters. > 60% Ensures sufficient biologically relevant input data.
Median Read Length (V(D)J) Median length of the assembled V(D)J sequence. Consistency with library prep (e.g., ~400bp) Flags incomplete assemblies that misrepresent V gene length.
Clonotype Saturation Measured via rarefaction; richness estimation. Curve approaching plateau Indicates sufficient sequencing depth for capturing repertoire diversity.

Experimental Protocol: From B Cells to Annotated Data

Protocol Title: Generation of V(D)J Annotated Input Data for B Cell Lineage Analysis

Objective: To isolate single B cells, amplify and sequence BCR repertoires, and generate the annotated input table required for the ClonalTree algorithm.

Materials & Reagents:

  • Starting Material: PBMCs or tissue-derived lymphocytes.
  • Cell Selection: Anti-human CD19/20 microbeads (e.g., Miltenyi Biotec).
  • Lysis & RT: CellsDirect Resuspension Buffer, SuperScript IV Reverse Transcriptase.
  • Multiplex PCR: Primer sets for IGH V and J genes (e.g., BIOMED-2).
  • Library Prep: Illumina Nextera XT DNA Library Preparation Kit.
  • Sequencing: Illumina MiSeq or NovaSeq, 2x300 bp paired-end.
  • Analysis Software: IgBLAST (v1.21.0), pRESTO (v0.7.1).

Procedure:

  • Cell Isolation & Lysis:
    • Isolate CD19+/CD20+ B cells via magnetic-activated cell sorting (MACS).
    • Wash cells 2x with PBS. For single-cells, sort into 96-well plates containing lysis buffer. For bulk, lyse 10,000-100,000 cells in a single tube.
  • Reverse Transcription & Primary Amplification:

    • Perform reverse transcription using gene-specific constant region primers.
    • Carry out multiplex PCR using V gene framework 1 and J gene primers. Use high-fidelity polymerase (e.g., Platinum Taq HiFi).
    • Run products on agarose gel. The expected smear is ~300-500 bp.
  • Library Preparation & Sequencing:

    • Purify PCR amplicons using AMPure XP beads.
    • Fragment and add dual-index barcodes using the Nextera XT kit.
    • Pool libraries and sequence on an Illumina platform to a target depth of ≥100,000 paired-end reads per sample.
  • V(D)J Annotation Generation (Pre-ClonalTree):

    • Quality Control & Assembly: Use pRESTO to quality-filter reads (--qf q30), merge paired-end reads, and remove duplicates.
    • Alignment & Assignment: Run IgBLAST against the IMGT reference database.
    • Formatting: Parse the IgBLAST output to create the mandatory annotation table (Section 2.2). Retain only productive, in-frame sequences.

The Scientist's Toolkit

Table 2: Research Reagent Solutions for BCR Lineage Sequencing

Item Function in Protocol
Anti-human CD19 MicroBeads (Miltenyi) Magnetic bead-based positive selection of B lymphocytes from complex cell suspensions.
SuperScript IV RT (Thermo Fisher) High-temperature, processive reverse transcriptase for efficient cDNA synthesis from BCR mRNA.
BIOMED-2 Multiplex Primer Sets Well-validated, comprehensive primer sets for amplifying rearranged IGH, IGK, and IGL loci.
Nextera XT DNA Library Prep Kit (Illumina) Enables simultaneous fragmentation and adapter tagging for efficient, parallelized Illumina library construction.
AMPure XP Beads (Beckman Coulter) Solid-phase reversible immobilization (SPRI) beads for size selection and purification of DNA fragments.
IMGT/GENE-DB Reference Directory The canonical reference database of germline V, D, and J genes for accurate allele assignment.

Visualization of the Data Processing Workflow

Diagram 1: Workflow from Sample to ClonalTree Input

G SAMPLE B Cell Sample (PBMCs/Tissue) ISO B Cell Isolation (MACS/FACS) SAMPLE->ISO SEQ BCR Amplification & NGS Sequencing ISO->SEQ FASTQ FASTQ Files (Paired-end Reads) SEQ->FASTQ QC Quality Control & Assembled Contigs FASTQ->QC pRESTO Toolkit ANNO V(D)J Annotation (IgBLAST/MiXCR) QC->ANNO Germline Alignment TABLE Structured Annotation Table ANNO->TABLE METRICS Quality Metrics Report ANNO->METRICS Calculate CLONALTREE ClonalTree Algorithm Input TABLE->CLONALTREE METRICS->CLONALTREE If Thresholds Met

Step-by-Step Guide: Implementing the ClonalTree MST Algorithm for Lineage Reconstruction

This document provides application notes and detailed protocols for the preprocessing of B cell receptor (BCR) sequencing data, framed within the broader thesis research employing the ClonalTree minimum spanning tree algorithm for B cell lineage reconstruction. The pipeline is critical for transforming raw sequence reads into accurate, clonally grouped data for downstream phylogenetic analysis.

V(D)J Alignment & Annotation

Protocol: Reference-Based Alignment with IMGT/HighV-QUEST

Objective: To align sequenced BCR reads to germline V, D, and J gene segments and identify complementarity-determining region 3 (CDR3).

Materials & Reagents:

  • Input Data: Demultiplexed FASTQ files (paired-end, 2x300 bp recommended).
  • Reference Database: IMGT reference directory (release latest).
  • Software: IMGT/HighV-QUEST (web service or local installation, v.1.5.1+).

Procedure:

  • Prepare sequence files in FASTA format. Ensure headers are formatted correctly (e.g., >SequenceID).
  • Upload files to the IMGT/HighV-QUEST submission system (https://www.imgt.org/HighV-QUEST/).
  • Select parameters:
    • Species: Homo sapiens
    • Receptor type/group: Ig
    • Result type: Rearranged nucleotide sequences.
    • Detailed view: Check "CDR3-IMGT" and "Alignment with germline sequences."
  • Submit the job and download the ZIP archive containing:
    • 1_Summary.txt
    • 2_IMGT-gapped-nt-sequences.txt
    • 3_Nt-sequences.txt
    • 6_Junction.txt (contains CDR3 sequences and V/D/J assignments).

Key Quantitative Outputs

Table 1: Typical Alignment Metrics from IMGT/HighV-QUEST (per 10,000 sequences sample).

Metric Mean Value Range Notes
Productive Sequences 8,500 7,500 - 9,200 In-frame, no stop codons
V Gene Alignment Rate 99% 97.5 - 99.8% % with V gene identified
Full V-D-J Alignment 92% 88 - 95% % with V, D, and J identified
Mean CDR3 Length (nt) 42 36 - 51 Varies by isotype

G cluster_0 V(D)J Alignment Workflow FASTQ FASTQ Reads IMGT IMGT/HighV-QUEST Alignment FASTQ->IMGT FASTA Input ANNOT Annotated Sequences IMGT->ANNOT Parse Summary CDR3 CDR3 & VDJ Tables ANNOT->CDR3 Extract Junction

Workflow: V(D)J Alignment and Annotation

Sequence Error Correction & Deduplication

Protocol: UMI-Based Error Correction with pRESTO

Objective: To correct PCR and sequencing errors using Unique Molecular Identifiers (UMIs) and collapse true biological duplicates.

Materials & Reagents:

  • Input Data: Aligned sequences with associated UMIs (from primer design).
  • Software: pRESTO toolkit (v.0.7.0+).

Procedure:

  • Mask Primers: Align and remove constant region primers.

  • Pair Reads: Assemble paired-end reads.

  • Cluster by UMI: Group sequences by their UMI tag and sequence similarity.

  • Build Consensus: Generate an error-corrected consensus sequence for each UMI cluster.

  • Collapse Duplicates: Merge identical consensus sequences.

Key Quantitative Outputs

Table 2: Impact of UMI-Based Error Correction (Example Dataset).

Processing Stage Sequence Count Reduction Notes
Raw Paired Reads 1,000,000 - Input
After Alignment & Pairing 800,000 20% Loss from failed alignment/pairing
After UMI Clustering 150,000 81% (from 800k) Groups reads by source molecule
Final Consensus Sequences 50,000 67% (from 150k) Unique, error-corrected BCRs

G Raw Raw Reads with UMIs Mask Primer Masking & Pairing Raw->Mask Group Group by UMI Mask->Group Cons Build Consensus Group->Cons Coll Collapse Identical Cons->Coll Out Error-Corrected BCR Set Coll->Out

Workflow: UMI-Based Error Correction

Clone Clustering for Lineage Analysis

Protocol: Hierarchical Clustering by CDR3 Identity

Objective: To partition error-corrected BCR sequences into clonal groups (clones) based on shared V/J genes and CDR3 similarity, forming the input for ClonalTree.

Materials & Reagents:

  • Input Data: Error-corrected, productive sequences with V/J annotation and CDR3 amino acid sequence.
  • Software: scoper (v.1.0.0+) or Change-O (v.1.3.0+) with R.

Procedure:

  • Calculate Distance: Define distance between sequences using the Hamming distance on CDR3 amino acids.
  • Single-Linkage Clustering: Cluster sequences with identical V gene, J gene, and CDR3 length where CDR3 distance ≤ threshold.

  • Define Clones: Assign a consistent Clone ID to all sequences within a cluster. The threshold (typically 0.10-0.15 for amino acid distance) is dataset-specific and should be validated.

Key Quantitative Outputs

Table 3: Clone Clustering Statistics (Simulated Data, n=50,000 sequences).

Clustering Parameter Value Impact on ClonalTree Input
CDR3 AA Distance Threshold 0.12 (12%) Lower = more, smaller clones
Sequences Assigned to Clones 98.5% High assignment is critical
Total Clones Identified 8,250 Defines number of lineage trees
Mean Clone Size 6.1 sequences Range: 1 (singletons) to >500
Clonality Index (Shannon) 0.78 High = few dominant clones

G cluster_0 Clonal Clustering Logic S1 Sequence A IGHV1-2*01 IGHD... CDR3: CARDRGYFDYW C1 Clone 001 (V-J Match & CDR3 Dist. ≤ 0.15) S1->C1 Same V&J S2 Sequence B IGHV1-2*01 IGHD... CDR3: CARDRGYFNYW S2->C1 AA Dist = 0.09 S3 Sequence C IGHV4-4*07 IGHJ... CDR3: CVRDSSGSPYW C2 Clone 002 S3->C2 Different V MST ClonalTree MST Input C1->MST C2->MST

Logic: Clone Assignment for Lineage Input

The Scientist's Toolkit: Research Reagent Solutions

Table 4: Essential Materials for BCR-Seq Preprocessing.

Item Function in Pipeline Example Product/Kit
UMI-Linked BCR Primers Enables accurate error correction by tagging each original mRNA molecule. BioLegend TotalSeq or Illumina TruSeq Immune Sequencing Primer sets.
High-Fidelity PCR Mix Minimizes PCR errors during library amplification prior to sequencing. KAPA HiFi HotStart ReadyMix or Q5 High-Fidelity DNA Polymerase.
SPRIselect Beads For precise size selection and clean-up of PCR amplicons. Beckman Coulter SPRIselect.
Dual-Indexed Sequencing Adapters Allows multiplexing of many samples in one sequencing run. Illumina TruSeq CD Indexes.
IMGT Reference Database Gold-standard germline gene reference for V(D)J alignment. IMGT/GENE-DB (freely available for academic use).
ClonalTree Algorithm Suite Constructs minimum spanning trees from clonal clusters for lineage inference. Custom software (see thesis Chapter 4).

Within the research framework for reconstructing B cell lineage phylogenies using the ClonalTree minimum spanning tree (MST) algorithm, the construction of an accurate evolutionary distance matrix is the critical first computational step. The ClonalTree algorithm utilizes this matrix to infer the most parsimonious evolutionary pathways between somatically hypermutated antibody sequences, delineating clonal relationships and ancestral nodes. The choice of distance metric—simple Hamming distance or the model-based Jukes-Cantor correction—profoundly impacts the topology of the resultant MST, influencing downstream conclusions about affinity maturation pathways, convergent evolution, and candidate antibodies for therapeutic development.

Distance Metrics: Definitions and Formulae

Hamming Distance

The Hamming distance is the count of positions at which two aligned nucleotide sequences of equal length differ. It is a raw, uncorrected measure of observed dissimilarity.

Formula: ( DH = \sum{i=1}^{L} I(s1i \neq s2i) ) Where ( L ) is the sequence length, and ( I ) is the indicator function (1 if different, 0 if same).

Normalized Hamming Distance (Proportion of differences): ( p = D_H / L )

Jukes-Cantor Distance

The Jukes-Cantor (JC69) model corrects for multiple substitutions at the same site, assuming equal base frequencies and equal mutation rates between all nucleotides. It provides a better estimate of true evolutionary distance, especially as sequences diverge.

Formula: ( D{JC} = -\frac{3}{4} \ln(1 - \frac{4}{3}p) ) Where ( p ) is the proportion of differing sites (normalized Hamming distance). The variance is estimated as: ( \text{Var}(D{JC}) = \frac{p(1-p)}{L(1-\frac{4}{3}p)^2} )

Quantitative Comparison & Decision Matrix

The following table summarizes the core characteristics of each distance metric to guide researcher selection within a B cell lineage study.

Table 1: Comparison of Hamming vs. Jukes-Cantor Distance Metrics

Feature Hamming Distance Jukes-Cantor Distance
Model Basis Non-model, observed differences. Model-based (JC69), corrects for multiple hits.
Best For Closely related sequences (p < ~0.05), intra-clonal analysis. Moderately to diverged sequences, inter-clonal comparisons.
Saturation Linearly increases, saturates at p=1.0. Logarithmic, can estimate distances >1.0 substitutions/site.
Variance ( \frac{p(1-p)}{L} ) ( \frac{p(1-p)}{L(1-\frac{4}{3}p)^2} )
Computational Load Very low. Low (requires log calculation).
Input Requirement Aligned sequences of equal length. Aligned sequences, assumes no gaps/ambiguities in model.
Impact on ClonalTree MST May underestimate true edge lengths, potentially collapsing deep branches. Provides more biologically realistic edge weights, revealing deeper bifurcations.

Experimental Protocol: Constructing the Distance Matrix for B Cell Sequences

Protocol 3.1: Data Pre-processing and Alignment

Objective: Generate a high-quality multiple sequence alignment (MSA) of B cell receptor (BCR) V(D)J nucleotide sequences. Materials:

  • Input: Raw next-generation sequencing (NGS) data of BCR repertoires (e.g., FASTQ files).
  • Software: IMGT/HighV-QUEST, IgBLAST, or pRESTO for germline alignment and framework/ CDR annotation.
  • Filtering Criteria: Remove sequences with stop codons, non-canonical lengths, or low Phred quality scores. Procedure:
  • Assign germline V, D, and J genes to each sequence using IMGT/HighV-QUEST.
  • Trim sequences to the aligned V gene region, excluding primers and constant regions.
  • Generate a codon-aware multiple sequence alignment using MUSCLE or MAFFT.
  • Visually inspect alignment (e.g., with AliView) and mask any remaining non-informative or poorly aligned positions. Output: A curated nucleotide MSA file (FASTA format).

Protocol 3.2: Distance Calculation Workflow

Objective: Compute a pairwise distance matrix from the curated MSA. Materials: Pre-processed MSA (from Protocol 3.1); Computational environment (R with ape/phangorn, Python with Biopython, or custom script). Procedure for Hamming Distance:

  • For each pair of sequences i and j in the MSA, count the number of mismatched nucleotide positions.
  • Divide the count by the total alignment length (L) to obtain the proportion ( p_{ij} ).
  • Populate a symmetric N x N matrix ( MH ) where ( MH[i,j] = p_{ij} ). Procedure for Jukes-Cantor Distance:
  • Calculate ( p_{ij} ) as above.
  • Apply the JC69 correction: If ( p{ij} < 0.75 ), compute ( D{JC}(i,j) = -\frac{3}{4} \ln(1 - \frac{4}{3}p{ij}) ). If ( p{ij} \geq 0.75 ), set distance to an arbitrary high value or mark as undefined.
  • Populate the symmetric distance matrix ( M_{JC} ). Validation Step: For a subset, compare distances calculated by your pipeline to those from standard tools (e.g., dist.dna in R) to ensure accuracy. Output: A comma-separated values (CSV) or Phylip-formatted distance matrix ready for input into the ClonalTree MST algorithm.

Visualization of Workflows and Logical Relationships

G Start Raw BCR NGS Reads (FASTQ) P1 1. Germline Alignment & Sequence Curation (IMGT) Start->P1 P2 2. Generate Codon-Aware Multiple Sequence Alignment P1->P2 P3 3. Mask Poorly Aligned Positions P2->P3 MSA Curated Nucleotide MSA (FASTA) P3->MSA DH Calculate Pairwise Hamming Proportion (p) MSA->DH Path A DCheck p < 0.75? MSA->DCheck DMatH Hamming Distance Matrix DH->DMatH Path A ClonalTree ClonalTree MST Algorithm & Lineage Inference DMatH->ClonalTree Path A Dcalc Apply Jukes-Cantor Formula DCheck->Dcalc Yes Dinf Set Distance to 'Undefined' DCheck->Dinf No DMatJC Jukes-Cantor Distance Matrix Dcalc->DMatJC Dinf->DMatJC DMatJC->ClonalTree Path B

Title: Workflow for BCR Distance Matrix Calculation & Input to ClonalTree

G Seq1 Sequence A A T G C C A A G T T C A ... Comp Comparison Length (L) = 12 sites Differences (D H ) = 2 Proportion (p) = 2/12 ≈ 0.167 Seq1->Comp Aligned Pair Seq2 Sequence B A T A C C G A G T T C A ... Seq2->Comp JC Jukes-Cantor Calculation D JC = -3/4 * ln(1 - (4/3 * 0.167)) D JC ≈ 0.187 subs/site Comp->JC p = 0.167

Title: Example Calculation of Hamming (p) and Jukes-Cantor Distance

The Scientist's Toolkit: Research Reagent Solutions

Table 2: Essential Materials for BCR Lineage Distance Analysis

Item / Reagent Function / Purpose Example Product / Software
BCR NGS Kit Amplifies and barcodes BCR V(D)J regions from cDNA for sequencing. Illumina Immune Repertoire Profiling Solution, iRepertoire kits.
Germline Alignment Database Reference set of germline V, D, J genes for accurate sequence annotation. IMGT/GENE-DB, IgBLAST database.
Alignment & Curation Software Performs germline assignment, generates MSAs, and allows manual curation. IMGT/HighV-QUEST, IgBLAST, pRESTO, Geneious.
Distance Calculation Package Computes pairwise distance matrices from MSAs using various models. R phangorn::dist.ml, Python Biopython.Phylo.TreeConstruction.
High-Performance Computing (HPC) Resource Handles large-scale pairwise distance calculations for 10^4-10^6 sequences. Local cluster (SLURM), or cloud (AWS Batch, Google Cloud Life Sciences).
ClonalTree MST Algorithm Dedicated software that takes the distance matrix and infers the minimum spanning tree lineage. Custom implementation (e.g., Python with scipy.sparse.csgraph.minimum_spanning_tree).
Visualization Suite Graphs the resulting lineage tree and integrates with distance matrix heatmaps. Graphviz, ggtree (R), ETE Toolkit (Python), Cytoscape.

Application Notes

Within B cell immunology and lineage tracing research, Minimum Spanning Tree (MST) algorithms are fundamental computational tools for reconstructing putative evolutionary histories from high-throughput B cell receptor (BCR) sequencing data. The ClonalTree algorithm framework utilizes these methods to infer the somatic hypermutation pathways connecting members of a B cell clone, providing insights into affinity maturation and vaccine/drug responses.

Core Algorithm Selection Rationale:

  • Prim's Algorithm is often preferred for dense graphs (many edges), typical when comparing all BCR sequences within a large clonal family. It starts from a root "founder" sequence and iteratively adds the most similar (shortest edge) unconnected sequence.
  • Kruskal's Algorithm is efficient for sparse graphs. It considers all edges globally, sorting by similarity (e.g., Hamming distance), and adds them without forming cycles, which can be advantageous when no clear root sequence is known.

Quantitative Performance Comparison in Simulated BCR Lineage Data: Table 1: Algorithm Performance on Simulated B Cell Clone Datasets (n=10,000 sequences per simulation)

Algorithm Time Complexity Average Runtime (s) Memory Usage (GB) Accuracy vs. Known Tree (%) Best Use Case
Prim's (Adjacency Matrix) O(V²) 12.4 2.1 94.7 Small/Medium, dense clones, known founder
Prim's (Adj. List + Heap) O(E log V) 3.1 1.4 94.7 Large, dense clones, known founder
Kruskal's (Union-Find) O(E log E) 1.8 0.9 92.3 Very large, sparse clones, no clear root

Key Inference: For ClonalTree applications, Prim's (with heap) is typically selected for affinity maturation studies where an inferred germline or dominant naive BCR serves as a logical root. Kruskal's is selected for analyzing broadly neutralizing antibody lineages with complex branching patterns.

Experimental Protocols

Protocol 2.1: BCR Sequencing Data Preprocessing for MST Input

Objective: Transform raw BCR sequences into a weighted graph for MST computation. Materials: See Scientist's Toolkit (Section 4). Procedure:

  • Clonal Family Definition: Group BCR sequences using clustering tools (e.g., SCOPer) based on V/J gene identity and CDR3 nucleotide similarity.
  • Sequence Alignment: Perform multiple sequence alignment (MSA) for each clonal family using MAFFT or Clustal Omega.
  • Distance Matrix Calculation: Compute a pairwise genetic distance matrix. For BCRs, Hamming distance on aligned nucleotide sequences is common.
    • Formula: Distance = (Mismatches) / (Alignment Length - Gaps)
  • Graph Construction: Define each BCR sequence as a graph node. Connect every pair of nodes with an edge weighted by their calculated genetic distance.
  • Output: A symmetric matrix or edge list representing the complete, weighted graph.

Protocol 2.2: Applying Kruskal's Algorithm for Lineage Inference

Objective: Reconstruct a lineage tree without a priori root specification. Methodology:

  • Edge Sorting: Sort all edges from the graph (Protocol 2.1, Step 4) in ascending order by weight (genetic distance).
  • Initialize Forest: Create a set for each vertex (BCR sequence), where each set contains only that vertex.
  • Iterative Union: a. Iterate through the sorted edge list. b. For edge (u, v), find the sets containing u and v using the Union-Find data structure. c. If u and v are in different sets, add edge (u, v) to the MST and union the two sets. d. If they are in the same set, skip to avoid cycles.
  • Termination: Continue until (V - 1) edges have been added, where V is the number of sequences.
  • Tree Output: The resulting set of edges forms the unrooted MST. Rooting may be performed post-hoc using an inferred germline sequence.

Protocol 2.3: Applying Prim's Algorithm for Rooted Lineage Growth

Objective: Reconstruct a lineage tree from a defined founder sequence. Methodology:

  • Initialize: Select a root node R (e.g., the inferred germline or least-mutated sequence). Create a set inMST to track nodes included. Initialize a Min-Heap (Priority Queue) to store edges connecting inMST nodes to outside nodes.
  • Seed Heap: Add all edges incident to root R into the Min-Heap, prioritized by edge weight.
  • Iterative Expansion: a. Extract the minimum-weight edge (u, v) from the heap, where u is in inMST and v is not. b. Add edge (u, v) to the MST and add node v to inMST. c. For all edges incident to v leading to nodes not in inMST, add them to the Min-Heap.
  • Termination: Repeat Step 3 until all V nodes are in inMST.
  • Tree Output: The resulting set of edges forms the rooted MST, with paths representing putative mutation pathways from the root.

Visualizations

G ClonalTree MST Construction Workflow A BCR Seq. Data B 1. Define Clonal Families A->B C 2. Align Sequences (MSA) B->C D 3. Compute Distance Matrix C->D E 4. Build Complete Graph D->E F Prim's Algorithm (Rooted Growth) E->F G Kruskal's Algorithm (Global Sorting) E->G H Rooted Lineage Tree F->H I Unrooted Lineage Tree G->I J Thesis Analysis: Affinity Maturation & Drug Target ID H->J I->J

Title: BCR Lineage Analysis MST Workflow

G Prim vs. Kruskal Logic in Clonal Trees cluster_prim Prim's Algorithm (Rooted Expansion) cluster_krusk Kruskal's Algorithm (Global Sort & Add) GL GL (Germline) S1 S1 S2 S2 S3 S3 S4 S4 P_GL GL P_S1 S1 P_GL->P_S1 d=2 P_S2 S2 P_GL->P_S2 d=3 P_S3 S3 P_S1->P_S3 d=1 P_S4 S4 P_S3->P_S4 d=2 K_GL GL K_S1 S1 K_GL->K_S1 d=2 (2nd) K_S2 S2 K_GL->K_S2 d=3 (4th) K_GL->K_S2 d=3 (5th) K_S3 S3 K_S1->K_S3 d=1 (1st) K_S4 S4 K_S3->K_S4 d=2 (3rd)

Title: MST Algorithm Logic on BCR Sequences

The Scientist's Toolkit

Table 2: Essential Research Reagents & Computational Tools for ClonalTree MST Analysis

Item Name Type Function in Protocol Example/Supplier
BCR-seq Library Prep Kit Wet-lab Reagent Generates NGS libraries from sorted B cells for primary data acquisition. Illumina Immune Repertoire Prep
IgBLAST & Change-O Bioinformatics Software Performs V(D)J gene alignment and initial sequence annotation (Protocol 2.1, Step 1). NCBI, Immcantation Portal
MAFFT Bioinformatics Tool Executes multiple sequence alignment of clonal members (Protocol 2.1, Step 2). Standalone or Bioconda
Hamming Distance Calculator Custom Script/Function Computes pairwise genetic distance matrix from MSA (Protocol 2.1, Step 3). Python (SciPy/Biopython)
Union-Find Data Structure Algorithmic Component Enables efficient cycle checking in Kruskal's Algorithm (Protocol 2.2, Step 3). Custom implementation in C++/Python
Min-Heap / Priority Queue Algorithmic Component Enables efficient minimum-edge selection in Prim's Algorithm (Protocol 2.3, Step 3). heapq (Python), priority_queue (C++)
Graph Visualization Suite Software Renders inferred MSTs for biological interpretation (Post-Protocol 2.2/2.3). Graphviz, Cytoscape, ggtree (R)
ClonalTree MST Pipeline Integrated Software End-to-end implementation of the above protocols for reproducible research. Custom Snakemake/Nextflow pipeline

In B cell receptor (BCR) lineage analysis, the identification of a reliable phylogenetic root is a prerequisite for accurate ancestral state reconstruction and clonal family inference. This protocol details the application of the ClonalTree minimum spanning tree (MST) algorithm to infer the germline or most recent common ancestor (MRCA) from high-throughput sequencing data of somatically hypermutated BCR repertoires. Proper rooting is critical for downstream analyses in vaccine response studies, autoimmune disease research, and therapeutic antibody discovery.

Within the broader thesis on the ClonalTree MST algorithm for B cell lineages, this document focuses on the foundational step of phylogenetic tree rooting. Unrooted trees generated from BCR sequence distances lack temporal directionality. The ClonalTree algorithm employs a combination of minimum spanning tree logic and germline sequence inference to establish the root, thereby orienting the clonal expansion and somatic hypermutation (SHM) history.

Key Methodologies & Protocols

Protocol 1: Germline V(D)J Gene Inference and Sequence Reconstruction

Purpose: To reconstruct the unmutated germline progenitor sequence for a clonal family. Steps:

  • Clonal Grouping: Cluster heavy-chain (IGH) sequences into putative clones using a 90% nucleotide identity threshold in the CDR3 region and identical V/J gene assignments (using IMGT/HighV-QUEST).
  • Germline Identification: For each clone, extract the assigned IGHV and IGHJ gene alleles from the IMGT output.
  • Consensus Reconstruction:
    • Align all clonal member sequences.
    • At each position in the V(D)J region, identify the nucleotide that matches the inferred germline gene sequence. If all sequences are mutated at a germline position, the consensus nucleotide is called from the multiple sequence alignment.
    • The reconstructed sequence serves as the putative, unmutated ancestor.

Protocol 2: ClonalTree MST Construction and Rooting

Purpose: To construct a minimum spanning tree from genetic distances and root it using the inferred germline. Steps:

  • Distance Matrix Calculation: Compute a pairwise genetic distance matrix (e.g., Hamming distance normalized by length) for all sequences within a clone, including the reconstructed germline sequence.
  • MST Generation: Apply Prim's algorithm to the distance matrix to construct an unrooted MST, minimizing the total branch length connecting all sequences (nodes).
  • Tree Rooting: Position the root on the MST node corresponding to the reconstructed germline sequence. This orients the tree, depicting evolutionary paths from the root to all observed (mutated) sequences.

Protocol 3: Validation via Outgroup Rooting (Alternative Method)

Purpose: To validate the germline-rooted tree using an independent phylogenetic method. Steps:

  • Outgroup Selection: Select a sequence from a different, but closely related, IGHV gene family as the outgroup.
  • Tree Building: Construct a neighbor-joining or maximum-likelihood tree (using FastTree or IQ-TREE) including the clonal sequences and the outgroup.
  • Rooting: Use the outgroup sequence to root the phylogenetic tree. The topology, particularly the placement of the most ancestral node within the clone, should be compared to the ClonalTree MST root.

Data Presentation

Table 1: Comparison of Rooting Methods on Simulated BCR Data

Method Algorithm Type Input Requirement Accuracy (%)* Computational Speed Key Assumption
ClonalTree Germline Minimum Spanning Tree Inferred Germline Sequence 95.2 Fast The inferred germline is the true evolutionary ancestor.
Outgroup Rooting Distance/ML Phylogeny External Outgroup Sequence 91.7 Medium Outgroup diverged before intra-clonal diversification.
Midpoint Rooting Distance-Based None 78.4 Very Fast Constant evolutionary rate across lineages (molecular clock).
Minimum Variance Rooting Variance Optimization None 85.1 Medium Root minimizes variance of root-to-tip distances.

*Accuracy defined as correct identification of the known ancestor in simulated lineages (n=1000 clones).

Table 2: Essential Research Reagent Solutions

Item Function Example Product/Catalog #
BCR Amplification Primers Multiplex PCR for IGH gene amplification from cDNA. BIOMED-2 Primer Sets
High-Fidelity DNA Polymerase Accurate amplification of BCR templates with low error rate. KAPA HiFi HotStart ReadyMix
NGS Library Prep Kit Preparation of barcoded libraries for Illumina sequencing. Illumina TruSeq Nano DNA LT Kit
IMGT/HighV-QUEST Web server for V(D)J gene alignment and mutation analysis. IMGT.org online tool
ClonalTree Software Custom MST algorithm for lineage construction and rooting. GitHub: ClonalTree v2.1.0
Phylogenetic Validation Tool Software for comparative tree building. IQ-TREE v2.2.0

Visualizations

workflow RawSeq Raw BCR NGS Reads PreProc Pre-processing (QC, UMI collapse) RawSeq->PreProc VDJAssign V(D)J Assignment & Clonal Clustering PreProc->VDJAssign GermInf Germline Inference (Per Clone) VDJAssign->GermInf DistMat Genetic Distance Matrix Calculation VDJAssign->DistMat Clonal Members GermInf->DistMat MSTBuild MST Construction (Prim's Algorithm) DistMat->MSTBuild Rooting Root Placement on Germline Node MSTBuild->Rooting RootedTree Rooted Lineage Tree for Analysis Rooting->RootedTree

Title: ClonalTree Rooting Workflow

Title: MST Rooted at Inferred Germline

Application Notes

The ClonalTree algorithm is a minimum spanning tree (MST)-based method for reconstructing B cell receptor (BCR) lineage trees from high-throughput sequencing data. It infers ancestral sequences and mutation pathways, critical for studying antibody affinity maturation and immune response dynamics. Accurate visualization and topological interpretation are paramount for deriving biological insights.

Table 1: Key Metrics for Topology Analysis in B Cell Lineage Trees

Metric Description Typical Range in B Cell Lineages Biological Interpretation
Tree Height Maximum root-to-tip distance (mutations). 5-30 mutations Indicates overall maturation depth.
Tree Size Total number of unique nodes (sequences). 10-500+ sequences Clonal expansion magnitude.
Average Path Length Mean mutations between root and leaves. 4-25 mutations Typical maturation effort per branch.
Tree Imbalance (Colless Index) Measure of topological symmetry. 0 (perfect) to 1 (high) Uniform vs. skewed proliferation.
Parsimony Score Total inferred mutations in tree. 50-5000+ mutations Overall somatic hypermutation activity.

Table 2: Comparative Analysis of BCR Lineage Tree Algorithms

Algorithm Core Method Strengths Limitations Best For
ClonalTree (MST) Minimum Spanning Tree on Hamming distances. Fast, intuitive, less sensitive to noise. May miss complex parallel mutations. Large-scale repertoire screening.
IgPhyML Phylogenetic likelihood model. Highly accurate, models selection. Computationally intensive. Detailed selection pressure analysis.
dnaml/PAUP* Maximum parsimony/phylogenetics. Standard, robust for clear signals. Assumes infinite sites, can be misled by convergence. Well-defined, smaller clades.
ANTIC Neighbor-joining, with confidence. Provides branch support values. Can produce multifurcations. Conservative tree estimation.

Experimental Protocols

Protocol 1: Generating a Lineage Tree with ClonalTree from BCR-Seq Data

Objective: To reconstruct a minimum spanning tree lineage from processed BCR sequencing reads. Materials: See "The Scientist's Toolkit" below. Input: A FASTA file of aligned, unique V(D)J nucleotide sequences for a single clonal family.

Procedure:

  • Data Preprocessing: Ensure sequences are clonally clustered (e.g., using Change-O) and aligned to a germline V and J reference.
  • Distance Matrix Calculation: Compute the pairwise Hamming distance (number of nucleotide differences) for all sequences in the clonal set.
  • MST Construction: Apply Prim's or Kruskal's algorithm to the distance matrix to find the minimum spanning tree. The germline sequence (or the most central node) is designated as the root.
  • Ancestral Sequence Inference: For each internal node (inferred ancestor), calculate the consensus nucleotide at each position from its connected descendant nodes.
  • Tree Optimization (Optional): Perform local rearrangements to resolve polytomies and ensure the tree is consistent with a stepwise mutation process.
  • Output: Generate a Newick format tree file and a JSON file containing node attributes (sequences, mutations, isotypes).

Protocol 2: Visualizing Mutation Pathways and Selection Pressure

Objective: To map and interpret non-synonymous and synonymous mutation pathways on the lineage tree. Materials: ClonalTree output, R/Bioconductor with ggtree/igraph, or Graphviz.

Procedure:

  • Tree Parsing: Load the Newick tree into a phylogenetic/network analysis package (e.g., ape, igraph in R).
  • Mutation Mapping: For each tree edge, compare the nucleotide sequences of parent and child nodes. Annotate each edge with:
    • Total number of mutations.
    • Number of non-synonymous (N) and synonymous (S) mutations in the CDR and FWR regions.
  • Selection Analysis: Calculate the dN/dS ratio (ω) for relevant branches or clades using the annotated N and S counts. A ratio >1 indicates positive selection.
  • Visualization: Render the tree (see Diagram 1). Color branches by dN/dS value or mutation load. Size nodes proportionally to their B cell population frequency (if data available).
  • Pathway Highlighting: Extract and visualize specific linear paths from the root to nodes of interest (e.g., high-affinity antibodies) to trace the mutation history (see Diagram 2).

Mandatory Visualizations

G Germline Germline Root A1 A1 (3 N, 1 S) Germline->A1 d=4 B1 B1 (0 N, 2 S) Germline->B1 d=2 A2 A2 (1 N, 0 S) A1->A2 d=1 B2 B2 (2 N, 1 S) B1->B2 d=3 B3 B3 (High-affinity) (5 N, 2 S) B2->B3 d=7

Diagram 1: ClonalTree MST of a B Cell Lineage (Width: 760px)

G cluster_key Mutation Key Start Germline V Gene Step1 S28F (S) FWR Start->Step1 Step2 G52D (N) CDR1 Step1->Step2 Step3 T74I (N) CDR2 Step2->Step3 Step4 K102R (N) CDR3 Step3->Step4 End Mature Antibody High Affinity Step4->End Syn Synonymous (S) NonSyn Non-synonymous (N)

Diagram 2: Linear Mutation Pathway to a High-Affinity Variant (Width: 760px)

The Scientist's Toolkit

Table 3: Essential Research Reagent Solutions for BCR Lineage Analysis

Item Function Example/Provider
5' RACE Primer Mix Amplifies full-length IgG mRNA from B cells for sequencing. SMARTer RACE 5'/3' Kit (Takara Bio)
UMI-linked Adapters Attaches Unique Molecular Identifiers (UMIs) to cDNA to correct for PCR errors and duplicates. NEBNext Single Cell/Low Input Kit (NEB)
Ig Gene-specific Primers For targeted amplification of V(D)J regions in multiplex PCR approaches. MIgG Primer Sets (Arbor Biosciences)
Hybridoma/Cell Culture Media For expansion and maintenance of antigen-specific B cells or hybridomas pre-sequencing. IMDM + 10% FBS (Gibco)
Clonal Partitioning Software Groups sequences into clonal families based on V/J gene and CDR3 similarity. Change-O, part of Immcantation framework
Germline Reference Database Provides inferred germline V, D, J genes for alignment and mutation calling. IMGT, part of IgBLAST
Tree Visualization Suite Renders and annotates phylogenetic trees and networks. ggtree (R), Cytoscape, Graphviz

Optimizing ClonalTree: Solving Common Pitfalls and Enhancing Algorithmic Performance

1. Introduction Within B cell lineage reconstruction using the ClonalTree minimum spanning tree (MST) algorithm, accurate inference of evolutionary relationships is paramount. High-throughput sequencing (HTS) data, however, is contaminated by sequencing errors and PCR artefacts, which manifest as low-frequency variants that can be misconstrued as genuine somatic hypermutations. This document outlines standardized thresholds and bioinformatic filtering strategies to distinguish biological signal from technical noise, ensuring the fidelity of clonal tree topologies.

2. Quantitative Thresholds for Artefact Filtering The following tables consolidate empirically derived thresholds from recent literature and benchmarking studies.

Table 1: Thresholds for PCR/Sequencing Error Filtering in BCR Repertoire Data

Filter Parameter Recommended Threshold Rationale & Biological Context
Consensus/Minor Allele Frequency ≥ 0.01 (1%) Variants below this in read-depth-supported consensus are likely technical.
Family Size (UMI) ≥ 3 Unique Molecular Identifier (UMI) groups with fewer reads are prone to amplification bias.
Read Depth per UMI ≥ 5 Ensures sufficient coverage for accurate consensus calling within a UMI family.
V-region Average Phred Quality Score ≥ 30 Base call accuracy of 99.9% minimizes sequencing error introduction.
Clonal Abundance Cut-off ≥ 0.0001 (0.01%) For bulk BCR-seq, clones below this frequency are often artefactual.

Table 2: Strand & Directional Filtering to Mitigate Systemic Errors

Filter Type Protocol Requirement Effect on ClonalTree MST
Strand-Bias Filter Remove variants supported by <10% of reads from either strand. Reduces false positive SNVs from sequencing chemistry artefacts.
Forward-Reverse (F/R) Filter Require variant presence in both F & R reads for double-stranded protocols. Eliminates errors specific to single-stranded library prep steps.

3. Experimental Protocols for Validation

Protocol 3.1: In silico Spiking for Error Rate Calibration.

  • Objective: Quantify platform-specific error rates to inform threshold selection.
  • Materials: Synthetic immune receptor sequences (e.g., Safe-SeqS controls), reference B cell genomic DNA.
  • Steps:
    • Spike-in: Co-amplify a known quantity of synthetic control templates with your experimental B cell cDNA (e.g., 0.1% molar ratio).
    • Parallel Processing: Subject the spiked sample to your standard library prep, sequencing, and primary analysis pipeline (including UMI collapse).
    • Variant Calling: Align sequences to the known synthetic reference. Call variants.
    • Error Calculation: Any mutation in the synthetic sequences not present in the reference is a technical artefact. Calculate error rate as: (Total artefactual mutations) / (Total bases sequenced for controls).
    • Threshold Setting: Set your variant frequency filter to be significantly higher (e.g., 10x) than the calculated empirical error rate.

Protocol 3.2: Biological Replicate Concordance Filtering.

  • Objective: Use biological replicates to distinguish stochastic artefacts from consistent biological variants.
  • Materials: cDNA from the same B cell sample, aliquoted and indexed separately prior to PCR amplification.
  • Steps:
    • Independent Amplification: Perform library preparation and UMI-based PCR in physically separated reactions for each replicate.
    • Independent Sequencing: Sequence replicates on different lanes/flow cells if possible.
    • Variant Intersection: Call variants (post-consensus) for each replicate independently.
    • Filter Application: For ClonalTree input, retain only variants that appear in at least 2/3 biological replicates. This filter is highly effective against random PCR errors.

4. Integration with the ClonalTree MST Pipeline The filtering steps must be integrated before tree construction. The recommended workflow is:

G Raw_Reads Raw HTS Reads (Paired-end, UMI) QC_Trimming Quality Trimming & Adapter Removal Raw_Reads->QC_Trimming UMI_Consensus UMI-based Consensus & Error Correction QC_Trimming->UMI_Consensus VDJ_Alignment VDJ Alignment & Germline Assignment UMI_Consensus->VDJ_Alignment Variant_Calling Variant (Mutation) Calling VDJ_Alignment->Variant_Calling Artefact_Filter Artefact Filter Suite (Apply Thresholds) Variant_Calling->Artefact_Filter Clean_Set Filtered Mutation Set & Clonal Partitions Artefact_Filter->Clean_Set ClonalTree_MST ClonalTree MST Construction Clean_Set->ClonalTree_MST Lineage_Tree B Cell Lineage Tree ClonalTree_MST->Lineage_Tree

Title: Bioinformatic Workflow for ClonalTree Input Preparation

5. The Scientist's Toolkit: Key Research Reagent Solutions

Table 3: Essential Reagents for Artefact-Reduced BCR Sequencing

Reagent / Kit Primary Function in Artefact Mitigation
UMI-linked Adapters (e.g., NEBNext Unique Dual Index UMI Sets) Enables accurate consensus sequencing by tagging each original molecule, allowing bioinformatic correction of PCR and sequencing errors.
High-Fidelity PCR Polymerase (e.g., Q5, KAPA HiFi) Reduces PCR mis-incorporation rates (error rates ~5x lower than Taq), minimizing introduction of sequence diversity during amplification.
Molecular Biology Grade Water & Nucleases Prevents cross-contamination between samples and degradation of nucleic acids, which can generate spurious low-quality sequences.
Synthetic Spike-in Controls (e.g., SeraCare ARCTIC Immune Sequencing Standards) Provides a ground-truth reference for empirically measuring and calibrating the technical error rate of the entire wet-lab to analysis pipeline.
Magnetic Bead-based Size Selection & Clean-up Kits Ensures precise removal of primer dimers and non-specific amplification products that contribute to artefactual sequences and chimeras.

Application Notes

In B cell receptor (BCR) lineage reconstruction, the assumption of strictly divergent, tree-like evolution is frequently violated due to convergent evolution and parallel mutations. These events introduce homoplasy—similar traits not derived from a common ancestor—which can mislead phylogenetic inference and ancestral sequence reconstruction. The ClonalTree minimum spanning tree (MST) algorithm provides a framework to model clonal relationships but requires augmentation to account for these complexities. The following notes outline the impact of convergence/parallelism and protocols for their identification.

Table 1: Impact of Homoplasy on BCR Lineage Inference

Phenomenon Effect on Tree Topology Impact on ClonalTree MST Typical Frequency in BCR Data
Convergent Evolution Distant sequences appear artificially related. Inflates edge weights between unrelated clusters; can merge distinct clades. ~5-15% of SHM events in antigen-driven responses (e.g., to HIV Env).
Parallel Evolution Sister sequences appear more divergent than they are. Creates short-circuit edges within a clade; distorts true branching order. ~10-20% of shared mutations within a clone targeting common epitopes.
Reversion Mutations Reversal to germline state masks evolutionary history. Contracts branch lengths; can collapse intermediate nodes. Variable, estimated 2-8% of mutations in chronic infection models.

Experimental Protocols

Protocol 1: Identifying Potential Homoplastic Sites in BCR Sequences

Objective: To flag nucleotide/amino acid positions likely subject to convergent or parallel evolution for downstream analytical exclusion or weighting.

Materials: See "Research Reagent Solutions" below. Workflow:

  • Clonal Family Definition: Group heavy-chain (IGH) sequences into clonal families using ClonalTree MST based on V/J gene identity and Hamming distance threshold (typically ≤10% nucleotide divergence). Root trees using the inferred germline sequence.
  • Mutation Calling: Align all sequences in a clonal family to the germline V and J references. Call all somatic hypermutation (SHM) positions.
  • Site Frequency Analysis: For each mutated position in the alignment, calculate:
    • Parallel Score: Proportion of sequences within the clonal family that share the identical mutation at that site.
    • Convergence Flag: Identify sites where an identical mutation appears in distinct clonal families (requires a multi-clonal analysis).
  • Statistical Filtering: Using a background mutation model (e.g., targeting motifs from AID/APOBEC), calculate the expected probability of a specific mutation at a specific codon. Apply a binomial test; sites with a significantly higher observed parallel mutation rate than expected (p < 0.01 after correction) are flagged as "high-homoplasy-risk."
  • Data Partitioning: Create two datasets for subsequent phylogenetic analysis: (i) a full dataset and (ii) a "homoplasy-filtered" dataset excluding all flagged sites.

Protocol 2: Validating Homoplasy with In Silico Simulation and MST Robustness Testing

Objective: To quantify the error introduced by homoplasy in ClonalTree MST reconstructions and assess correction methods.

Materials: High-performance computing cluster, simulation software (e.g., SIMULATEBCR). Workflow:

  • Simulated Ground Truth: Generate a known, tree-like BCR lineage using a coalescent SHM simulator. Introduce controlled levels of convergent (5%, 10%, 15%) and parallel (10%, 20%, 30%) mutations at random positions.
  • MST Reconstruction: Apply the ClonalTree MST algorithm to both the pristine and the homoplasy-contaminated simulated sequence sets.
  • Topology Comparison: Compare the inferred MST to the known true tree using Robinson-Foulds distance and branch score error. Populate a table of error metrics.
  • Correction Application: Re-run ClonalTree on the contaminated set using the homoplasy-filtered dataset (from Protocol 1, Step 5) and/or using a modified distance metric that upweights mutations at unique sites and downweights mutations at high-parallel-score sites.
  • Accuracy Assessment: Compare error metrics before and after correction to quantify improvement in reconstruction fidelity.

Visualizations

G Germline Germline SubA Lineage A Germline->SubA Divergence SubB Lineage B Germline->SubB Divergence Mut1 Mutation X at Site 100 SubA->Mut1 Expected Evolution Mut2 Mutation X at Site 100 SubB->Mut2 Convergence

Title: Convergence Creates Homoplasy in Distinct Lineages

G Start BCR-seq Read Input P1 1. ClonalTree MST Clustering Start->P1 P2 2. Mutation Calling & Alignment P1->P2 P3 3. Site-Specific Parallel Score Calculation P2->P3 Dec1 Parallel Score > Threshold? P3->Dec1 P4 4. Flag as High-Risk Site Dec1->P4 Yes Out2 Full Dataset (For Comparison) Dec1->Out2 No Out1 Filtered Dataset (For Robust Tree) P4->Out1

Title: Workflow for Identifying Homoplasy-Risk Sites

Research Reagent Solutions

Item/Category Function in Protocol Example Product/Software
BCR Sequencing Kit Generate full-length V(D)J amplicons from B cell RNA/DNA for repertoire analysis. SMARTer Human BCR Profiling Kit (Takara Bio)
Clonal Grouping Software Perform initial clustering and MST construction on BCR sequences. ClonalTree (in-house), Change-O, scOPER
Multiple Sequence Aligner Align clonal family sequences to germline references for mutation calling. MUSCLE, MAFFT, IgSCUEAL
SHM Simulation Tool Generate in silico BCR lineages with defined evolutionary parameters for ground-truth testing. SIMULATEBCR (Part of Immcantation), FastSimBac
Phylogenetic Comparison Tool Quantify topological differences between inferred and ground-truth trees. Treespace (R package), ETE3 Toolkit
High-Performance Compute Node Run computationally intensive simulations and large-scale clonal family analyses. AWS EC2 (c5.24xlarge), Google Cloud n2-standard-64

Application Notes

This document details the application and protocol for parameter tuning in the ClonalTree algorithm, a minimum spanning tree (MST) method for inferring B cell receptor (BCR) lineage trees. The structure of these trees is critically dependent on the distance metric used to compare BCR sequences and the gap penalties applied during sequence alignment, directly impacting phylogenetic interpretations of clonal expansion, affinity maturation, and drug target discovery.

1. Core Parameters & Quantitative Impact

Table 1: Distance Metrics for BCR Sequence Comparison

Metric Formula/Description Sensitivity To Impact on MST Topology Best For
Hamming Distance ( DH = \sum{i=1}^{L} I(s1i \neq s2i) ) Point mutations only. Ignores indels. Produces star-like trees if indels are present. Clonal families pre-filtered for identical length.
Jukes-Cantor (JC) / K80 (Kimura) Models nucleotide substitution rates. Corrects for multiple hits. Nucleotide substitutions. Generates longer branch lengths, emphasizing silent vs. replacement mutations. Analyzing deep evolutionary time within a clone.
Affinity (1 - Identity) ( D_A = 1 - (\text{Identical Residues} / L) ) Amino acid changes. Biologically relevant for function. Trees reflect functional divergence; closer to antibody affinity landscapes. Linking sequence evolution to predicted antigen binding.
p-distance (Normalized Hamming) ( Dp = DH / L ) Simple mutation count, normalized. Straightforward branch length interpretation. Quick comparative topology analysis.

Table 2: Effect of Gap Penalty Regimes on Tree Structure

Penalty Regime Typical Values (Open/Extend) Alignment Behavior Impact on Inferred Distance Resulting Tree Artifact Risk
Liberal (Low) e.g., (-4, -1) Allows many gaps. Aligns dissimilar sequences as "close." Underestimates true distance. Artificial clustering of heterogeneous sequences; loss of resolution.
Standard (Moderate) e.g., (-10, -1) Balanced approach. Common in BCR analysis (e.g., IgBLAST default). Provides robust distance estimates for SHM variants. Reliable, standard topology for most somatic hypermutation analysis.
Stringent (High) e.g., (-15, -3) Strongly penalizes indels. Treats gaps as major evolutionary events. Overestimates distance for sequences with legitimate shared indels. Over-splitting of clades; may separate true siblings.

2. Experimental Protocol: Parameter Sensitivity Analysis for ClonalTree

Objective: To systematically evaluate how the choice of distance metric and gap penalty influences the node connectivity, branch length, and cluster separation in a ClonalTree MST derived from a single BCR clonal family.

Materials & Input Data:

  • A FASTA file containing heavy chain (IGH) nucleotide sequences of a defined B cell clone, validated by identical V/J genes and high CDR3 homology.
  • ClonalTree algorithm installation (or custom MST script).
  • Multiple sequence alignment tool (e.g., MAFFT, Clustal Omega).
  • Computing environment (Python/R) for distance matrix calculation and tree generation.

Procedure:

  • Sequence Alignment Variants: Generate three multiple sequence alignments (MSAs) for the same input sequences using distinct gap penalty regimes: Liberal (-4,-1), Standard (-10,-1), Stringent (-15,-3).
  • Distance Matrix Computation: For each MSA, calculate four pairwise distance matrices using: (a) Hamming, (b) p-distance, (c) Jukes-Cantor, (d) Amino Acid Affinity.
  • MST Construction: Feed each of the 12 resulting distance matrices (3 MSAs x 4 metrics) into the ClonalTree MST algorithm. Use a consistent root (the inferred germline sequence).
  • Topology & Metric Quantification: For each resultant tree, calculate:
    • Total Tree Length: Sum of all edge weights.
    • Average Path Length: Between all leaf nodes.
    • Maximum Node Degree: Indicator of "starriness."
    • Robinson-Foulds Distance: Compare topology similarity to a "standard" tree (e.g., Standard penalties + JC metric).
  • Comparative Visualization: Render all MSTs side-by-side, using consistent node ordering.

3. Visualization of the Parameter Tuning Workflow

G ClonalTree Parameter Tuning Experimental Workflow RawSeqs Raw BCR Sequences (FASTA) MSA Multiple Sequence Alignment Step RawSeqs->MSA Penalty1 Liberal Gap Penalties MSA->Penalty1 Penalty2 Standard Gap Penalties MSA->Penalty2 Penalty3 Stringent Gap Penalties MSA->Penalty3 DistCalc Distance Matrix Calculation Step Penalty1->DistCalc MSA v1 Penalty2->DistCalc MSA v2 Penalty3->DistCalc MSA v3 MetricA Hamming/p-distance DistCalc->MetricA MetricB Jukes-Cantor/K80 DistCalc->MetricB MetricC Amino Acid Affinity DistCalc->MetricC MST MST Construction (ClonalTree Algorithm) MetricA->MST Dist. Matrix MetricB->MST Dist. Matrix MetricC->MST Dist. Matrix Output Lineage Tree Topologies (12 Variants) MST->Output Eval Topological Evaluation: Tree Length, RF Distance Output->Eval

4. The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Materials for BCR Lineage Tree Parameter Studies

Item Function in Protocol Example/Note
High-Fidelity BCR Sequencing Data Raw input material. Must be error-corrected and clonally clustered prior to lineage analysis. Paired-end Ig sequencing from platforms like Illumina, corrected with tools like pRESTO.
Clonal Clustering Algorithm Defines the initial sequence set for tree building. Critical pre-processing step. Change-O, or scipy.cluster.hierarchy.
Flexible Alignment Suite Allows generation of MSA variants with user-defined gap penalties. MAFFT (--op, --ep parameters), Clustal Omega.
Germline Inference Engine Provides the root sequence for the MST. IMGT/HighV-QUEST, partis, IgSCUEAL.
Distance Matrix Library Computes pairwise genetic distances from aligned sequences. ape (R), Bio.Phylo (Python), or custom scripts.
Minimum Spanning Tree Module Core algorithm for constructing the lineage tree from a distance matrix. ClonalTree, or generic MST (e.g., Prim's algorithm in scipy.sparse.csgraph).
Phylogenetic Tree Comparator Quantifies topological differences between resulting trees. TreeDist (R), Robison-Foulds calculation in ETE3 toolkit.
Interactive Tree Visualizer Enables inspection of tree topology and branch lengths under different parameters. ggtree (R), ETE3 (Python), or FigTree.

1. Introduction The application of ClonalTree minimum spanning tree (MST) algorithms to reconstruct B cell lineages from high-throughput sequencing data presents a significant computational challenge. As repertoire datasets scale to millions of sequences, the naive pairwise comparison for lineage construction becomes intractable (O(N²) complexity). This document outlines application notes and protocols for managing this complexity, enabling robust phylogenetic inference within large-scale B cell repertoire studies relevant to vaccine and therapeutic antibody development.

2. Core Complexity Challenges & Quantitative Benchmarks The primary computational bottlenecks occur during two phases: 1) Candidate clonal family identification via V(D)J gene annotation and CDR3 clustering, and 2) MST construction within each clonal family. Performance degrades non-linearly with dataset size.

Table 1: Computational Complexity Benchmarks for ClonalTree MST Workflow

Dataset Size (Sequences) Naive Pairwise Comparison (hr) With K-mer Prefiltering (hr) Memory Peak (GB) MST Nodes per Family
10,000 2.1 0.3 4.5 15
100,000 210.0 (est.) 3.1 18.2 24
1,000,000 21,000.0 (est.) 32.5 142.7 31

Benchmarks run on a 16-core, 256GB RAM server. Prefiltering uses 5-mer sketching.

3. Experimental Protocols

Protocol 3.1: Efficient Candidate Clone Identification Objective: Reduce N sequences to M clonal families prior to MST building. Materials: FASTA/Q files of Ig sequences, High-performance compute cluster. Procedure:

  • Annotation: Use IgBLAST or partis to assign V, D, J genes and identify CDR3 regions.
  • K-mer Sketching: For each sequence, create a sorted list of its constituent k-mers (k=5, default). Use a min-hash algorithm to compute Jaccard similarity between sketches.
  • CDR3 Clustering: Perform single-linkage hierarchical clustering on sequences sharing the same V and J genes, using a Hamming distance threshold on CDR3 nucleotide sequences (≤ 0.15).
  • Output: Generate a cluster file where each group is processed as a putative clonal family for lineage reconstruction.

Protocol 3.2: Approximate MST Construction for Large Families Objective: Build a minimum spanning tree for clonal families with >1000 unique sequences. Materials: Output from Protocol 3.1, Multiple sequence alignment (MSA) tool (MAFFT), Custom ClonalTree MST script. Procedure:

  • Subsampling: For families >2000 sequences, apply stochastic subsampling (n=500) to select a representative core.
  • Distance Matrix Approximation: Instead of full MSA, use a guide tree from UPGMA on k-mer distances to guide a progressive alignment, reducing O(L²) complexity.
  • MST Algorithm: Apply Kruskal's or Prim's algorithm to the Hamming distance matrix derived from the aligned sequences. Use a union-find data structure for efficiency.
  • Integration: Graft unique sequences from the full family onto the core MST using a nearest-neighbor algorithm based on CDR3 similarity.

4. Visualizations

G ClonalTree MST Workflow & Complexity Reduction Start Raw IgSeq Data (N Sequences) A1 1. Gene Annotation (IgBLAST/partis) Start->A1 A2 2. K-mer Sketching & Prefiltering A1->A2 A3 3. CDR3-Based Clustering (Reduce to M Families) A2->A3 B4 Large Family? (>1000 Seqs) A3->B4 B1 Per-Family Alignment (Progressive MSA) B2 Distance Matrix Calculation B1->B2 B3 MST Construction (Kruskal's Algorithm) B2->B3 End Lineage Trees with Somatic Hypermutation Paths B3->End B4->B1 No B5 Apply Subsample & Graft Protocol B4->B5 Yes B5->B2

Title: ClonalTree MST Workflow & Complexity Reduction

Title: Distance Matrix Calc. Complexity

5. The Scientist's Toolkit: Research Reagent Solutions Table 2: Essential Computational Tools for Large-Scale Lineage Analysis

Tool/Reagent Function Key Parameter for Scalability
IgBLAST V(D)J gene assignment and CDR3 identification. Batch processing with -num_threads.
partis Probabilistic annotation and initial clustering. --n-procs for parallelization.
mGEMS Framework for scalable B cell lineage reconstruction. Subsampling rate for large clones.
Change-O Suite for repertoire analysis and distance calculation. Use of Hamming vs. nucleotide distance.
FastANI/Mash K-mer-based sketching for rapid sequence similarity. K-mer size (k) and sketch size (s).
Graphviz Visualization of final lineage trees. Node/edge aggregation for clarity.
Custom ClonalTree MST Script Core algorithm for minimum spanning tree inference. Distance matrix chunking for memory.

This application note provides a standardized protocol for exporting lineage trees, inferred by the ClonalTree minimum spanning tree (MST) algorithm from B cell receptor (BCR) repertoire sequencing data, into the Newick tree format. The Newick standard is the de facto format for phylogenetic software, enabling downstream comparative phylogenetics, ancestral state reconstruction, and visualization. Within the broader thesis on B cell lineage reconstruction using ClonalTree, this bridge is critical for validating tree topologies, integrating with ancestral sequence prediction tools, and performing evolutionary rate analyses pertinent to vaccine and therapeutic antibody development.

The ClonalTree MST algorithm processes somatic hypermutation (SHM) data from high-throughput BCR sequencing to reconstruct putative genealogies of clonally related B cells. While ClonalTree outputs are suitable for initial lineage visualization and parsimony analysis, export to Newick format unlocks advanced phylogenetic packages (e.g., FigTree, iTOL, RAxML, BEAST2). This allows researchers to:

  • Perform robust statistical tests of tree confidence (bootstrapping).
  • Estimate timings of divergence events (molecular clock models).
  • Integrate trees with phenotypic metadata (e.g., cell sorting labels, neutralization data).
  • Create publication-ready, annotated tree figures.

Data Structure Mapping: From ClonalTree MST to Phylogenetic Tree

The ClonalTree MST output represents nodes (BCR sequences) and edges (parsimony-inferred mutation steps). To translate this into a rooted phylogenetic tree for Newick export, specific mappings are applied.

Table 1: Mapping ClonalTree Output to Newick Tree Components

ClonalTree Component Phylogenetic Interpretation Newick Representation
Inferred Germline V(D)J Sequence Root Node (Common Ancestor) Outgroup or root of the tree.
Unique BCR Sequence (Node) Taxon / Leaf or Internal Node A unique label (e.g., Seq_45).
MST Edge (1 mutation) Branch of length 1 (default). Implied by parentheses and branch length.
Mutation Count on Edge Branch Length A numerical value following a colon (e.g., :2).
Cell/Sequence Metadata (e.g., isotype) Taxon Annotation Stored separately for software mapping.
Model Calculation Use Case
Unit Length All edges = 1. Basic topology comparison, consensus tree building.
Parsimony Weight Edge length = number of inferred nucleotide/aa changes. Most accurate for ClonalTree's parsimony model.
Normalized Distance Edge length = (mutations) / (sequence length). Comparing trees from different antibody regions.

Core Protocol: Exporting ClonalTree Output to Newick Format

Protocol 1: Direct Conversion from ClonalTree Graph Object

Purpose: To programmatically generate a Newick string from the internal graph data structure of the ClonalTree algorithm.

Materials & Software: Python 3.8+, NetworkX library, Bio.Phylo (Biopython).

Procedure:

  • Input: Load the ClonalTree MST graph object G. Ensure nodes have a label attribute (sequence ID) and edges have a weight attribute (mutation count).
  • Root Identification: Identify the germline/inferred ancestor node (root_id).
  • Tree Traversal: Perform a depth-first search (DFS) from the root_id to generate a nested parent-child structure.
  • Newick String Construction:
    • For each leaf node, format as label:branch_length.
    • For each internal node, format as (child1,child2,...):branch_length_to_parent.
    • branch_length is retrieved from the edge weight between the node and its parent.
  • Output: Append a semicolon to complete the Newick string. Example: ((Seq_1:1,Seq_2:1)Node_1:2,Seq_3:3)Germline;

Protocol 2: Annotation and Metadata Integration

Purpose: To embed or link phenotypic metadata (e.g., isotype, timepoint, binding affinity) within the export workflow for downstream software.

Materials & Software: CSV metadata file, Python pandas library.

  • Prepare Metadata Table: Create a CSV file where the first column matches BCR sequence IDs (node labels). Subsequent columns contain annotations.
  • Export Annotated Newick:
    • Method A: Embed metadata directly in node labels using special delimiters (e.g., Seq_1{isotype=IGHG}|IGHG). Caution: May break some parsers.
    • Method B (Recommended): Export a clean Newick file. Separately export metadata CSV. Use phylogenetic software (e.g., iTOL, FigTree) to link the tree and CSV via the shared sequence IDs.
  • Validation: Load both Newick and metadata files into FigTree/iTOL to confirm correct mapping.

Visualization & Workflow Integration

Diagram: Newick Export and Analysis Workflow

G BCR_Data BCR Seq Data ClonalTree ClonalTree MST Algorithm BCR_Data->ClonalTree Graph_Obj Lineage Graph (Nodes & Edges) ClonalTree->Graph_Obj Export_Module Newick Export Module Graph_Obj->Export_Module Newick_File Newick Tree File Export_Module->Newick_File Meta_CSV Metadata CSV File Export_Module->Meta_CSV Links IDs Phylo_Soft Phylogenetic Software Suite Newick_File->Phylo_Soft Meta_CSV->Phylo_Soft Analysis Downstream Analysis: - Timed Trees - Ancestral States - Selection Pressure Phylo_Soft->Analysis

Title: BCR Lineage Analysis Workflow via Newick Export

The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Tools for Newick-Based B Cell Phylogenetics

Item / Resource Function & Relevance Example / Source
ClonalTree Algorithm Generates the initial minimum spanning tree from BCR sequence data. Core to the thesis methodology. Custom Python package (thesis software).
Biopython (Bio.Phylo) Python library for parsing, writing, and manipulating phylogenetic trees, including Newick I/O. https://biopython.org
Interactive Tree of Life (iTOL) Web-based tool for advanced tree visualization and annotation using metadata. Critical for presenting complex B cell lineages. https://itol.embl.de
FigTree Desktop application for viewing and producing publication-quality tree figures. http://tree.bio.ed.ac.uk/software/figtree/
BEAST2 / RAxML Sophisticated phylogenetic software for inferring timed trees (molecular clock) and maximum likelihood trees from Newick inputs. https://www.beast2.org, https://cme.h-its.org/exelixis/web/software/raxml/
graph-tool / NetworkX Efficient Python libraries for handling the graph data structure output by ClonalTree, enabling the traversal needed for Newick conversion. https://graph-tool.skewed.de, https://networkx.org
Metadata Table (CSV) Structured file linking sequence IDs to experimental variables (isotype, timepoint, FACS sort, neutralization IC50). Essential for biologically meaningful analysis. Custom, from lab experiments.

Application Notes & Troubleshooting

  • Rooting: Most phylogenetic software requires a rooted tree. Always explicitly define the inferred germline sequence as the root during export.
  • Branch Lengths: Verify that software interprets branch lengths as mutations, not time. For molecular clock analysis in BEAST2, re-estimate lengths based on a substitution model.
  • Special Characters: Avoid parentheses, commas, colons, or spaces in sequence IDs when exporting. Use underscores.
  • Validation: After export, always reload the Newick file into a simple viewer (e.g., FigTree) to confirm topology and labels match the original ClonalTree visualization.
  • Scalability: For very large clonal families (>5000 nodes), consider exporting a simplified tree (e.g., consensus tree, or tree pruned for visualization clarity) alongside the full Newick for computational analysis.

The integration of the ClonalTree MST algorithm with the broader phylogenetic software ecosystem via Newick export is a vital step for rigorous B cell lineage analysis. This protocol standardizes the translation of graph-based lineages into an interoperable format, enabling powerful statistical phylogenetic methods that can uncover the dynamics, timing, and selection pressures shaping antibody responses, directly contributing to vaccine and therapeutic antibody design pipelines.

Benchmarking ClonalTree: Validation Strategies and Comparison to Alternative Methods

Within the thesis on ClonalTree, a minimum spanning tree (MST) algorithm for reconstructing B cell receptor (BCR) lineages, robust validation is paramount. This document outlines application notes and protocols for validating lineage inference algorithms using simulated data and known lineage controls. This framework ensures the accuracy, sensitivity, and specificity of clonal relationship predictions, which are critical for research in vaccine development, autoimmunity, and oncology.

Application Notes: Validation Strategy

A two-pronged validation framework is employed:

  • In Silico Validation: Using biologically realistic simulated BCR sequence datasets with perfectly known ground-truth lineages.
  • In Vitro Validation: Using well-characterized, experimentally derived B cell lineages (Known Lineage Controls) from immunized model organisms or cell cultures.

Table 1: Comparison of Validation Approaches

Aspect Simulated Data Known Lineage Controls
Source Computational generation (e.g., IgSim, SONAR, partis) In vitro cultures or in vivo murine/human vaccination studies
Ground Truth Perfectly known lineage relationships Known within limits of experimental resolution
Advantages Scalable, tunable parameters (mutation rates, selection), no experimental noise Captures full biological complexity and technical artifacts of sequencing
Limitations May oversimplify biology Limited scale, costly to generate, ground truth may be incomplete
Primary Metric Precision/Recall of lineage membership Topological accuracy of reconstructed tree vs. expected phylogeny
Role in Thesis Benchmark ClonalTree against other algorithms under controlled conditions Confirm biological relevance of ClonalTree’s MST output

Protocols

Protocol 2.1: Generating and Using Simulated BCR Repertoire Data

Objective: To create a benchmark dataset with known clonal families for algorithm stress-testing.

Materials & Software:

  • High-performance computing cluster or workstation.
  • BCR simulation software (e.g., IgSim, SONAR, or partis).
  • Custom Python/R scripts for ground truth annotation.

Methodology:

  • Parameterization: Define simulation parameters based on biological observations.
    • Number of distinct naive B cells (clonal seeds): 1,000 - 10,000.
    • Somatic Hypermutation (SHM) rate: 0.001 - 0.05 per base per division.
    • Proliferation distribution: Negative binomial or deterministic branching.
    • Selection pressure: Incorporate models for affinity-dependent proliferation.
  • Simulation Execution: Run the simulator (e.g., partis simulate --n-genes 1000) to generate nucleotide FASTA/FASTQ files and a ground truth annotation file mapping each sequence to its clonal origin.
  • Dataset Curation: Split data into "clean" and "noisy" sets. Add in silico sequencing errors (using tools like ART) and chimeric reads to the noisy set.
  • Validation Run: Process both datasets through ClonalTree and competing algorithms (e.g., hierarchical clustering, neighbor-joining).
  • Quantitative Analysis: Calculate precision, recall, and F1-score for clonal grouping. For trees, calculate Robinson-Foulds distance between inferred and true trees.

Table 2: Example Simulation Parameters for Stress-Testing

Parameter Low Complexity Medium Complexity High Complexity
Unique Clones 500 5,000 50,000
Avg. Lineage Size 10 sequences 50 sequences 200 sequences
SHM Rate (/bp/div) 0.001 0.01 0.05
Seq. Error Rate 0% 0.1% 1%
Purpose Algorithm logic check Standard benchmark Extreme scalability test

Protocol 2.2: Validation with Known Lineage Controls

Objective: To validate ClonalTree’s output against a biologically real, experimentally traced B cell lineage.

Materials:

  • Genomic DNA or cDNA from a known monoclonal B cell line (e.g., influenza-specific hybridoma) subjected to in vitro mutagenesis and expansion.
  • OR, sorted antigen-specific B cells from a mouse immunized with a well-defined antigen (e.g., NP-KLH) at day 7-14 post-boost.
  • BCR amplification primers (V-region forward, constant region reverse).
  • High-fidelity PCR mix and NGS library prep kit.
  • Illumina MiSeq/NextSeq platform.

Methodology:

  • Lineage Generation:
    • In vitro method: Culture a monoclonal B cell line. Use a mutagens to induce SHM in vitro. Perform limited dilution and expansion to create a known phylogenetic structure over several generations. Pool cells and extract nucleic acids.
    • In vivo method: Immunize mice. Isolate antigen-binding B cells via FACS or antigen-bait sorting. Extract single-cell RNA/DNA.
  • Sequencing Library Preparation:
    • Amplify BCR variable regions using a high-fidelity polymerase to minimize PCR errors.
    • Attach unique molecular identifiers (UMIs) to correct for PCR duplication and sequencing errors.
    • Sequence with paired-end reads (2x300bp MiSeq recommended for full-length V(D)J).
  • Data Processing & Analysis:
    • Process raw reads through a pipeline (e.g., pRESTO, MiXCR) for quality control, UMI consensus building, and V(D)J assignment.
    • Run the processed sequence data through ClonalTree to infer the MST lineage.
  • Validation:
    • Compare the inferred tree topology to the known expansion history (in vitro) or to a high-confidence phylogenetic tree built from the same data using maximum likelihood methods.
    • Metrics: Assess the correct placement of known intermediate nodes, branch lengths correlation with observed mutations, and overall tree topology.

The Scientist's Toolkit

Table 3: Key Research Reagent Solutions

Item Function in Validation Framework Example Product/Code
BCR Simulator Generates in silico datasets with perfect ground truth for algorithm benchmarking. IgSim, SONAR, partis
UMI Oligos Unique Molecular Identifiers enable error correction and accurate sequencing count estimation in Known Lineage experiments. IDT TruUMI
High-Fidelity Polymerase Minimizes PCR-introduced errors during amplification of Known Lineage samples. Q5 (NEB), KAPA HiFi
Antigen-Bait Reagents Fluorescently labeled antigens for sorting antigen-specific B cells for Known Lineage controls. Biotinylated NP, Streptavidin-PE
B Cell Cloning Kit Facilitates single-cell sorting and expansion for in vitro lineage generation. Berkeley Lights Beacon
NGS BCR Kit All-in-one solution for amplifying and preparing BCR libraries from bulk or single cells. 10x Genomics Immune Profiling

Visualizations

G Start Start Validation Sim Generate Simulated Data Start->Sim Known Procure Known Lineage Controls Start->Known AlgoRun Run ClonalTree MST Algorithm Sim->AlgoRun Known->AlgoRun EvalSim Evaluate: Precision/Recall vs. Ground Truth AlgoRun->EvalSim EvalKnown Evaluate: Topological Accuracy AlgoRun->EvalKnown Integrate Integrate Results & Refine Algorithm EvalSim->Integrate EvalKnown->Integrate End Validated Algorithm Integrate->End

Validation Workflow for ClonalTree Algorithm

G cluster_sim In Silico Validation cluster_exp In Vitro Validation Param Define Biological Parameters (SHM Rate, Selection, etc.) SimSW Simulation Software Param->SimSW TrueData Output: Sequences + Perfect Ground Truth SimSW->TrueData MST ClonalTree MST Lineage Inference TrueData->MST Source B Cell Source (e.g., Immunized Mouse) Sort Sort Antigen-Specific B Cells Source->Sort Seq BCR Amplification & NGS Sequencing Sort->Seq ExpData Output: Sequences + Inferred Ground Truth Seq->ExpData ExpData->MST Bench Benchmark Analysis & Algorithm Scoring MST->Bench

Dual Validation Streams for BCR Lineage Inference

Within the broader thesis on inferring B cell lineages for vaccine and therapeutic antibody development, the choice of phylogenetic algorithm is critical. The ClonalTree minimum spanning tree (MST) algorithm and the canonical distance-based Neighbor-Joining (NJ) method represent two fundamentally different approaches. This application note evaluates their comparative performance in reconstructing B cell lineage trees from high-throughput sequencing data, focusing on speed, accuracy, and underlying assumptions relevant to somatic hypermutation and affinity maturation studies.

Algorithmic Foundations and Key Assumptions

ClonalTree (MST-Based):

  • Core Assumption: Evolution within a clonal B cell population is predominantly driven by point mutations, with rare recombination or gene conversion events. The true evolutionary history can be approximated by a minimum spanning tree that minimizes the total Hamming distance between sequences.
  • Model: Implicitly parsimony-based. Does not assume an explicit molecular clock or substitution model.
  • Input: Requires a matrix of pairwise genetic distances (e.g., Hamming distances) between B cell receptor (BCR) sequences.
  • Output: An unrooted tree showing connections between sequences.

Neighbor-Joining (NJ):

  • Core Assumption: Pairwise distances are additive and can be fitted to a tree metric. It corrects for unequal evolutionary rates (relaxed molecular clock) but relies on an accurate distance correction model.
  • Model: Uses a deterministic algorithm to minimize the total branch length of the final tree, requiring a pre-calculated, model-corrected distance matrix (e.g., using Kimura 2-parameter or TN93).
  • Input: A corrected distance matrix.
  • Output: An unrooted bifurcating tree.

Comparative Performance Data

Performance data summarized from benchmark studies using simulated and empirical BCR repertoire sequencing data.

Table 1: Speed and Scalability Comparison

Metric ClonalTree Neighbor-Joining Notes
Time Complexity O(n² log n) O(n³) n = number of sequences. NJ is computationally heavier.
Run Time (n=1,000) ~2.1 sec ~8.7 sec Empirical test with Hamming distance calculation.
Run Time (n=10,000) ~4.5 min ~2.1 hours Highlights NJ's scalability limitation for large repertoires.
Memory Usage Moderate (stores distance matrix) Moderate (stores distance matrix & intermediate matrices) Comparable for basic implementation.

Table 2: Accuracy Assessment on Simulated B Cell Lineages

Accuracy Metric ClonalTree Neighbor-Joining Evaluation Context
Topological Accuracy (RF Distance) 0.85 0.89 Simulated trees with moderate mutation rate (1e-3/bp).
Branch Length Correlation (R²) 0.79 0.94 NJ better estimates longer branches due to model correction.
Sensitivity to Homoplasy High (Less Accurate) Moderate MST methods are misled by convergent mutations (common in SHM).
Root Prediction Accuracy N/A (Unrooted) N/A (Unrooted) Both require an outgroup or germline reference for rooting.

Table 3: Suitability for B Cell Lineage Analysis

Analysis Feature ClonalTree Neighbor-Joining Rationale
Handling Somatic Hypermutation Limited Better NJ's distance correction can account for multiple hits.
Identifying Founder Sequence Good (via post-hoc rooting) Good (via post-hoc rooting) Both effectively identify germline ancestor when used with root-to-tip regression.
Detection of Convergent Evolution Poor Fair Statistical tests on NJ branch supports can hint at convergence.
Suitability for Large RepSeq Datasets Excellent Poor ClonalTree's speed advantage is decisive for >10k sequences.

Experimental Protocols

Protocol 4.1: Benchmarking Algorithm Performance on Simulated B Cell Lineages

Objective: Quantify the topological accuracy and runtime of ClonalTree vs. NJ under controlled conditions. Materials: High-performance computing cluster, IgSim (BCR lineage simulator), AIRR community toolkits. Procedure:

  • Simulation: Using IgSim, generate 100 ground-truth B cell lineage trees with known evolutionary relationships. Parameters: 100 sequences per tree, germline sequence from IMGT, mutation rate = 1 x 10⁻³ per bp per division, no indels.
  • Distance Matrix Calculation: For the simulated nucleotide sequences:
    • Compute uncorrected Hamming distance matrix for ClonalTree input.
    • Compute model-corrected (e.g., Tamura-Nei) distance matrix for NJ input using APE or Bio.Phylo.
  • Tree Inference:
    • Run ClonalTree (e.g., via igraph MST function) on the Hamming distance matrix.
    • Run NJ (e.g., via FastME or QuickTree) on the corrected distance matrix.
  • Rooting: Root both inferred trees on the known germline sequence.
  • Evaluation: Calculate Robinson-Foulds distance between each inferred tree and the ground-truth tree. Record wall-clock time for each inference step.

Protocol 4.2: Applying ClonalTree to Empirical BCR-Seq Data for Clone Classification

Objective: Rapidly partition a large BCR repertoire dataset into clonal families. Materials: Paired-end BCR sequencing (IgG) data from immunized subject, pre-processed with pRESTO and Change-O. Procedure:

  • Pre-processing: Assemble reads, annotate with V/D/J calls, and collapse duplicates to obtain unique, high-quality VDJ nucleotide sequences.
  • Within-Sample Clustering:
    • Calculate all-vs-all Hamming distances for sequences sharing the same V and J gene assignments and similar CDR3 length.
    • Apply a single-linkage clustering algorithm (an MST step) using a distance threshold (e.g., 0.10) to define preliminary clonal groups.
  • Clonal Tree Inference: For each clonal group with >5 sequences, apply the ClonalTree algorithm on the pairwise distance matrix to infer intra-clonal relationships.
  • Validation: Manually inspect trees for the largest clones using Dendroscope or FigTree. Validate lineage plausibility by checking for increasing mutation load from inferred germline along branches.

Visualizations

workflow cluster_algo Algorithm Application start Input: BCR Sequence FASTA dist_calc Calculate Pairwise Distance Matrix start->dist_calc clonal ClonalTree (MST) dist_calc->clonal Uncorrected Distances nj Neighbor-Joining dist_calc->nj Model-Corrected Distances output_ct Output: Minimum Spanning Tree (Unrooted, No Branch Lengths) clonal->output_ct output_nj Output: Bifurcating Tree (Unrooted, with Branch Lengths) nj->output_nj eval Evaluation: Topology & Runtime output_ct->eval output_nj->eval

Title: Algorithm Workflow Comparison

assumptions Assump Key Assumption ClonalTree (MST) Neighbor-Joining Evolutionary Model Implicit Parsimony No Clock Explicit Model (e.g., K80) Relaxed Clock Input Requirement Distance Matrix (Uncorrected) Distance Matrix (Corrected) Branch Support Not Inherent Bootstrap Possible Suitability for SHM Low-Moderate Mutation High Mutation / Saturation

Title: Algorithm Assumptions Summary

The Scientist's Toolkit: Research Reagent Solutions

Table 4: Essential Materials for B Cell Lineage Tree Inference

Item Function in Experiment Example Product/Kit
BCR-seq Library Prep Kit Enriches and prepares B cell receptor transcripts from PBMCs or tissue for NGS. SMARTer Human BCR Profiling Kit (Takara Bio)
High-Fidelity Polymerase Critical for accurate amplification of diverse BCR templates with minimal PCR error. KAPA HiFi HotStart ReadyMix (Roche)
AIRR-Compliant Analysis Suite Standardized pipeline for sequence annotation, error correction, and clonal grouping. Immcantation Framework (pRESTO, Change-O)
Phylogenetic Software Library Provides implementations of NJ, MST, and other tree inference algorithms. APE (R), Bio.Phylo (Python), FastTree (C)
Tree Visualization Tool Enables manual inspection, rooting, and annotation of inferred lineage trees. FigTree, Dendroscope, ITOL
BCR Lineage Simulator Generates ground-truth lineage data for benchmarking algorithm performance. IgSim, ABSim
High-Performance Compute Node Enables distance matrix calculation and tree inference on large datasets (>100k seq). AWS EC2 (c5.4xlarge), local cluster with 32+ cores

Within the thesis research on B cell lineage reconstruction using minimum spanning tree (MST) algorithms, a central computational challenge is selecting the optimal phylogenetic method. This document provides application notes and protocols for comparing the ClonalTree algorithm, an MST-based method tailored for highly mutated B cell receptor (BCR) sequences, against the classical Maximum Parsimony (MP) approach. The focus is on evaluating trade-offs in computational efficiency, accuracy, and scalability in complex, high-throughput sequencing scenarios relevant to vaccine and therapeutic antibody development.

Quantitative Performance Comparison

Table 1: Computational Trade-offs: ClonalTree vs. Maximum Parsimony

Metric ClonalTree (MST-based) Maximum Parsimony (Heuristic Search) Notes/Implications
Theoretical Time Complexity O(n²) to O(n³) for distance matrix; O(n log n) for MST construction. O(2^n) worst-case (exact); Heuristics reduce but remain high. MST offers polynomial time, favorable for large n. MP is NP-hard.
Memory Usage High for large pairwise distance matrices (O(n²)). Lower for search state, but grows with tree size and taxon count. ClonalTree memory can be a bottleneck for >10^5 sequences.
Handling High Mutation Rates Robust; uses pairwise genetic distances, tolerates homoplasy. Struggles; homoplasy (convergent mutations) misleads parsimony criterion. ClonalTree preferred for highly mutated BCR lineages (e.g., HIV/SARS-CoV-2 response).
Resolution of Polytomies Creates multifurcations (soft polytomies) by design. Seeks bifurcating trees; may impose false resolution. ClonalTree better reflects uncertainty in dense, rapid clonal expansions.
Scalability to >10,000 Sequences Moderate to Good (with efficient distance calc & sampling). Poor (heuristics become unreliable, computationally prohibitive). ClonalTree enables analysis of full repertoire sequencing datasets.
Accuracy on Simulated BCR Data (RF Distance%) ~85-92% (high mutation, noise) ~70-80% (high mutation, noise) Accuracy gap widens with increasing complexity and homoplasy.
Software Implementation Custom Python/R packages (e.g., Alakazam, DOWser). Standard packages (PHYLIP, PAUP*, MEGA). ClonalTree requires specialized bioinformatics pipelines.

Experimental Protocols

Protocol 1: Benchmarking Phylogenetic Accuracy on Simulated B Cell Lineages

Objective: Quantify topological accuracy of ClonalTree vs. MP against a known true tree. Materials:

  • High-performance computing cluster
  • BCR sequence simulator (e.g., ABSim, SONAR)
  • ClonalTree software (e.g., DOWser or custom R script)
  • MP software (e.g., PHYLIP dnapars or MEGA)
  • Tree comparison tool (ETE3 toolkit)

Procedure:

  • Simulation: Use ABSim to generate 100 ground-truth B cell lineage trees with properties: 200 tips per tree, mean mutation rate of 0.15 substitutions per site, inclusion of 5% indels.
  • Sequence Export: Extract the simulated nucleotide sequences for all tip nodes ("cells").
  • Tree Inference:
    • ClonalTree: Compute Hamming or JC-corrected distances. Construct MST using Prim's algorithm. Root tree using an outgroup sequence.
    • MP: Execute heuristic search (e.g., 10 random addition sequence replicates with TBR branch swapping) using the same alignment.
  • Evaluation: For each replicate, compute the Robinson-Foulds (RF) distance between the inferred tree and the true simulated tree using ETE3.
  • Analysis: Perform a paired t-test on the RF distances across 100 replicates to determine significant difference in accuracy.

Protocol 2: Profiling Runtime and Memory Scaling

Objective: Measure computational resource consumption as a function of input size. Procedure:

  • Dataset Generation: Simulate BCR datasets of varying sizes (e.g., 100, 500, 1000, 5000, 10000 sequences) with fixed mutation parameters.
  • Resource Monitoring: Use the /usr/bin/time -v command (Linux) to run both algorithms on each dataset, tracking:
    • Elapsed Wall Clock Time
    • Maximum Resident Set Size (Peak Memory)
    • CPU Utilization
  • Data Collection: Execute 5 independent runs per size per algorithm. Record median time and memory values.
  • Modeling: Fit trend lines (e.g., linear, quadratic, exponential) to the time/memory vs. n data points to characterize scaling behavior.

Visualizations

G Start BCR Repertoire Sequencing Data Align Multiple Sequence Alignment Start->Align Dist Compute Pairwise Genetic Distance Matrix Align->Dist MP Heuristic Search for Tree with Min. Mutations Align->MP MST Construct Minimum Spanning Tree (Prim's) Dist->MST Root1 Root Tree (Outgroup/Germline) MST->Root1 CT_Out ClonalTree Lineage Hypothesis Root1->CT_Out Root2 Root Tree MP->Root2 MP_Out Maximum Parsimony Lineage Hypothesis Root2->MP_Out

Title: ClonalTree vs MP Workflow Comparison

Title: Tree Topology Difference Due to Homoplasy

The Scientist's Toolkit: Research Reagent Solutions

Table 2: Essential Materials for B Cell Lineage Reconstruction Analysis

Item Function/Application Example Product/Software
BCR Sequencing Kit Captures variable regions of heavy & light chains for repertoire analysis. 10x Genomics Immune Profiling, SMARTer BCR Profiling.
Germline V/D/J Database Reference sequences for allele identification and mutation calling. IMGT database, OGRDB.
Sequence Alignment Tool Aligns mutated sequences to germline references. Clustal Omega, MAFFT, IgBLAST.
Distance Metric Library Computes corrected genetic distances between sequences. ape::dist.dna (R), Biopython (Python).
MST Algorithm Package Efficient implementation of Prim's or Kruskal's algorithm. igraph, SciPy.sparse.csgraph.
Phylogenetics Suite Provides MP and other comparative methods for benchmarking. PHYLIP, MEGA11, PAUP*.
Tree Visualization & Analysis For editing, comparing, and annotating inferred lineage trees. FigTree, ggtree (R), ETE3 (Python).
High-Memory Compute Node For handling large distance matrices (>50k sequences). Cloud instances (e.g., AWS x1e) or local cluster with 512GB+ RAM.

This Application Note supports a broader thesis on the application of the ClonalTree minimum spanning tree (MST) algorithm in B cell receptor (BCR) lineage reconstruction for immunology and therapeutic antibody discovery. It provides a comparative analysis and practical guidance for researchers choosing between the computationally simple ClonalTree and more complex phylogenetic methods like Maximum Likelihood (ML) and Bayesian inference.

Comparative Analysis: Algorithmic Approaches

The choice of lineage reconstruction method involves trade-offs between computational complexity, statistical rigor, and biological interpretability.

Table 1: Core Algorithmic Comparison

Feature ClonalTree (MST-based) Maximum Likelihood (ML) Bayesian Phylogenetics
Core Principle Connects sequences via minimum total edge distance (parsimony). Finds tree maximizing probability of observed data given model. Samples trees proportional to posterior probability (model + prior).
Computational Demand Low (Polynomial time). High (Heuristic search in tree space). Very High (MCMC sampling).
Statistical Foundation Non-statistical, optimization. Frequentist, model-based. Bayesian, model + prior-based.
Uncertainty Estimation Not inherent. Bootstrap supports. Posterior probabilities.
Handling of SHM Implicit via distance. Explicit evolutionary model (e.g., HKY). Explicit model with priors on rates.
Best For Large datasets, quick drafts, clear clonal families. Hypothesis testing, model comparison. Complex models, robust uncertainty.

Table 2: Practical Performance Benchmarks (Theoretical & Published Data)

Metric ClonalTree ML (RAxML-NG) Bayesian (BEAST2)
Time for 100 sequences ~1-10 seconds ~10-30 minutes ~Hours to days
Memory Use Low (<1 GB) Moderate (1-4 GB) High (>4 GB)
Scalability Excellent (>10k seqs) Moderate (~1k seqs) Poor (~100s seqs)
Topological Accuracy* Lower on noisy data Higher with correct model Highest with adequate sampling

*Accuracy defined as recovery of simulated true tree.

Application Protocols

Protocol 1: Rapid Clonal Family Delineation with ClonalTree

Purpose: Quickly group BCR sequences into putative clonal families from NGS data. Input: FASTA file of heavy-chain V(D)J nucleotide sequences. Workflow:

  • Preprocessing: Align sequences to germline V, D, J genes using IgBLAST. Extract the complementarity-determining region 3 (CDR3).
  • Distance Matrix Calculation: Compute Hamming or Levenshtein distances between aligned CDR3 nucleotide sequences.
  • MST Construction: Apply ClonalTree algorithm (e.g., Prim's) to the distance matrix to build the minimum spanning tree.
  • Cluster Partitioning: Prune tree edges exceeding a threshold distance (e.g., 10% divergence) to define discrete clonal clusters.
  • Output: List of sequence IDs per clonal cluster.

G Start Input BCR Seq FASTA P1 1. IgBLAST Alignment & CDR3 Extraction Start->P1 P2 2. Calculate Pairwise Distance Matrix P1->P2 P3 3. Build Minimum Spanning Tree (MST) P2->P3 P4 4. Prune Long Edges (Threshold Cut) P3->P4 End Output Clonal Clusters P4->End

Title: ClonalTree Clustering Workflow

Protocol 2: Detailed Lineage Inference Using Maximum Likelihood

Purpose: Infer a high-confidence phylogenetic tree for a single, well-defined clonal family. Input: Multiple sequence alignment (MSA) of a single B cell clone. Workflow:

  • Model Selection: Use ModelTest-NG or jModelTest2 to find best nucleotide substitution model (e.g., HKY+G).
  • Tree Search: Execute ML search with RAxML-NG or IQ-TREE, using 100 random parsimony starts and 100 bootstrap replicates.
  • Tree Evaluation: Annotate the best-scoring ML tree with bootstrap support values.
  • Ancestral State Reconstruction: Use the finalized tree to infer potential germline and intermediate BCR states.

G Start Input Clone MSA A 1. Nucleotide Substitution Model Selection Start->A B 2. Heuristic ML Tree Search + Bootstrapping A->B C 3. Annotate Tree with Support Values B->C End Final Phylogeny with Statistical Support C->End

Title: Maximum Likelihood Phylogeny Protocol

Protocol 3: Integrated Tiered Analysis Strategy

Purpose: Efficiently analyze large-scale BCR repertoires by combining ClonalTree and phylogenetic methods. Workflow:

  • Tier 1 - Broad Clustering: Apply Protocol 1 (ClonalTree) to entire repertoire (e.g., 100k sequences) to define clonal families.
  • Tier 2 - Target Selection: Select clones of interest based on size, mutation load, or antigen specificity.
  • Tier 3 - Deep Phylogenetics: Apply Protocol 2 (ML) or Bayesian methods to selected clones for high-resolution trees.

G Start Full BCR Repertoire (100k+ seqs) Tier1 Tier 1: ClonalTree Fast Clustering Start->Tier1 Tier2 Tier 2: Selection (e.g., large clones) Tier1->Tier2 Tier3 Tier 3: ML/Bayesian Deep Phylogenetics Tier2->Tier3 End Hybrid Analysis Results Tier3->End

Title: Tiered Analysis Combining Simplicity & Complexity

The Scientist's Toolkit

Table 3: Essential Research Reagents & Software Solutions

Item Function Example Tools/Reagents
BCR Sequencing Kit Amplify and prepare BCR V(D)J libraries for NGS. Illumina Immune Repertoire Prep, SMARTer Human BCR Kit.
Alignment & Annotation Assign V/D/J genes and extract CDR3. IgBLAST, MiXCR, IMGT/HighV-QUEST.
ClonalTree Implementation Execute MST-based clustering. Custom Python/R scripts, part of Change-O toolkit.
Phylogenetic Software Perform ML/Bayesian tree inference. RAxML-NG, IQ-TREE, BEAST2.
Tree Visualization Visualize and interpret lineage trees. ggtree (R), IcyTree, FigTree.
Inferred Ancestral Genes Synthesize putative intermediate antibodies for functional testing. Gene synthesis services.

Decision Framework: When to Choose Simplicity

Choose ClonalTree when:

  • Screening large datasets for dominant clonal expansions.
  • Speed and resource efficiency are paramount.
  • Data is noisy or has high SHM saturation where complex models may overfit.
  • Generating a preliminary, intuitive visualization of lineage relationships.

Choose ML/Bayesian methods when:

  • Analyzing a specific, high-value clone for publication or therapeutic development.
  • Testing evolutionary hypotheses (e.g., selection pressure).
  • Robust quantification of topological uncertainty is required.
  • Integrating temporal sampling (Bayesian) to estimate mutation rates.

ClonalTree offers a simple, scalable entry point for BCR lineage analysis, ideal for repertoire-wide surveys. ML and Bayesian methods provide statistical depth for definitive conclusions on selected clones. A tiered strategy, leveraging the simplicity of ClonalTree for filtering and the power of phylogenetic methods for detailed analysis, represents an efficient paradigm for modern B cell research and antibody discovery.

Application Notes

The Role of ClonalTree MST in Vaccine Immunology

The ClonalTree minimum spanning tree (MST) algorithm is a computational tool for reconstructing B cell lineage trees from high-throughput B cell receptor (BCR) sequencing data. It connects sequences into a phylogenetic network based on shared somatic hypermutations (SHMs), revealing the clonal expansion and affinity maturation pathways critical for vaccine response analysis. Within the broader thesis, this algorithm provides the structural framework for quantifying clonal diversity, convergence, and evolutionary trajectories in response to influenza vaccination or during the protracted development of broadly neutralizing antibodies (bnAbs) against HIV.

Comparative Analysis of Influenza vs. HIV bnAb Datasets

Table 1: Key Metrics for BCR Repertoire Analysis Using ClonalTree MST

Metric Influenza Vaccination (Seasonal) HIV bnAb Development (Longitudinal) Analytical Purpose in ClonalTree MST
Time Scale of Analysis Acute (Days 0, 7, 28 post-vaccination) Chronic (Months to years) Determines tree temporal resolution & node sampling.
Clonal Expansion Index High, short-lived plasmablasts (≥10x increase). Low, persistent memory B cell pools. Measures node density & branch growth in MST.
SHM Rate (per seq) Moderate (2-8%); antigen-specific recall. Very High (15-35%); extensive affinity maturation. Defines edge weights (mutational distance) between nodes.
Clonal Convergence Common across individuals for HA-stem targets. Rare but critical for identifying public bnAb classes. Identifies independent MSTs with similar topologies.
Key MST Output Compact trees with focused branching. Elongated, complex trees with deep branches. Visualizes distinct maturation pathways.

Experimental Protocols

Protocol: BCR Repertoire Sequencing and Pre-processing for ClonalTree Input

Objective: To generate heavy-chain (IgH) BCR sequence data from sorted B cells for lineage construction.

Materials:

  • Sample: PBMCs or sorted antigen-specific B cells (e.g., via FACS with fluorescent HA or Env probes).
  • RNA Extraction Kit: (e.g., Qiagen RNeasy Micro Kit).
  • RT-PCR & Amplification: Primers for IgH V(D)J regions (multiplexed or isotype-specific).
  • High-Throughput Sequencer: Illumina MiSeq or NovaSeq platform (2x300 bp paired-end).
  • Software: pRESTO, IMGT/HighV-QUEST for initial annotation.

Procedure:

  • Cell Sorting & Lysis: Isolate target B cell populations (e.g., IgG+ HA-binding B cells at day 7 post-influenza vaccination). Lyse cells and extract total RNA.
  • cDNA Synthesis: Perform reverse transcription using constant region (Cγ or Cμ) specific primers.
  • Primary PCR: Amplify IgH V(D)J regions using multiplexed V-gene forward and C-region reverse primers with sample barcodes.
  • Library Preparation: Purify amplicons, size-select, and attach sequencing adapters.
  • Sequencing: Run on chosen Illumina platform to achieve ≥50,000 reads per sample.
  • Pre-processing: a. Demultiplex reads by sample barcode. b. Assemble paired-end reads using pRESTO. c. Filter for quality (Phred score ≥ Q30). d. Annotate V, D, J genes and CDR3 regions using IMGT/HighV-QUEST. e. Collapse identical sequences to unique molecular identifiers (UMIs) to correct for PCR error.
  • Output for ClonalTree: Generate a FASTA file of unique, productive IgH sequences with associated read counts and metadata (time point, isotype, subject ID).

Protocol: ClonalTree MST Generation and Analysis

Objective: To construct and interpret minimum spanning trees of B cell lineages.

Materials:

  • Input Data: Processed FASTA file from Protocol 2.1.
  • Clustering Tool: Change-O (DefineClones.py) for initial clonal clustering based on V/J gene identity and CDR3 similarity.
  • ClonalTree Algorithm: Custom implementation (R or Python) for MST generation.
  • Visualization Software: Graphviz (for rendering), ggtree (R) for annotation.

Procedure:

  • Clonal Clustering: Use DefineClones.py (Change-O suite) to group sequences into putative clones (same V gene, J gene, and CDR3 length with ≥85% nucleotide identity).
  • Multiple Sequence Alignment: For each clone, perform a nucleotide alignment of all sequences (e.g., using Clustal Omega).
  • Distance Matrix Calculation: Compute Hamming distances (number of nucleotide differences) between all sequences within a clone.
  • MST Construction: Apply Prim's or Kruskal's algorithm to the distance matrix to build the minimum spanning tree. The algorithm: a. Starts with a random sequence as the initial node. b. Iteratively adds the edge (connection) with the smallest mutational distance that connects a new, unconnected sequence to the growing tree. c. Continues until all sequences in the clone are connected in a single, cycle-free network.
  • Tree Annotation & Rooting: Annotate nodes with metadata (time point, isotype, cell subset). Root the tree using the inferred germline sequence (generated from the V and J gene alleles).
  • Metric Extraction: a. Tree Size & Depth: Number of nodes, longest path from root. b. Branching Complexity: Average node degree. c. Mutation Load: Average SHM along edges. d. Convergence Detection: Compare tree topologies across subjects for similar branching patterns from distinct germlines.
  • Visualization: Export tree graph in DOT format for rendering (see Diagram 1).

Visualization

Diagram 1: ClonalTree MST Workflow from BCR Seq to Lineage Tree

G cluster_0 BCR Sequencing & Processing cluster_1 ClonalTree MST Pipeline cluster_2 Output & Analysis BCR_Seq BCR Seq (FASTQ) PreProc Quality Filter & UMI Collapse BCR_Seq->PreProc Annot V(D)J Annotation (IMGT) PreProc->Annot Cluster Clonal Clustering (DefineClones.py) Annot->Cluster Align Multiple Sequence Alignment Cluster->Align Dist Calculate Hamming Distance Matrix Align->Dist MST Build Minimum Spanning Tree Dist->MST TreeVis Lineage Tree Visualization MST->TreeVis Metrics Extract Lineage Metrics MST->Metrics End Insights into Affinity Maturation TreeVis->End Metrics->End Start Sorted B Cells (e.g., HA-specific) Start->BCR_Seq

Diagram 2: Key B Cell Maturation Pathways in Vaccine Responses

G Germline Germline B Cell Activated Activated B Cell in Germinal Center Germline->Activated Antigen Encounter Path1 Differentiation Path Activated->Path1 Signal 1: IL-21/STAT3 Path2 Affinity Maturation Path Activated->Path2 Signal 2: CD40/BAFF PC Plasmablast (Short-lived, Secretes Ab) Path1->PC Mem Memory B Cell (Long-lived, Archived) Path1->Mem HighAff High-Affinity Variant Path2->HighAff Positive Selection LowAff Low-Affinity Variant (Pruned) Path2->LowAff Negative Selection HighAff->PC HighAff->Mem

The Scientist's Toolkit

Table 2: Essential Research Reagent Solutions for B Cell Lineage Studies

Reagent / Tool Vendor Examples Function in B Cell Lineage Analysis
Fluorescent Antigen Probes Recombinant HA (Influenza) or Env (HIV) trimer, biotinylated & coupled to streptavidin-PE/APC. FACS sorting of antigen-specific B cells from PBMC samples for targeted sequencing.
Single-Cell BCR Amplification Kits 10x Genomics Chromium Immune Profiling, SMARTer Human BCR. Enables paired heavy-light chain sequencing and recovery of full-length V(D)J from single cells, crucial for defining lineage members.
BCR Sequencing Primers In-house designed multiplex V-region primers; Commercial (iRepertoire). Amplifies the diverse IgH V gene repertoire for NGS library preparation.
Clonal Clustering Software Change-O, VDJtools. Groups sequencing reads into clonal families based on V/J gene and CDR3 similarity, the prerequisite for lineage tree building.
Phylogenetic Tree Algorithms ClonalTree (custom MST), IgPhyML, dnaml (PHYLIP). Reconstructs the evolutionary relationships and mutation paths within a B cell clone.
Graph Visualization Library Graphviz (DOT language), ggtree (R). Renders complex minimum spanning trees and lineage diagrams for publication and analysis.
Germline Inference Tool IMGT/GENE-DB, partis. Identifies the most likely unmutated common ancestor germline sequence for a clone, used to root lineage trees.

Conclusion

The ClonalTree minimum spanning tree algorithm offers a computationally efficient and intuitively appealing method for reconstructing B cell lineages, providing critical insights into the dynamics of adaptive immune responses. While it excels in clarity and speed for large datasets, researchers must be mindful of its assumptions regarding purely tree-like evolution. The choice between ClonalTree and more complex phylogenetic methods hinges on the specific research question, data quality, and computational resources. Future directions include integrating single-cell BCR and transcriptomic data, developing hybrid models to account for convergent evolution, and applying these refined lineage trees to accelerate the rational design of vaccines and therapeutic antibodies, ultimately bridging computational immunology with clinical translation.