Decoding B Cell Lineages: A Comprehensive Guide to HILARy Clonal Family Inference from Repertoire Sequencing Data

Layla Richardson Jan 12, 2026 419

This article provides a detailed guide to HILARy (Hierarchical Clustering for Lineage Analysis of Repertoires), a computational method for inferring B cell clonal families from high-throughput immune repertoire sequencing (Rep-Seq)...

Decoding B Cell Lineages: A Comprehensive Guide to HILARy Clonal Family Inference from Repertoire Sequencing Data

Abstract

This article provides a detailed guide to HILARy (Hierarchical Clustering for Lineage Analysis of Repertoires), a computational method for inferring B cell clonal families from high-throughput immune repertoire sequencing (Rep-Seq) data. Aimed at researchers and drug development professionals, we cover the foundational principles of B cell receptor diversification and the necessity of clonal inference. We then detail HILARy's methodological workflow, from data preprocessing to phylogenetic tree construction, and its applications in vaccine response tracking and autoimmune disease research. A dedicated troubleshooting section addresses common data quality and algorithmic challenges, while a comparative analysis validates HILARy against tools like partis and Change-O. The conclusion synthesizes best practices, highlights the translational impact on therapeutic antibody discovery and personalized medicine, and outlines future computational and experimental directions.

Understanding B Cell Clonality: Why HILARy Inference is Fundamental to Immunology Research

Application Notes

This document details experimental frameworks for studying B cell clonal dynamics, with data integrated into the HILARy (Hierarchical Inference of Lineage and Affinity from Repertoires) computational pipeline. The goal is to infer clonal families from B cell receptor (BCR) repertoire sequencing (RepSeq) data to elucidate the biological processes of somatic hypermutation (SHM), clonal expansion, and affinity maturation—critical for vaccine and therapeutic antibody development.

Key Quantitative Benchmarks in Affinity Maturation Studies The following table summarizes typical quantitative outcomes from germinal center reactions and in vitro maturation experiments.

Parameter Typical Range/Value Experimental Context Relevance to HILARy Inference
SHM Rate (per bp per division) ~10⁻³ to 10⁻⁵ In vivo GC B cells Basis for phylogenetic tree construction.
Average Mutation Load (IgG) 10-30 nucleotides Memory B cells post-immunization Distinguishes naive from expanded clones.
Clonal Expansion Factor 1x to >10,000x Antigen-specific B cell numbers Inferred from read depth and unique sequences.
Affinity Increase (Kd) 10 nM to 10 pM (100-10,000x) After 4-6 rounds of in vitro maturation Validates functional outcome of inferred lineages.
Productive Rearrangement % ~33% (1 in 3) From bulk BCR-Seq Filters non-functional sequences pre-inference.
Dominant Clone Frequency Can exceed 50% of antigen-specific response Response to protein antigens Identifies key lineages for therapeutic ablation.

HILARy Integration Context: The quantitative data above provides prior expectations for the algorithm. For instance, mutation rates inform the nucleotide substitution model, while expansion factors help differentiate true clonal expansion from PCR duplication artifacts.

Experimental Protocols

Protocol 1: Antigen-Specific B Cell Isolation and BCR Repertoire Sequencing

Objective: Generate high-fidelity BCR heavy-chain (IGH) repertoire data from antigen-specific B cells for HILARy lineage inference.

Materials:

  • Antigen Conjugates: Biotinylated antigen of interest for fluorescent tagging.
  • Magnetic/Cell Sorting: Streptavidin-coated magnetic beads or FACS Aria.
  • Cell Lysis Buffer: For single-cell RNA/DNA extraction.
  • Reverse Transcription Primers: Oligo-dT or gene-specific primers for Ig constant regions.
  • Multiplex PCR Primers: V-gene family-specific forward primers and isotype-specific reverse primers.
  • High-Fidelity DNA Polymerase: e.g., Q5 or KAPA HiFi to minimize PCR errors.
  • Next-Generation Sequencing Platform: Illumina MiSeq/Novaseq with 2x300bp kits for full-length VDJ.

Methodology:

  • Cell Staining & Sorting: a. Suspend PBMCs or splenocytes in FACS buffer. b. Stain with fluorescently labeled (e.g., PE) antigen-biotin-streptavidin complex. c. Include antibodies for B cell markers (CD19+, CD20+) and exclusion markers (CD3-, CD14-). d. Sort double-positive (antigen+, CD19+) single B cells into 96-well plates containing lysis buffer. Sort an equivalent number of antigen-negative B cells as a control.
  • Single-Cell BCR Amplification: a. Perform reverse transcription using a primer targeting the Ig constant region. b. Conduct a first-round multiplex PCR using V-gene family primers. c. Perform a second-round nested PCR with barcoded Illumina adapters to add unique molecular identifiers (UMIs). d. Purify amplicons.

  • Library Preparation & Sequencing: a. Quantify purified PCR products and pool equimolarly. b. Prepare sequencing library following platform-specific protocols. c. Sequence on an Illumina platform to achieve >1000x coverage per cell.

  • Data Processing for HILARy: a. Use tools like pRESTO or MiXCR for UMI-aware consensus assembly, V(D)J alignment, and error correction. b. Output a filtered, high-quality FASTA file of IGH VDJ nucleotide sequences with associated metadata (isotype, UMI count). c. This curated FASTA is the direct input for the HILARy inference pipeline.

Protocol 2: In Vitro Affinity Maturation & Kinetic Analysis

Objective: Validate the functional significance of inferred clonal lineages by expressing antibodies and measuring affinity improvements correlating with SHM patterns.

Materials:

  • Expression Vectors: Mammalian (e.g., HEK293) or prokaryotic (e.g., scFv phage display) systems.
  • Site-Directed Mutagenesis Kits: To revert or introduce specific mutations from lineage.
  • Surface Plasmon Resonance (SPR) Chip: CMS sensor chip for antigen immobilization.
  • BLI (Bio-Layer Interferometry) System: Alternative to SPR for kinetic measurements.

Methodology:

  • Antibody Expression: a. Synthesize and clone the VDJ sequences of putative progenitor and descendant antibodies from the HILARy-inferred tree into IgG1 expression vectors. b. Co-transfect heavy and light chain plasmids into Expi293F cells using a transfection reagent. c. Harvest supernatant after 5-7 days, purify antibodies using Protein A affinity chromatography.
  • Affinity Measurement via SPR: a. Dilute antigen to 10-50 µg/mL in sodium acetate buffer (pH 4.5-5.5) and immobilize on a CMS chip via amine coupling to reach ~100-200 Response Units (RU). b. Use HBS-EP+ (10mM HEPES, 150mM NaCl, 3mM EDTA, 0.05% v/v Surfactant P20, pH 7.4) as running buffer. c. Inject purified antibodies at 5 concentrations (e.g., 0.8 nM to 100 nM) over the antigen surface at a flow rate of 30 µL/min. d. Regenerate the surface with 10 mM Glycine-HCl (pH 2.0). e. Fit the resulting sensograms to a 1:1 Langmuir binding model using the Biacore Evaluation Software to calculate association rate (ka), dissociation rate (kd), and equilibrium dissociation constant (KD = kd/ka).

  • Data Correlation: a. Plot the calculated KD values against the mutational distance from the inferred germline sequence for each antibody. b. Statistically test (e.g., linear regression) for a correlation between increased affinity (lower KD) and branch length in the HILARy phylogenetic tree.

Mandatory Visualization

workflow Start Sample Collection (PBMCs/Spleen) Sort FACS: Antigen-Specific B Cell Isolation Start->Sort SeqPrep Single-Cell BCR Amplification & NGS Sort->SeqPrep DataProcess Sequence Processing (UMI consensus, VDJ alignment) SeqPrep->DataProcess HILARy HILARy Pipeline: Clonal Family Inference DataProcess->HILARy Output1 Phylogenetic Trees & Mutation Maps HILARy->Output1 Output2 Clonal Expansion Metrics HILARy->Output2 Validate Functional Validation (Antibody Expression, SPR) Output1->Validate Thesis Integrative Thesis on Clonal Dynamics Output1->Thesis Output2->Validate Output2->Thesis Validate->Thesis

Title: HILARy Repertoire Analysis and Validation Workflow

Title: Germinal Center Pathway for Affinity Maturation

The Scientist's Toolkit: Research Reagent Solutions

Item Function in SHM/Clonal Expansion Research
Biotinylated Antigen Critical for isolating antigen-specific B cells via streptavidin beads or FACS.
Anti-Human CD19/20 & Isotype Antibodies For positive selection of B lymphocytes and isotype switching analysis.
Single-Cell Lysis & RT Kit Preserves RNA from individual sorted B cells for accurate VDJ amplification.
Multiplex Ig Primer Sets Amplifies the full diversity of V genes from limited template for RepSeq.
Unique Molecular Identifiers (UMIs) Short random nucleotide tags that distinguish true biological sequences from PCR errors.
High-Fidelity Polymerase Essential for minimizing polymerase errors during library prep to accurately call SHMs.
IgG Expression Vectors Mammalian vectors for recombinant expression of inferred lineage antibodies.
Protein A/G Agarose For purification of recombinant IgG from culture supernatants for binding assays.
SPR/BLI Consumables Sensor chips and buffers for kinetic analysis of antibody-antigen interactions.
AID Inhibitor (e.g., HM13) Chemical probe to inhibit AID activity in vitro, validating its role in observed SHM.

Application Notes: Integrating HILARy Clonal Family Inference into the Rep-Seq Pipeline

Repertoire sequencing (Rep-Seq) generates vast datasets of adaptive immune receptor sequences. The core challenge is transforming these raw sequences into biological insights about clonal expansion, diversity, and antigenic drivers. The HILARy (High-performance Inference of Lymphocyte Antigen Reactivity) framework addresses a critical bottleneck: accurately inferring clonal families—groups of cells descended from a common progenitor—from noisy, high-throughput sequencing data. Accurate clonal grouping is the foundational step for all downstream analyses, including identifying pathogenic or therapeutic clones.

Table 1: Quantitative Comparison of Key Clonal Inference Methods

Method Core Algorithm Key Strength Primary Limitation Typical Runtime (1M reads)
HILARy Hierarchical clustering with probabilistic thresholding & phylogenetic refinement High accuracy in distinguishing true somatic hypermutation from PCR/sequencing error; integrates V/J gene identity. Computationally intensive for ultra-deep repertoires. ~60 minutes
Partis Hidden Markov Model (HMM) with expectation-maximization Simultaneously annotates V(D)J segments and clusters clones; models the recombination process. Can be complex to parameterize for non-standard species. ~45 minutes
Change-O Single-linkage clustering on Hamming distance Fast, simple, and highly customizable with user-defined thresholds. Accuracy highly dependent on user-selected, fixed distance thresholds. ~15 minutes
Decombinator Tag-based clustering followed by annotation Extremely fast initial clustering using unique molecular identifiers (UMIs) and core tags. Less effective for highly mutated sequences where core tags diverge. ~5 minutes

The HILARy protocol emphasizes a two-stage validation: first, in silico validation using simulated datasets with known ground truth; second, experimental validation using spike-in control clones or paired single-cell sequencing data to confirm clonal relationships.

Detailed Protocols

Protocol 1: HILARy-Based Clonal Family Inference from Raw FASTQ Files

Objective: To process raw Rep-Seq reads into annotated, clonally grouped data ready for immune repertoire analysis.

Materials & Reagent Solutions:

  • Raw Sequencing Data: Paired-end FASTQ files from platforms like Illumina MiSeq/NextSeq.
  • UMI-tagged Libraries: Essential for error correction and accurate PCR duplicate removal.
  • HILARy Software Suite: Available via GitHub, requires installation of dependencies (Python >=3.8, R >=4.0).
  • Reference Databases: IMGT/V-QUEST database for germline V, D, J gene alignment.
  • High-Performance Computing (HPC) Cluster: Recommended for full-scale analysis.

Procedure:

  • Preprocessing & Alignment:
    • Use pRESTO or MiXCR to perform quality filtering, paired-read assembly, and UMI-based consensus building.
    • Align consensus sequences to germline V, D, and J gene segments using IgBLAST or integrated alignment within partis.
  • Clonal Inference with HILARy:
    • Execute the primary HILARy algorithm: hilarity infer --input annotated_seq.json --output clusters.json.
    • The algorithm performs: a. Initial grouping by identical V and J gene assignments and CDR3 length. b. Hierarchical clustering within groups using a normalized Hamming distance metric on the CDR3 nucleotide sequence. c. Application of a model-based threshold that accounts for sequencing error and somatic hypermutation rate, rather than a fixed distance cutoff. d. Optional phylogenetic refinement to resolve ambiguous edges.
  • Post-processing:
    • Generate a clonal abundance table (counts per clone) and a detailed lineage report.
    • Export data in standard formats (.tsv, .json) for downstream tools like Immunarch or VDJtools.

Protocol 2: Experimental Validation of Inferred Clones via Single-Cell Sequencing

Objective: To validate computationally inferred clonal families using paired single-cell RNA-seq (scRNA-seq) with V(D)J enrichment.

Materials & Reagent Solutions:

  • Cryopreserved PBMCs or Tissue Sample: From the same donor as bulk Rep-Seq.
  • Chromium Controller & Kit (10x Genomics): For single-cell partitioning and library prep (e.g., Single Cell 5' v2 with Feature Barcoding).
  • Cell Ranger VDJ Pipeline: For processing single-cell V(D)J data.
  • Custom R/Python Scripts: For cross-modality data integration.

Procedure:

  • Single-Cell Library Preparation: Prepare libraries according to the 10x Genomics protocol for Gene Expression and V(D)J enrichment.
  • Sequencing & Data Processing: Sequence on an Illumina platform and process through the cellranger vdj pipeline to obtain per-cell clonotype calls.
  • Integration with HILARy Output:
    • Map the bulk Rep-Seq clones (from HILARy) to single-cell clonotypes by comparing CDR3 nucleotide sequences and V/J gene usage.
    • Validate that sequences HILARy grouped into one clone are found in single cells sharing the same clonotype.
    • Calculate validation metrics: Precision (What fraction of computationally inferred clones are confirmed by single-cell data?) and Recall (What fraction of single-cell clonotypes were captured by the bulk inference?).

Visualizations

G RawFASTQ Raw FASTQ Files (Paired-end + UMIs) Preproc Preprocessing & UMI Consensus RawFASTQ->Preproc Align V(D)J Alignment & Annotation Preproc->Align HILARy HILARy Clonal Inference Engine Align->HILARy Clusters Clustered Sequences (Clonal Families) HILARy->Clusters Downstream Downstream Analysis: Diversity, Selection, Tracking Clusters->Downstream

Title: Core Rep-Seq Workflow with HILARy

G Start Annotated Nucleotide Sequences Step1 1. Group by V gene, J gene, & CDR3 length Start->Step1 Step2 2. Hierarchical Clustering (CDR3 Nucleotide Distance) Step1->Step2 Step3 3. Model-Based Threshold (Error + SHM Model) Step2->Step3 Step4 4. Phylogenetic Refinement (Optional) Step3->Step4 End Final Clonal Family Assignments Step4->End

Title: HILARy Clonal Inference Steps

The Scientist's Toolkit: Research Reagent & Software Solutions

Table 2: Essential Resources for Rep-Seq Analysis

Item Function/Description Example Vendor/Resource
UMI Adapters Unique Molecular Identifiers linked to each starting molecule during library prep, enabling accurate error correction and removal of PCR duplicates. IDT, Twist Bioscience
IMGT Database The international reference for immunoglobulin and T-cell receptor germline gene sequences. Critical for accurate V(D)J alignment. IMGT.org
IgBLAST Standard tool for aligning antigen receptor sequences to germline V, D, and J genes. NCBI
pRESTO Toolkit Suite of Python scripts for processing raw Rep-Seq reads (quality control, assembly, UMI handling). pRESTO on GitHub
MiXCR Comprehensive, all-in-one software for Rep-Seq data analysis from raw reads to clonal quantification. MiXCR by Milaboratory
Immunarch R Package Powerful R package for downstream repertoire analysis, visualization, and diversity estimation. Immunarch on GitHub
10x Genomics Chromium Platform for generating paired single-cell gene expression and V(D)J data for experimental validation. 10x Genomics
Cell Ranger Official software suite for processing data from 10x Genomics single-cell V(D)J experiments. 10x Genomics

In B-cell and T-cell receptor (TCR) repertoire sequencing, a clonal family comprises a set of lymphocyte descendants originating from a single, antigen-naïve progenitor. Accurate clonal family inference is fundamental for studying adaptive immune responses, tracking clonal expansion in disease, and identifying targets for therapeutic development. This protocol details the core concepts and methodologies for defining clonal families within the context of the HILARy (Hierarchical Inference of Lymphocyte Antigen Receptor) framework, integrating V(D)J gene sharing, CDR3 similarity, and probabilistic germline reconstruction.

Key Concepts & Data

V(D)J Gene Segment Sharing

Clonal members originate from the same germline V, D, and J gene segments. Allelic differences must be accounted for.

Table 1: Criteria for V(D)J Gene Assignment in Clonal Grouping

Gene Segment Matching Requirement Common Allele Handling Method
V Gene Identical gene, allowing for allelic ambiguity IMGT database alignment with a 97-100% identity threshold.
D Gene Identical gene (highly permissive due to trimming) Required for BCR heavy chains and TCR β/δ chains. Often inferred.
J Gene Identiguous gene Critical for junctional boundary definition.

CDR3 Amino Acid Sequence Similarity

The complementarity-determining region 3 (CDR3) is the hypervariable core of the antigen-binding site. Clonal relatives exhibit highly similar CDR3 sequences.

Table 2: Common CDR3 Similarity Metrics & Thresholds

Metric Description Typical Clonal Threshold
Hamming Distance Count of amino acid substitutions. ≤ 2 for sequences of equal length.
Levenshtein Distance Count of insertions, deletions, and substitutions. ≤ 3-4, adjusted for sequence length.
Normalized Identity Score (Identical positions) / (alignment length). ≥ 0.85 (85% identity).

Germline Sequence Reconstruction

Inference of the original, unmutated germline V(D)J sequence of the founding B-cell is essential for studying somatic hypermutation (SHM) in B-cell lineages.

Table 3: Germline Reconstruction Algorithm Comparison

Algorithm/Tool Core Methodology Best For
Partis Hidden Markov Model (HMM) based Bayesian inference. High-accuracy BCR reconstruction with SHM.
IgPhyML Phylogenetic model incorporating selection and mutation. Evolutionary analysis of clonal trees.
SONAR Combined alignment and phylogenetic approach. TCR and multi-isotype analysis.

Application Notes & Protocols

Protocol 1: Initial Clonal Grouping via V(D)J and CDR3

Objective: Cluster raw repertoire sequencing reads into preliminary clonal families. Materials: Pre-processed, annotated sequence data (FASTQ/FASTA with VDJ assignments from IgBLAST, MixCR, or IMGT/HighV-QUEST). Procedure:

  • Gene-based Grouping: Partition all sequences into bins sharing identical V gene and J gene assignments.
  • CDR3 Length Filter: Within each bin, subgroup sequences by identical CDR3 nucleotide length.
  • Similarity Clustering: For each length-based subgroup, perform single-linkage clustering based on CDR3 nucleotide Levenshtein distance (e.g., threshold = 1).
  • Validation: Manually inspect clusters from high-frequency bins for potential over-splitting due to sequencing errors.

Protocol 2: HILARy-Enhanced Germline Reconstruction and Refinement

Objective: Reconstruct the germline progenitor sequence and refine clonal boundaries using a probabilistic model. Materials: Preliminary clonal clusters from Protocol 1. Procedure:

  • Input Preparation: For each preliminary cluster, extract multiple sequence alignment of V(D)J regions.
  • Germline Inference: Run the Partis algorithm (partis partition --infname input.csv) to simultaneously infer the most likely germline sequence and reassign sequences to clades based on a joint probability model of SHM and common ancestry.
  • Tree Construction: For each refined clonal family, build a phylogenetic tree using IgPhyML to visualize somatic evolution and validate lineage relationships.
  • Output: A final list of clonal families, each with a consensus germline sequence, all member sequences, and a phylogenetic tree.

Protocol 3: Validation by Synthetic Repertoires

Objective: Benchmark clonal inference accuracy using ground-truth synthetic data. Materials: Synthetic immune repertoire data (e.g., from ImmuneSIM or OLGA). Procedure:

  • Data Generation: Generate a synthetic repertoire with known clonal families, incorporating realistic mutation rates and sequencing error models.
  • Run Inference: Process the synthetic data through Protocols 1 and 2.
  • Calculate Metrics: Compare inferred families to ground truth using precision, recall, and F1-score.
    • Precision: (True Positives) / (All Inferred Family Members)
    • Recall: (True Positives) / (All True Family Members)

The Scientist's Toolkit

Table 4: Essential Research Reagent Solutions for Repertoire Sequencing & Clonal Analysis

Reagent / Material Function & Application
5' RACE Primer Systems Ensures capture of full-length V(D)J transcripts from RNA for unbiased repertoire prep.
Unique Molecular Identifiers (UMIs) Short random nucleotide tags attached to each cDNA molecule to correct for PCR amplification bias and sequencing errors.
High-Fidelity Polymerase Critical for accurate amplification of diverse receptor sequences with minimal error introduction.
IMGT Reference Directory The gold-standard database of germline V, D, and J gene alleles for alignment and annotation.
Spike-in Synthetic Standards Known sequences added to samples to quantify sequencing depth and calibrate error rates.
Barcode-Compatible Sequencing Kit Enables multiplexed, high-throughput sequencing of multiple samples on platforms like Illumina MiSeq/NextSeq.

Visualization of Workflows and Relationships

G start Raw Repertoire Sequencing Reads anno V(D)J Alignment & Annotation (IgBLAST/MixCR) start->anno group Preliminary Grouping: Identical V/J Gene & CDR3 Length anno->group cluster CDR3 Similarity Clustering group->cluster prelim Preliminary Clonal Clusters cluster->prelim hilar HILARy Refinement: Germline Reconstruction & Probabilistic Partitioning prelim->hilar final Final High-Confidence Clonal Families hilar->final output Output: Germline Sequences, Lineage Trees, Member Lists final->output

HILARy Clonal Inference Workflow

G germ Inferred Germline V(D)J Sequence node1 Clonal Member 1 (SHM Level: Low) germ->node1 Somatic Hypermutation node2 Clonal Member 2 (SHM Level: Medium) germ->node2 Somatic Hypermutation node3 Clonal Member 3 (SHM Level: High) germ->node3 Somatic Hypermutation pheno Phenotypic Output: Antibody/TCR Specificity node1->pheno Convergent Affinity Maturation node2->pheno Convergent Affinity Maturation node3->pheno Convergent Affinity Maturation

Clonal Family Evolution from Germline

Logic of Clonal Family Definition

Application Notes

HILARy (Hierarchical Inference of Lymphocyte Antigen-Reactivity) is a computational tool designed to infer clonal families from Adaptive Immune Receptor Repertoire Sequencing (AIRR-seq) data. Its core innovation lies in a multi-step hierarchical clustering approach that moves beyond single-linkage clustering on V/J gene identity and CDR3 length, integrating sequence similarity to define biologically relevant B-cell or T-cell receptor lineages.

Within the broader thesis of clonal family inference, HILARy occupies a critical niche. It addresses the inherent trade-off between specificity and sensitivity in lineage definition. Simpler methods may over-cluster dissimilar sequences or under-cluster highly mutated relatives. HILARy's hierarchical approach aims to balance these by constructing a tree of potential family relationships, allowing for dynamic cutoff selection based on sequence composition and mutation load.

Table 1: Comparison of Key Clustering Methods in AIRR-seq Analysis

Method Core Clustering Principle Key Inputs Primary Output Strengths Limitations HILARy Addresses
Single-linkage (CDR3-based) Pairs sequences by exact CDR3 AA identity & V/J gene. Nucleotide sequences, V/J calls. Groups of identical CDR3s. Simple, fast. Fails to group somatic variants; misses expanded families.
Network-based (e.g., SONAR) Connects nodes (sequences) based on graph distance thresholds. Aligned sequences, genetic distance. Network graphs of related sequences. Visualizes complex relationships. Can be computationally heavy; global threshold may not suit all families.
HILARy (Hierarchical) Agglomerative clustering with multiple, adaptive thresholds. V/J genes, CDR3 nucleotide, sequence alignment. Hierarchical tree with defined clonal groups. Adapts to mutation level; captures nuanced relationships. Computational cost higher than single-linkage.
Phylogeny-guided Builds phylogenetic trees from multiple sequence alignments. High-quality MSA, evolutionary model. Rooted phylogenetic trees. Models evolutionary history. Computationally intensive; requires curated input.

HILARy's algorithm typically proceeds through defined strata: 1) Primary grouping by V gene, J gene, and CDR3 length. 2) Secondary clustering within these groups based on nucleotide sequence similarity of the CDR3 region. 3) Iterative pairwise comparison and tree construction, often using a Hamming distance metric, to merge clusters. The final cluster assignment can be made by cutting the hierarchical tree at a distance threshold that may be informed by the estimated mutation rate.

Experimental Protocols

Protocol 1: Standard HILARy Clonal Family Inference from AIRR-seq Data

Objective: To process raw AIRR-seq data into defined clonal families using the HILARy hierarchical clustering approach.

I. Input Data Preparation

  • Starting Material: Demultiplexed FASTQ files from B-cell or T-cell receptor sequencing (e.g., IgH, TCRβ).
  • Sequence Annotation:
    • Use tools like IgBLAST, MiXCR, or IMGT/HighV-QUEST to align sequences and assign V, D, J genes, and define the CDR3 region.
    • Output must be in standardized AIRR-compliant format (e.g., .tsv) with columns for sequence_id, v_call, j_call, junction (CDR3 nucleotide), and junction_aa.
  • Data Filtering:
    • Remove sequences without a productive V-J assignment.
    • Remove sequences with stop codons within the CDR3.
    • Optional: Remove sequences with low read counts or perceived PCR errors (using tools like pRESTO).

II. HILARy Clustering Execution

  • Primary Clustering (Gene & Length Binning):
    • Group all sequences by identical v_call (or major allele), identical j_call, and identical nucleotide length of the junction field.
    • Output: Initial bins of sequences presumed related by common ancestry.
  • Hierarchical Agglomerative Clustering within Bins:
    • For each bin: a. Perform all-vs-all pairwise alignment of the junction nucleotide sequences. b. Calculate genetic distance (e.g., Hamming distance for equal length sequences). c. Construct a distance matrix. d. Apply an agglomerative hierarchical clustering algorithm (e.g., average-linkage) to the distance matrix to build a tree.
  • Tree Cutting & Cluster Definition:
    • Cut the hierarchical tree using a distance threshold (d). Common practice sets d ≤ 0.1 (10% nucleotide difference) for B-cell receptors to account for somatic hypermutation, but this is tunable.
    • Sequences within each resulting sub-tree are assigned a shared clone_id.

III. Post-processing & Validation

  • Lineage Consolidation: Review clusters for potential merging if sub-clusters share a common ancestor just beyond the strict cutoff, based on biological plausibility.
  • Output Generation: Create a final table with sequence_id, clone_id, and all annotation fields. Generate summary statistics: number of clones, clone size distribution, etc.
  • Validation: Perform basic sanity checks:
    • Visualize clone size distribution (should follow power-law).
    • Align sequences within a large clone to confirm shared mutations and common ancestry.

Protocol 2: Validating HILARy Clusters via Phylogenetic Analysis

Objective: To independently validate the biological relevance of a HILARy-inferred clonal family by constructing a maximum-likelihood phylogenetic tree.

  • Select Clone of Interest: Choose a large or biologically significant clone from HILARy's output.
  • Multiple Sequence Alignment (MSA): Extract the full V(D)J nucleotide sequences for all members. Perform a high-quality MSA using MAFFT or Clustal Omega.
  • Model Selection & Tree Building: Use IQ-TREE or RAxML to:
    • Find the best-fit nucleotide substitution model.
    • Construct a maximum-likelihood phylogenetic tree (with 1000 bootstrap replicates).
  • Comparison: Overlay the HILARy cluster assignment on the phylogenetic tree leaf nodes. A valid HILARy cluster should form a distinct, well-supported monophyletic clade on the phylogenetic tree, confirming its inference of common ancestry.

Diagrams

G Start Raw AIRR-seq FASTQ Files Annotate Sequence Annotation & Alignment (IgBLAST/MiXCR) Start->Annotate Filter Filter Productive Sequences Annotate->Filter Bin Primary Bin by V Gene, J Gene, CDR3 Length Filter->Bin Matrix Compute Pairwise Distance Matrix Bin->Matrix Cluster Hierarchical Agglomerative Clustering Matrix->Cluster Cut Cut Tree at Distance Threshold Cluster->Cut Output Clone Table & Summary Statistics Cut->Output

HILARy Clustering Workflow

HILARy Hierarchical Binning & Clustering

The Scientist's Toolkit

Table 2: Essential Research Reagent Solutions for HILARy Workflow

Item Function in HILARy/AIRR-seq Analysis Example/Note
AIRR-seq Library Prep Kit Prepares cDNA libraries from RNA/DNA of B/T cells for NGS, incorporating unique molecular identifiers (UMIs). Kits from 10x Genomics, iRepertoire, or Takara Bio. Essential for reducing PCR amplification bias.
IGH/TCR Reference Databases Provides germline V, D, J gene sequences for accurate alignment and annotation. IMGT, VDJServer databases. Critical for the first binning step in HILARy.
Sequence Annotation Pipeline Software that aligns raw reads to reference genes, identifies CDR3s, and assigns V(D)J genes. IgBLAST, MiXCR, IMGT/HighV-QUEST. Generates the structured input for HILARy.
HILARy Software Package The core executable or script that performs the hierarchical clustering algorithm on annotated sequences. Available via GitHub repositories or as part of larger toolkits like ImmuneDB or VDJtools.
High-Performance Computing (HPC) Environment Provides the computational resources for pairwise distance calculations and hierarchical clustering on large datasets. Local server cluster or cloud computing (AWS, Google Cloud). Necessary for scaling analysis.
Phylogenetic Analysis Suite Independent validation tool to assess the monophyly of inferred clusters. IQ-TREE, RAxML, PhyML. Used in Protocol 2 for biological validation.
AIRR Data Visualization Tool Software for visualizing clone size distributions, lineage trees, and sequence alignments post-HILARy. Alakazam, VDJviz, Immunarch. Helps interpret and present clustering results.

Accurate inference of B-cell and T-cell clonal families from repertoire sequencing (RepSeq) data is a cornerstone of modern immunology. Within the broader thesis on HILARy (High-Throughput Lymphocyte Antigen Receptor Analysis) clonal family inference, precise clonal tracking transforms raw sequencing data into biologically and clinically meaningful insights. This document outlines key research questions and provides detailed application notes and protocols that leverage accurate clonal inference.

Enabling Key Research Questions

Table 1: Core Research Questions Enabled by Accurate Clonal Inference

Research Domain Key Enabling Question Primary Output Metric Clinical/Basic Relevance
Basic Immunology How do antigen-driven selection pressures shape clonal lineage evolution over time? Normalized Shannon Entropy of clonal tree; Selection strength (dN/dS ratio) Basic
Vaccine Development What defines the breadth, potency, and durability of antigen-specific clonal responses? Clonal Expansion Index; Persistence time (weeks); Somatic Hypermutation (SHM) rate Translational
Autoimmunity & Cancer How do autoreactive or tumor-infiltrating lymphocyte (TIL) clones expand, diversify, and correlate with disease activity? Public clone frequency; Clone size skewness; T-cell receptor (TCR) convergence score Clinical
Immunotherapy Monitoring Which T-cell clones expand post-checkpoint blockade or CAR-T therapy and correlate with response/toxicity? Maximum clone frequency fold-change; Diversity pre/post (1-D Simpson Index) Clinical Biomarker
Infection & Immune Memory What is the clonal architecture of long-lived memory B/T cell pools following infection or vaccination? Memory/Naive clone ratio; Clonal genealogy depth (tree nodes); SHM burden Basic/Translational

Detailed Application Notes & Protocols

Protocol: Longitudinal Tracking of Antigen-Specific Clonal Dynamics

Objective: To quantify the expansion, contraction, and somatic evolution of antigen-specific B-cell clones following immunization.

Workflow Diagram:

G A Sample Collection (Peripheral Blood/ASP) B PBMC Isolation (Ficoll Gradient) A->B C B-cell Enrichment (CD19+ Magnetic Beads) B->C D RNA Extraction & cDNA Synthesis C->D E IGH Library Prep (Multiplex PCR) D->E F High-Throughput Sequencing (MiSeq) E->F G HILARy Pipeline: Clustering & Lineage Inference F->G H Time-Series Analysis: Clone Frequency, SHM, Tree Topology G->H I Correlation with Antigen-Specific ELISpot/Neutralization H->I

Title: Workflow for Tracking B-cell Clonal Dynamics

Materials & Reagents: Table 2: Research Reagent Solutions for Longitudinal Clonal Tracking

Item Function Example Product/Cat. No.
Lymphoprep Density gradient medium for PBMC isolation STEMCELL Technologies, 07801
CD19 MicroBeads, human Positive selection of B cells Miltenyi Biotec, 130-050-301
SMARTer Human B-Cell Receptor cDNA synthesis & amplification of IgH transcripts Takara Bio, 634414
MiSeq Reagent Kit v3 (600-cycle) High-throughput paired-end sequencing Illumina, MS-102-3003
HILARy Clustering Software V(D)J alignment, error correction, clonal family inference HILARy-C GitHub
ELISpot Kit (Antigen-Specific) Functional validation of identified clones Mabtech, HUMAN IFN-γ/IL-21

Procedure:

  • Sample Collection & Processing: Collect peripheral blood at multiple time points (e.g., Day 0, 7, 14, 28, 180). Isolate PBMCs using Lymphoprep per manufacturer's protocol.
  • B-cell Enrichment: Isolate CD19+ B cells using magnetic-activated cell sorting (MACS) with CD19 MicroBeads.
  • Library Preparation: Extract total RNA. Generate B-cell receptor (BCR) amplicon libraries using the SMARTer Human B-Cell Receptor kit, targeting the IGH variable region.
  • Sequencing: Pool libraries and sequence on an Illumina MiSeq platform using a 2x300 bp paired-end kit to achieve >100,000 reads per sample.
  • Clonal Inference Analysis:
    • Process raw FASTQ files through the HILARy pipeline:
      • Align reads to IMGT reference sequences.
      • Correct PCR and sequencing errors.
      • Clustering: Group sequences into clones using a hierarchical clustering algorithm based on V/J gene identity and CDR3 nucleotide similarity (default threshold: 0.85).
      • Lineage Tree Building: Construct maximum parsimony trees for each clone using SHM patterns.
  • Longitudinal Data Integration: Use custom R/Python scripts to track clone IDs across time points. Calculate Clone Frequency (% of total reads), SHM Burden (mutations per sequence), and Tree Complexity (number of nodes, branch length).
  • Functional Correlation: For clones of interest (e.g., high frequency, high SHM), synthesize recombinant antibodies for neutralization assays or correlate expansion with antigen-specific B-cell ELISpot data.

Protocol: Identifying Tumor-Reactive T-cell Clones for Biomarker Discovery

Objective: To identify and characterize tumor-infiltrating lymphocyte (TIL) clones that expand upon immune checkpoint inhibitor (ICI) therapy and correlate with clinical response.

Pathway Diagram:

G Therapy ICI Therapy (e.g., anti-PD-1) Tumor Tumor Microenvironment Therapy->Tumor Modulates CloneA Tumor-Reactive Clone A Tumor->CloneA Presents Neoantigens CloneB Bystander/Exhausted Clone B Tumor->CloneB Expansion Clonal Expansion CloneA->Expansion Proliferates CloneB->Expansion Limited Peripheral Detection in Peripheral Blood Expansion->Peripheral Clonal Emigration Response Clinical Response (e.g., Tumor Shrinkage) Expansion->Response Drives Peripheral->Response Correlates with

Title: Tumor-Reactive T-cell Clone Expansion and Response Pathway

Materials & Reagents: Table 3: Research Reagent Solutions for TIL Clonal Biomarker Discovery

Item Function Example Product/Cat. No.
Tumor Dissociation Kit, human Gentle enzymatic dissociation of solid tumors Miltenyi Biotec, 130-095-929
CD8+ T Cell Isolation Kit, human Enrichment of CD8+ T cells from TILs or PBMCs STEMCELL Technologies, 17953
TCRβ Kit for RNA-Seq Template-switch based TCR repertoire profiling Takara Bio, 634409
Cell Ranger V(D)J Primary analysis pipeline for TCR sequencing 10x Genomics, Software Suite
Clonotype Tracking Software (e.g., LICORN) Cross-sample clonotype matching & tracking LICORN
IFN-γ Secretion Assay Detection Kit Functional validation of reactive clones Miltenyi Biotec, 130-054-202

Procedure:

  • Sample Procurement: Obtain matched tumor biopsy (fresh or frozen) and peripheral blood samples pre-therapy and at an on-treatment timepoint (e.g., 6-12 weeks).
  • Single-Cell Suspension: Process tumor tissue using a human Tumor Dissociation Kit. Isolate PBMCs from blood via Ficoll gradient.
  • T-cell Enrichment (Optional): Isolate CD8+ T cells from TILs and PBMCs using negative selection kits.
  • TCR Sequencing Library Prep: For bulk analysis, extract total RNA and prepare TCRβ libraries using the Takara Bio kit. For single-cell resolution, use the 10x Genomics 5' Immune Profiling solution.
  • Clonal Inference & Tracking:
    • Process data: For bulk, use Cell Ranger V(D)J or MixCR. For single-cell, use the 10x Cell Ranger V(D)J pipeline.
    • Define clonotypes based on identical CDR3β amino acid sequences.
    • Use a clonal tracking tool (e.g., LICORN) to identify "Expanded Shared Clonotypes" present in both tumor and post-therapy blood, with a significant increase in frequency (>5-fold).
  • Biomarker Correlation: Calculate the Clonal Expansion Score (CES) = Σ(Frequency_post-blood of shared clones). Correlate CES with clinical metrics (RECIST response, progression-free survival) using statistical tests (e.g., Cox proportional hazards model).
  • Functional Validation: For top candidate clones, sort single T cells expressing the identified TCR, clone the TCRα/β genes, and express them in reporter cells. Test reactivity against autologous tumor organoids or peptide-MHC multimers.

Data Presentation and Analysis

Table 4: Example Quantitative Output from a Melanoma Anti-PD-1 Therapy Study

Patient ID Clinical Response Pre-Treatment TCR Richness Post-Treatment CES # of Expanded Shared Clones Max Clone Freq. in Blood (Post)
PT-01 Complete Response 45,623 0.087 12 2.41%
PT-02 Partial Response 38,451 0.041 5 1.22%
PT-03 Stable Disease 51,889 0.015 3 0.67%
PT-04 Progressive Disease 41,007 0.005 1 0.11%

Note: TCR Richness: Estimated number of distinct clonotypes; CES: Clonal Expansion Score.

Accurate clonal inference via methods like HILARy is not merely a computational task but a foundational tool that enables researchers to address profound questions in immunology and clinical oncology. The protocols outlined here provide a roadmap for translating repertoire sequencing data into insights about immune dynamics, with direct applications in developing prognostic biomarkers and monitoring therapeutic efficacy.

A Step-by-Step Workflow: Implementing HILARy for Clonal Family Inference in Practice

Within the broader thesis on HILARy clonal family inference from adaptive immune repertoire sequencing, accurate data preprocessing is the critical first step. This protocol details the preparation of raw FASTQ files from bulk or single-cell B/T cell receptor sequencing and their submission to IMGT/HighV-QUEST for comprehensive V(D)J gene annotation. Reliable annotation forms the foundation for downstream clonotype definition, lineage reconstruction, and somatic hypermutation analysis essential to the HILARy (Hierarchical Inference of Lymphocyte Antigen Receptor relationships) framework.

Research Reagent Solutions Toolkit

Table 1: Essential Reagents and Tools for Repertoire Sequencing Library Prep and Analysis

Item Function/Description
UMI-containing RT Primers Unique Molecular Identifiers (UMIs) enable PCR duplicate removal and accurate molecule counting, critical for quantitative clonal analysis.
Multiplex PCR Primers for V/Gene Families Primer sets designed to amplify the diverse V gene segments with minimal bias, often using multiplexed, semi-degenerate approaches.
High-Fidelity DNA Polymerase Essential for amplification with low error rates to minimize sequencing artifacts mistaken for somatic hypermutation.
Dual-Indexed Sequencing Adapters Allow for sample multiplexing and reduce index hopping artifacts in Illumina platforms.
Size Selection Beads (e.g., SPRI) For post-amplification clean-up and selection of correct amplicon size, removing primer dimers and large contaminants.
IMGT/HighV-QUEST The international reference tool for standardized V(D)J gene and allele assignment, junction analysis, and amino acid translation.
pRESTO / IgBLAST / MiXCR Alternative or complementary tools for initial read quality control, assembly, and local annotation.

Protocol: FASTQ File Preparation for IMGT Submission

Initial Quality Control and Demultiplexing

  • Raw Data Assessment: Using FastQC (v0.12.1), generate quality reports for all raw FASTQ files. Key metrics: Per-base sequence quality (Phred score >30 for core V(D)J sequence), adapter contamination, and sequence length distribution.
  • Demultiplexing: Use bcl2fastq (Illumina) or guppy_barcoder (Oxford Nanopore) to generate per-sample FASTQ files based on dual index reads. Verify expected read counts per sample.

Read Processing and Error Correction

This workflow is optimized for Illumina paired-end data with UMIs.

  • Merge Paired-End Reads: Use pRESTO (v0.7.1) AssemblePairs.py or PEAR to overlap R1 and R2, creating full-length amplicon sequences.
  • Identify and Annotate UMIs/Cell Barcodes: Extract UMI and cell barcode sequences from primer regions using pRESTO ParseHeaders.py. For single-cell data, associate reads with cell IDs.
  • Quality Filtering: Filter reads based on merged length (e.g., 250-550 bp for human IgG) and average quality score (Phred >30).

  • Deduplication by UMI: Group reads by UMI, align within groups, and build a consensus sequence to correct for PCR and sequencing errors. pRESTO's ClusterSets.py or UMI-tools can be used.
  • Primer/Constant Region Masking: Mask constant region and primer sequences prior to IMGT submission to avoid interference with V(D)J assignment.

Data Formatting for IMGT/HighV-QUEST

  • File Format Conversion: Ensure final sequences are in FASTA format. The header line should contain a unique sequence identifier.
  • Sequence Requirements: IMGT/HighV-QUEST requires nucleotide sequences of the rearranged V(D)J region. Ensure primers for V and J genes are trimmed/masked. The optimal length is between 250-500 nt.
  • Batch Splitting: For large datasets (>500,000 sequences), split FASTA files into batches of ≤ 300,000 sequences each, as per IMGT submission limits.

Protocol: Annotating V(D)J Genes with IMGT/HighV-QUEST

Online Submission and Parameter Selection

  • Access: Navigate to the IMGT/HighV-QUEST submission portal (https://www.imgt.org/HighV-QUEST/).
  • Upload: Upload the prepared FASTA file.
  • Parameter Configuration (Critical for HILARy):
    • Species and Receptor Type: Select the correct species (e.g., Homo sapiens) and molecule type (Immunoglobulin or TR).
    • Input Type: Choose "Rearranged nucleotide sequences."
    • Results Detail: Select "Detailed view (+AA junction, +V-REGIONs, ...)" to obtain full amino acid translations and V-region alignments required for somatic hypermutation analysis.
    • Allele Alignment Parameters: Use default parameters. The "Include results on alleles from the whole species" box is recommended for comprehensive allele assignment.

Interpretation of Key Output Files for Clonal Inference

Download the compressed result folder upon job completion. Key files include:

  • 1_Summary.txt: Overall statistics (Table 2).
  • 2_IMGT-gapped-nt-sequences.txt: Sequences with IMGT gapping (numbering for alignment).
  • 3_Nt-sequences.txt: V(D)J gene and allele assignments per sequence.
  • 6_Junction.txt: Details of the CDR3 nucleotide and amino acid sequence, including P/N nucleotide identification.

Table 2: Key Quantitative Metrics from IMGT/HighV-QUEST 1_Summary.txt

Metric Description Relevance to HILARy Analysis
Total submitted sequences Count of input FASTA entries. Baseline for preprocessing efficiency.
Identified V-D-J rearrangements Number of sequences with a productive V, (D), J assignment. Defines the starting set of potentially functional clones.
Productive sequences (%) Percentage of sequences in-frame with no stop codon. Primary filter for defining clonotypes.
V, D, J gene usage statistics Frequency of each gene segment. Identifies repertoire biases and informs prior probabilities.
Average mutation level (V-REGION) Mean number of nucleotide substitutions in the V gene. Central input for somatic hypermutation models in lineage construction.

Post-IMGT Processing for HILARy Input

  • Filter for Productive Sequences: Retain only sequences marked as "productive" in the IMGT output.
  • Extract Clonotype Signatures: For each sequence, define a preliminary clonotype key typically as: V_GENE + J_GENE + CDR3_AA_LENGTH. The exact nucleotide CDR3 sequence is used for precise grouping.
  • Collate Mutation Data: Parse the "V-REGION mutation" and "V-REGION identity %" fields to build the mutation matrix for sequences within each clonotype family.

Visualized Workflows

G RawFASTQ Raw Paired-End FASTQ Files QC FastQC Quality Control RawFASTQ->QC Merge Merge R1 & R2 (PEAR/pRESTO) QC->Merge Filter Quality Filter & Primer Masking Merge->Filter Dedup UMI-Based Deduplication Filter->Dedup FastaForIMGT Cleaned FASTA File Dedup->FastaForIMGT IMGT IMGT/HighV-QUEST Submission & Analysis FastaForIMGT->IMGT Results Annotation Results (V/J, CDR3, Mutations) IMGT->Results HILARy HILARy Clonal Family Inference Results->HILARy

Title: FASTQ to Annotated Data Workflow

IMGT/HighV-QUEST Analysis and Data Extraction Logic

G SubmittedFasta Submitted FASTA Sequences CoreProcess IMGT Alignment Engine (V, D, J Gene Assignment, Junction Analysis) SubmittedFasta->CoreProcess OutputFiles Structured Output Files CoreProcess->OutputFiles FilterProductive Filter for Productive Sequences OutputFiles->FilterProductive ClonotypeKey Generate Clonotype Key: V-Gene + J-Gene + CDR3-AA FilterProductive->ClonotypeKey MutationMatrix Collate V-Region Mutation Data FilterProductive->MutationMatrix HILARyInput Structured Input for HILARy Pipeline ClonotypeKey->HILARyInput MutationMatrix->HILARyInput

Title: IMGT Data Processing for HILARy Input

Within the broader thesis on advanced clonal family inference from Adaptive Immune Receptor Repertoire Sequencing (AIRR-seq) data, the HILARy (Hierarchical clustering based on nucleotide distance and germline proximity) algorithm represents a pivotal methodological advancement. It addresses the critical challenge of accurately grouping B-cell or T-cell receptor sequences into clonally related families—a foundational step for analyzing immune repertoire dynamics, somatic hypermutation patterns, and antigen-specific responses in vaccine development, oncology, and autoimmune disease research.

HILARy operates on two primary distance metrics, integrated into a hierarchical clustering framework.

Table 1: Core Distance Metrics in HILARy

Metric Description Calculation Purpose in Clustering
Nucleotide Distance Edit distance between the nucleotide sequences of Complementarity-Determining Region 3 (CDR3). Hamming or Levenshtein distance, often normalized by CDR3 length. Groups sequences with minimal somatic mutation divergence, indicating recent common ancestry.
Germline Proximity Distance between the inferred germline Variable (V) and Joining (J) gene segments. Boolean or weighted score based on identity of V and J gene assignments from IMGT/VDJdb. Groups sequences that originate from the same germline rearrangement event, a prerequisite for clonality.

The algorithm typically employs an agglomerative hierarchical clustering approach, where sequences are initially individual clusters and are iteratively merged based on a composite distance measure combining the above metrics, until a user-defined threshold is reached.

Table 2: Typical HILARy Parameter Thresholds (from Literature)

Parameter Typical Range Impact on Clustering
Maximum CDR3 Nucleotide Distance 0.10 - 0.15 (normalized) Lower value creates more, smaller clusters (strict). Higher value creates fewer, larger clusters (permissive).
V/J Gene Match Requirement Must share identical V and J genes Strict enforcement ensures only sequences from the same rearrangement are clustered.
Linkage Method (Agglomerative) Single, Complete, or Average Single linkage may chain distant sequences; Complete linkage is more conservative.

Application Notes & Protocols

Protocol 3.1: Input Data Preparation for HILARy Analysis

Objective: To process raw AIRR-seq data into the structured input required for the HILARy algorithm. Workflow:

  • Sequence Processing: Use pipelines like pRESTO, Immcantation, or MiXCR to:
    • Demultiplex raw reads.
    • Perform quality filtering and merging (for paired-end reads).
    • Identify and correct PCR/sequencing errors.
  • V(D)J Assignment: Annotate each high-quality sequence with its germline V, D (if applicable), and J genes using a tool like IgBLAST against the IMGT reference database.
  • CDR3 Extraction: Precisely extract the nucleotide and amino acid sequence of the CDR3 region based on conserved motifs (e.g., cysteine at 104, tryptophan/phenylalanine at 118, IMGT numbering).
  • Formatting: Compile the following into a tab-separated value (.tsv) file:
    • Sequence ID
    • Nucleotide CDR3 sequence
    • Assigned V gene
    • Assigned J gene
    • (Optional) Read count or UMI count.

Protocol 3.2: Executing HILARy Clustering

Objective: To cluster preprocessed sequences into clonal families. Software: Implement HILARy via custom scripts (Python/R) or within platforms like scirpy (for single-cell TCR data). Methodology:

  • Distance Matrix Computation:
    • For all sequence pairs within the same sample that share identical V and J gene assignments, calculate the normalized nucleotide distance between their CDR3s.
    • Store results in a pairwise distance matrix. Pairs with different V/J genes are assigned an infinite distance.
  • Hierarchical Clustering:
    • Apply agglomerative hierarchical clustering (e.g., scipy.cluster.hierarchy.linkage) using the precomputed distance matrix and a specified linkage method (average recommended).
  • Cluster Formation:
    • Cut the resulting dendrogram at the specified nucleotide distance threshold (hclust cutree function or equivalent).
    • Assign each sequence to a clonal family (cluster) ID.
  • Output: A table mapping each sequence ID to its clonal family ID, along with cluster size and summary statistics.

Protocol 3.3: Validation and Downstream Analysis

Objective: To validate clustering results and perform biological interpretation. Methodology:

  • Internal Validation: Calculate cluster statistics (size distribution, mean intra-cluster distance, silhouette score) to assess clustering quality.
  • Lineage Tree Reconstruction: For each large cluster, use tools like IgPhyML or dnaml to infer a maximum-likelihood phylogenetic tree from the aligned CDR3 nucleotide sequences. This visualizes somatic hypermutation pathways.
  • Convergence Analysis: Compare CDR3 amino acid sequences across clusters from different samples/individuals to identify public clones or convergent responses.
  • Phenotype Integration: (For single-cell data) Overlay clonal assignment onto UMAP/t-SNE plots and correlate with transcriptional clusters or cell surface protein expression.

Visualizations

G Start Raw AIRR-seq Reads QC Quality Control & Assembly Start->QC Annot V(D)J Germline Annotation QC->Annot Extract CDR3 Extraction Annot->Extract Filter Filter by V/J Identity Extract->Filter Dist Compute Pairwise CDR3 Nucleotide Distance Filter->Dist Cluster Hierarchical Clustering (Agglomerative) Dist->Cluster Cut Cut Dendrogram at Distance Threshold Cluster->Cut Output Clonal Family Assignments Cut->Output

Title: HILARy Algorithm Workflow

G cluster_0 Iterative Merging Process Root Unclustered Sequences (Leaf Nodes) C1 Cluster A (V1, J2, d=0.05) Root->C1 C2 Cluster B (V1, J2, d=0.08) Root->C2 C3 Cluster C (V3, J5, d=0.03) Root->C3 M1 Merge A & B (Min distance) C1->M1 C2->M1 M2 Final Cluster (V1, J2, d=0.10) C3->M2 M1->M2 Thresh Distance Cut Threshold = 0.15

Title: Hierarchical Clustering & Dendrogram Cutting

The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Tools & Reagents for HILARy-Based Research

Item Function / Description Example / Provider
AIRR-seq Library Prep Kit Enables cDNA synthesis, multiplex PCR amplification of V(D)J regions, and addition of sequencing adapters/UMIs. Illumina TCR/BCR Solutions, Takara Bio SMARTer Human V(D)J, 10x Genomics Single Cell Immune Profiling
High-Fidelity DNA Polymerase Critical for accurate amplification with minimal PCR errors that could be mistaken for somatic mutations. KAPA HiFi HotStart, Q5 High-Fidelity (NEB)
UMIs (Unique Molecular Identifiers) Short random nucleotides added to each transcript during cDNA synthesis to correct for PCR amplification bias and sequencing errors. Integrated into commercial library prep kits.
IMGT/IGMT Database The international reference for immunoglobulin and T-cell receptor germline gene sequences. Essential for V(D)J assignment. https://www.imgt.org/
IgBLAST Software The standard tool for aligning sequence reads to germline V, D, J genes and identifying the CDR3 region. NCBI https://ncbi.github.io/igblast/
Immcantation Framework A comprehensive suite of open-source software (pRESTO, Change-O, IgPhyML) for AIRR-seq data analysis from start to finish. https://immcantation.readthedocs.io/
Scirpy Package A scalable Python toolkit for analyzing single-cell TCR and BCR data, including clustering and integrative analysis. https://scirpy.readthedocs.io/
High-Performance Computing (HPC) Cluster Necessary for processing large-scale repertoire datasets (millions of sequences) and performing intensive phylogenetic calculations. Local institutional HPC or cloud services (AWS, Google Cloud).

In the context of a thesis on HILARy (Heavy-Light Adaptive Repertoire) clonal family inference from B-cell receptor repertoire sequencing (RepSeq) data, clustering is a foundational step. The accurate grouping of nucleotide or amino acid sequences into clonal families—descendants of a common progenitor B cell—is paramount for understanding adaptive immune responses, identifying disease correlates, and informing therapeutic antibody discovery. The fidelity of this clustering hinges critically on two algorithmic parameters: the distance threshold (the maximum dissimilarity for sequences to be grouped) and the linkage criterion (the rule defining the distance between clusters). This document provides application notes and protocols for empirically determining these parameters to achieve optimal, biologically-relevant clustering.

Core Quantitative Metrics & Performance Benchmarks

The following table summarizes key performance metrics and common parameter ranges derived from current literature in BCR clonal clustering.

Table 1: Common Clustering Metrics, Parameters, and Their Interpretations

Metric/Parameter Typical Range/Value Description & Impact on Clustering
Hamming Distance Threshold Nucleotide: 0.10 - 0.15Amino Acid: 0.20 - 0.30 Maximum normalized allowed mismatch. Lower values increase specificity (reduce false mergers) but risk splitting true families. V(D)J mutation patterns guide selection.
Linkage Criteria Single, Complete, Average, Ward Single: Chain-sensitive, merges clusters based on nearest neighbors. Prone to chaining.Complete: Conservative, uses farthest neighbors. Produces compact clusters.Average: Balanced compromise. Often recommended for RepSeq.Ward: Minimizes within-cluster variance. Can be sensitive to outliers.
Calinski-Harabasz Index Higher is better. Ratio of between-cluster dispersion to within-cluster dispersion. Used to compare clustering quality across different parameter sets.
Average Silhouette Score -1 to +1 (Closer to +1 is better) Measures how similar an object is to its own cluster compared to other clusters. Useful for validating threshold choice.
Cluster Purity (vs. Ground Truth) 0.0 - 1.0 If a known ground truth (e.g., spike-in clones) exists, measures the fraction of correctly assigned sequences in each cluster.
Number of Inferred Clones Varies by sample depth & diversity The primary output count. Should be stable across reasonable parameter perturbations. Extreme sensitivity indicates overfitting.

Experimental Protocol: Determining Optimal Distance & Linkage

This protocol outlines a systematic approach for parameter tuning using a combination of internal validation metrics and, where available, biological validation.

Protocol Title: Empirical Optimization of Clustering Parameters for BCR Clonal Inference

Objective: To determine the optimal combination of sequence distance threshold and linkage criterion for hierarchical agglomerative clustering of BCR RepSeq data that yields biologically plausible clonal families.

Materials & Input Data:

  • Pre-processed BCR sequencing data (VDJ junctions or full-length sequences, aligned and corrected).
  • A computational environment with clustering libraries (e.g., scipy, sklearn in Python).
  • (Optional) Known spike-in control sequences or paired heavy-light chain data for validation.

Procedure:

  • Data Preparation:

    • For the region of interest (e.g., CDR3+V-region), compute an all-vs-all pairwise distance matrix. Common metrics include Hamming distance (for nucleotides) or Levenshtein distance (accounting for indels).
    • Normalize distances by sequence length.
  • Parameter Grid Definition:

    • Define a grid of values to test.
      • Distance Threshold (dt): e.g., [0.05, 0.10, 0.15, 0.20, 0.25, 0.30] for nucleotides.
      • Linkage Criteria (lc): ['single', 'complete', 'average'].
  • Clustering & Internal Validation Loop:

    • For each (dt, lc) combination in the grid:
      • Perform hierarchical agglomerative clustering on the distance matrix using linkage method lc.
      • Cut the resulting dendrogram at the distance threshold dt to form flat clusters.
      • Calculate internal validation metrics (e.g., Calinski-Harabasz Index, Average Silhouette Score) for the resulting clustering. Note the total number of clusters.
  • Analysis & Primary Selection:

    • Plot the validation metrics against the distance threshold for each linkage method.
    • Identify the "elbow" or plateau region in the Calinski-Harabasz curve and the peak in the Silhouette score. The threshold within this stable region is a candidate for optimality.
    • Assess the sensitivity of cluster count to small changes in dt; prefer a stable region.
  • Biological Validation (If Possible):

    • Using Paired Heavy-Light Chain Data: Under the HILARy thesis framework, the independent clustering of heavy and light chains followed by pairing provides a powerful validation. The optimal parameters should maximize the concordance where heavy and light chains from the same single cell fall into clusters that are frequently paired.
    • Using Spike-in Controls: If control clones are known, calculate cluster purity and completeness for each parameter set.
    • Select the (dt, lc) combination that maximizes biological validity metrics.
  • Final Application & Reporting:

    • Apply the selected optimal parameters to the full dataset.
    • Report the chosen parameters, the validation metrics that led to their selection, and the final cluster statistics.

The Scientist's Toolkit: Research Reagent Solutions

Table 2: Essential Tools and Reagents for Clonal Clustering Research

Item Function in Clustering Workflow
UMI (Unique Molecular Identifier)-based RepSeq Kit (e.g., 10x Genomics 5' VDJ, SMARTer) Reduces PCR and sequencing errors, providing accurate consensus sequences which form the reliable input for distance calculation.
Alignment & Annotation Tool (e.g, IgBLAST, MiXCR) Annotates V, D, J genes and CDR3 regions, enabling focused distance calculation on the relevant, hypervariable segments.
High-Performance Computing (HPC) Cluster or Cloud Instance Essential for computing large, all-vs-all distance matrices (O(n²) complexity) for deep repertoires (>100,000 sequences).
Python/R Libraries (scipy.cluster.hierarchy, scikit-learn, phyloseq) Provide optimized implementations of hierarchical clustering, distance metrics, and validation indices.
Synthetic BCR Control Libraries (Spike-ins) Provide ground truth clonal lineages for benchmarking and tuning clustering parameters in a specific experimental setup.
Single-Cell BCR Sequencing Data Serves as the gold-standard validation tool. The pairing of heavy and light chains from the same cell validates clonal family inferences made from bulk data.

Visualization of the Parameter Optimization Workflow

G palette Start Input: Pre-processed BCR Sequences DistMat Compute All-vs-All Distance Matrix Start->DistMat ParamGrid Define Parameter Grid (dt, lc) DistMat->ParamGrid ClusterLoop For each (dt, lc): 1. Hierarchical Clustering 2. Cut at dt ParamGrid->ClusterLoop Validate Calculate Validation Metrics (CH Index, Silhouette) ClusterLoop->Validate Analyze Analyze Metrics: Find Elbow/Plateau Validate->Analyze Aggregate Results BioValidate Biological Validation (e.g., H-L Pairing) Analyze->BioValidate BioValidate->ClusterLoop Refine Grid Select Select Optimal (dt, lc) BioValidate->Select Pass Apply Apply to Full Dataset Select->Apply

Diagram Title: Parameter Tuning Workflow for Clonal Clustering

Visualization of Linkage Criteria Impact on Cluster Formation

G cluster_single Single Linkage (Nearest Neighbor) cluster_complete Complete Linkage (Farthest Neighbor) cluster_average Average Linkage (Balanced) S1 S1 S2 S2 S1->S2 S3 S3 S2->S3 S4 S4 S3->S4 S5 S5 S4->S5 S6 S6 S5->S6 C1 C1 C2 C2 C1->C2 C3 C3 C6 C6 C4 C4 C5 C5 C4->C5 A1 A1 A2 A2 A1->A2 A3 A3 A2->A3 A4 A4 A5 A5 A4->A5 A6 A6

Diagram Title: Linkage Criteria Impact on Cluster Shape

This document provides detailed application notes and protocols for the downstream analysis of B cell or T cell receptor (BCR/TCR) clonal families, specifically following their inference using the HILARy (High-throughput Immune-system Lymphocyte Analysis via Repertoire sequencing) framework. The HILARy method enables the accurate grouping of repertoire sequencing reads into clonally related families originating from a common progenitor. The subsequent analysis of these families—through lineage tree reconstruction and mutation pattern dissection—is critical for understanding adaptive immune responses, the dynamics of somatic hypermutation (SHM), and for informing vaccine and therapeutic antibody development.

Key Analytical Objectives

The primary downstream objectives are:

  • Lineage Tree Reconstruction: To infer the phylogenetic relationship between sequences within a clonal family, depicting the hypothesized evolutionary history from a common unmutated ancestor.
  • Mutation Pattern Analysis: To quantify and characterize the nature of somatic mutations, including rates, spectra (transition/transversion biases), and targeting motifs (e.g., WRCH/DGYW hotspots).
  • Selection Pressure Analysis: To assess the evidence for antigen-driven selection within the clonal family by comparing observed replacement (R) and silent (S) mutation ratios in framework (FWR) and complementarity-determining regions (CDR).

Table 1: Core Quantitative Outputs from Downstream Clonal Analysis

Metric Description Typical Value/Range Interpretation
Clonal Family Size Number of unique sequences in the family 2 - 10⁴+ Indicates proliferation burst size.
Tree Depth Maximum number of mutations from root to leaf 1 - 30+ SHMs Reflects temporal extent or mutation intensity.
Tree Isomorphism Degree of branching (e.g., star-like vs. linear) Measured via Sackin index Informs on synchronous vs. asynchronous expansion.
SHM Rate Mutations per base pair in V region ~10⁻³ to 10⁻² Overall mutation load.
Transition:Transversion (Ts:Tv) Ratio Ratio of purinepurine/pyrimidinepyrimidine to other changes ~2.0 - 3.0 in mammals Reflects biochemical bias of AID enzyme.
R/S Ratio (CDR) Replacement to Silent mutation ratio in CDRs Often >2.9 Suggests positive antigenic selection.
R/S Ratio (FWR) Replacement to Silent mutation ratio in FWRs Often <1.5 Suggests purifying/structural selection.
Focusing Factor (R/S)CDR / (R/S)FWR >1 indicates selection Quantifies strength of antigen-driven selection.

Table 2: Essential Research Reagent Solutions & Tools

Item Function Example Product/Software
Multiple Sequence Alignment Tool Aligns nucleotide sequences of clonal members. Clustal Omega, MAFFT, IgBLAST.
Germline V/D/J Reference Provides inferred unmutated ancestor sequence. IMGT/GENE-DB, IgBLAST database.
Lineage Tree Building Algorithm Reconstructs phylogenetic trees from aligned sequences. dnaml (PHYLIP), IgPhyML, RAxML, neighbor-joining.
SHM Analysis Suite Quantifies mutations, spectra, and hotspots. Change-O, ShazaM, Immcantation framework.
Tree Visualization Software Renders and annotates lineage trees. ggtree (R), ETE Toolkit, FigTree.
High-Fidelity Polymerase For accurate amplification during library prep. KAPA HiFi, Q5.
UMI-labeled RT Primers For consensus sequencing to reduce PCR errors. Custom-designed primers.

Experimental Protocols

Protocol 4.1: Lineage Tree Reconstruction from a HILARy-Inferred Clonal Family

Objective: To generate a rooted phylogenetic tree depicting the somatic evolution of a B cell clone.

Materials:

  • Output from HILARy pipeline (a FASTA file of nucleotide sequences for a single clonal family).
  • IMGT/GENE-DB reference sequences.
  • Computing environment with IgBLAST, PHYLIP, and R installed.

Methodology:

  • Germline Reconstruction: For the clonal family FASTA, use IgBLAST with the -germline_db_V option against the IMGT database to identify the most likely germline V, D, and J genes. Use a tool like Change-O CreateGermlines.py to reconstruct the inferred, unmutated ancestral sequence.
  • Sequence Alignment: Add the inferred germline sequence to the FASTA file. Perform a multiple sequence alignment (MSA) using MAFFT (mafft --auto input.fasta > aligned.fasta). For BCRs, ensure alignment is codon-aware.
  • Tree Building: Use the aligned sequences (including germline as outgroup) to build a tree.
    • Option A (Maximum Likelihood - Recommended): Use IgPhyML (specialized for Ig sequences) or RAxML with a nucleotide substitution model (e.g., GTR+G).

    • Option B (Distance-based): Calculate a distance matrix (e.g., p-distance) and construct a tree via neighbor-joining using PHYLIP's dnadist and neighbor.
  • Rooting: Root the resulting tree using the inferred germline sequence as the explicit outgroup (the common ancestor).
  • Visualization & Annotation: Import the rooted tree file (Newick format) into R using the ggtree package. Ancode nodes by mutation count, and highlight sequences with shared mutations.

Protocol 4.2: Analysis of Somatic Hypermutation Patterns

Objective: To characterize the type, distribution, and selection pressure of somatic mutations.

Materials:

  • A clonal family alignment and its rooted lineage tree.
  • Annotation of CDR/FWR boundaries (from IMGT numbering).
  • R with ShazaM and dplyr packages.

Methodology:

  • Mutation Calling: Using the reconstructed germline and the aligned sequences, create a mutation map. The shazam function observedMutations calculates the number of R and S mutations per sequence.
  • Mutation Spectrum: Tabulate the count of each type of substitution (A>G, A>C, A>T, etc.). Calculate the overall Transition (Ts: A<>G, C<>T) to Transversion (Tv: all others) ratio.
  • Targeting Motif Analysis: For each mutation, extract the 5-nucleotide context (e.g., the WRC motif on the positive strand where AID preferentially acts). Use shazam to calculate the observed vs. expected mutation frequency in known hotspot motifs.
  • Selection Pressure (BASELINe): This is the gold-standard method.
    • Use the calcBaseline function in shazam to model the expected mutational probability for each CDR and FWR region based on the sequence's nucleotide content and mutability model (e.g., S5F).
    • Compare the observed R/S distributions to these expected null distributions to generate a Bayesian posterior distribution for selection strength.
    • A positive selection score (CDR) indicates antigen-driven selection; a negative score (FWR) indicates purifying selection.

Mandatory Visualizations

G cluster_A Tree Building Options Start HILARy Output (Clonal Family FASTA) G1 1. Germline Reconstruction Start->G1 G2 2. Multiple Sequence Alignment (MSA) G1->G2 G3 3. Phylogenetic Tree Building G2->G3 A1 IgPhyML/RAxML (Maximum Likelihood) A2 PHYLIP (Neighbor-Joining) G4 4. Tree Rooting (Germline as Outgroup) G3->G4 G5 5. Annotated Lineage Tree G4->G5 A1->G4 A2->G4

Lineage Tree Reconstruction Workflow

H Germline Germline Progenitor (e.g., IGHV3-23*01) Int1 Intermediate 1 3 SHMs Germline:p0->Int1:p0 A60G, T120C, G200A Int2 Intermediate 2 5 SHMs Germline:p0->Int2:p0 C48T, A95G G150A, T210C, A255G LeafA Sequence A 8 SHMs (R/S: 4/1) Int1:p0->LeafA:p0 G170A, C220T T260A (CDR2 R) LeafC Sequence C 7 SHMs (R/S: 2/3) Int1:p0->LeafC:p0 T100G, A250G LeafB Sequence B 10 SHMs (R/S: 5/2) Int2:p0->LeafB:p0 A80C, G180T (FWR S)

Example Annotated B Cell Lineage Tree

This Application Note details protocols for analyzing B cell receptor (BCR) repertoire sequencing data to track antigen-specific lineages and identify pathogenic clones. These methods are framed within the broader thesis of High-Inference Lineage Assembly and Reconstruction (HILARy) clonal family inference. HILARy provides a statistical framework for accurately grouping BCR sequences into clonal families based on V(D)J gene usage and junctional homology, which is the critical first step for downstream applications in vaccinology and autoimmunity research.

Application Note: Tracking Vaccine-Specific B Cell Lineages

Following vaccination, B cells recognizing the vaccine antigen undergo clonal expansion and somatic hypermutation. Tracking these lineages over time allows researchers to quantify the breadth, depth, and maturation of the humoral immune response.

Key Quantitative Findings from Recent Studies (2023-2024)

Table 1: Vaccine-Specific B Cell Lineage Dynamics

Parameter Influenza mRNA Vaccine (Study A) SARS-CoV-2 Booster (Study B) RSV Pre-F Vaccine (Study C)
Time to Peak Lineage Expansion 7-10 days post-vaccination 14 days post-booster 10-12 days post-vaccination
Avg. Clonal Family Size (Peak) 45 sequences 120 sequences 28 sequences
Avg. Lineage Mutation Rate (SHM) 8.2% 6.5% 5.1%
Persistence (>6 months) 12% of expanded lineages 25% of expanded lineages Data pending
Cross-Reactive Lineages 35% showed binding to historical strains 15% neutralized XBB.1.5 variant 60% bound both A & B RSV strains

Detailed Protocol: Enrichment and Sequencing of Antigen-Specific B Cells

Protocol 2.3.1: Antigen-Specific B Cell Sorting and BCR-Seq Objective: To isolate vaccine-antigen binding B cells and obtain paired heavy-light chain BCR sequences.

Materials:

  • PBMCs or lymphoid tissue from vaccinated subjects (pre-vax, day 7, day 14, day 28+).
  • Biotinylated vaccine antigen (e.g., spike protein, HA protein).
  • Fluorescent Streptavidin & B cell phenotyping antibodies (CD19, CD20, CD27, CD38, IgD).
  • Fluorescence-Activated Cell Sorter (FACS).
  • Single-cell RT-PCR kit for full-length V(D)J amplification (e.g., SMARTer Human BCR).
  • High-throughput sequencer (Illumina MiSeq/NextSeq).

Procedure:

  • Staining: Stain 10-20 million PBMCs with biotinylated antigen, followed by fluorescent streptavidin and phenotyping antibodies.
  • Sorting: Use FACS to sort single, live, antigen+ memory B cells (CD19+CD20+IgD-CD27+) and plasmablasts (CD19+CD20low/-CD27++CD38++).
  • Library Prep: Perform single-cell lysis and reverse transcription. Amplify full-length IgG heavy and light chain transcripts using V gene primer sets and template-switching.
  • Sequencing: Pool libraries and sequence on a 2x300bp MiSeq run to achieve high-quality, full-length coverage.

Detailed Protocol: HILARy-Based Clonal Lineage Inference

Protocol 2.4.1: Constructing Clonal Families from Sorted BCR-Seq Data Objective: To apply the HILARy framework for accurate clonal grouping and lineage tree construction.

Procedure:

  • Pre-processing: Use tools like pRESTO and Change-O for demultiplexing, quality filtering, and V(D)J assignment (IgBLAST).
  • Clonal Grouping (HILARy Core): a. Define initial clusters by identical IGHV and IGHJ genes and CDR3 nucleotide length. b. Calculate pairwise distances between CDR3 regions within each cluster using a modified Hamming distance. c. Apply a hierarchical clustering algorithm with a dynamic threshold that accounts for sequencing error and SHM. d. Merge clusters if the median distance between them is below the empirically derived threshold (typical range: 0.10-0.15).
  • Lineage Tree Construction: For each clonal family, align sequences (MAFFT) and construct a maximum-likelihood phylogenetic tree (IgPhyML) to model SHM and selection.
  • Downstream Analysis: Annotate trees with time points, calculate SHM rates, and identify convergent antibody sequences across donors.

G start Sorted Antigen-Specific Single B Cells lib Single-Cell BCR Library Prep start->lib seq High-Throughput Sequencing (2x300bp) lib->seq pre Pre-processing: Quality Filter, V(D)J Assign seq->pre hilar HILARy Clonal Inference (V/J Gene + CDR3 Distance) pre->hilar tree Phylogenetic Tree Construction (IgPhyML) hilar->tree output Output: Tracked Lineages with SHM & Temporal Data tree->output

Title: Workflow for Tracking Vaccine-Specific B Cell Lineages

Application Note: Identifying Pathogenic Clones in Autoimmunity

In autoimmune conditions like lupus (SLE) and rheumatoid arthritis (RA), self-reactive B cell clones escape tolerance. Identifying these pathogenic clones from bulk repertoire data is crucial for understanding disease mechanisms and developing targeted therapies.

Key Quantitative Findings from Recent Studies (2023-2024)

Table 2: Pathogenic B Cell Clones in Autoimmunity

Characteristic Systemic Lupus Erythematosus Rheumatoid Arthritis (Anti-Citrullinated Protein) Multiple Sclerosis
Typical Enrichment in Tissue Kidney (Lupus Nephritis): 5-15x vs blood Synovium: 20-50x vs paired blood CSF: 10-30x vs paired blood
Avg. SHM in Pathogenic Clones 11.5% 9.8% 8.2%
Clonal Family Size Large, often >100 sequences Moderate, 20-80 sequences Variable, often expanded in CSF
Recurrent V Gene Usage IGHV4-34 (anti-dsDNA) IGHV1-69/IGHV4-39 (anti-CCP) IGHV4-34, IGHV3-15
Evidence of Antigen Drive Strong (R/S ratio >3 in CDR) Strong (R/S ratio >2.8 in CDR) Moderate (R/S ratio ~2.5)

Detailed Protocol: Identifying Tissue-Restricted Pathogenic Clones

Protocol 3.3.1: Paired Tissue-Blood Repertoire Profiling and Analysis Objective: To identify clones expanded in diseased tissue compared to autologous blood, suggesting local antigen drive.

Materials:

  • Paired samples: Diseased tissue (e.g., kidney biopsy, synovial fluid) and peripheral blood.
  • Single-cell RNA-seq (scRNA-seq) kit with V(D)J enrichment (e.g., 10x Genomics 5' V(D)J).
  • Bioinformatic pipelines: Cell Ranger V(D)J, Seurat.

Procedure:

  • Sample Processing: Generate single-cell suspensions from tissue and blood. Isolate live CD19+ B cells.
  • Library Construction: Use 10x Genomics 5' gene expression with V(D)J enrichment kit according to manufacturer protocol.
  • Sequencing: Sequence libraries to a depth of ~50,000 reads per cell for gene expression and full coverage for BCR.
  • Integrated Analysis: a. Process data with Cell Ranger V(D)J and integrate gene expression (GEX) and BCR data using Seurat. b. Apply HILARy framework separately to tissue and blood BCR data to define clonal families. c. Identify tissue-restricted clones: Calculate a tissue enrichment score: (Clone size in tissue / Total tissue B cells) / (Clone size in blood / Total blood B cells). Clones with a score >10 and absolute presence >5 cells in tissue are flagged. d. Correlate clone phenotype via GEX: e.g., expression of pathogenic markers (e.g., TNF, IL6, ITGAX for age-associated B cells).

Detailed Protocol: Functional Validation of Pathogenicity

Protocol 3.4.2: Recombinant Antibody Expression and Autoreactivity Testing Objective: To confirm the autoreactivity of BCR sequences identified from pathogenic clonal families.

Procedure:

  • Clone Selection: Select 2-3 dominant sequences from each candidate pathogenic clonal family.
  • Recombinant Expression: Synthesize genes for heavy and light chain variable regions. Clone into human IgG1/kappa expression vectors. Co-transfect Expi293F cells using Expifectamine.
  • Purification: Harvest supernatant after 5 days. Purify IgG using Protein A affinity chromatography.
  • Binding Assays:
    • ELISA: Test binding to candidate autoantigens (e.g., dsDNA, citrullinated peptides, myelin basic protein).
    • Immunofluorescence: On HEp-2 cells or primary tissue sections.
  • Functional Assays: Test for complement deposition (C1q binding assay) or stimulation of reporter cells (e.g., NF-κB activation in HEK-Blue TLR9 cells by DNA-immune complexes).

G start1 Tissue B Cells (scRNA-seq/BCR) proc1 V(D)J Assembly & Clonal Grouping start1->proc1 start2 Paired Blood B Cells (scRNA-seq/BCR) proc2 V(D)J Assembly & Clonal Grouping start2->proc2 hilar1 Apply HILARy Framework comp Comparative Analysis: Tissue Enrichment Score hilar1->comp hilar2 Apply HILARy Framework hilar2->comp pheno Phenotype Correlation via GEX Data comp->pheno output Validated List of Pathogenic B Cell Clones proc1->hilar1 proc2->hilar2 val Functional Validation: Recombinant Ab Testing pheno->val val->output

Title: Workflow for Identifying Pathogenic Clones in Autoimmunity

The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Reagents and Tools for B Cell Lineage & Pathogenic Clone Studies

Item Function/Application Example Product/Catalog
Biotinylated Antigens Label antigen-specific B cells for FACS sorting. Critical for vaccine studies. SARS-CoV-2 S-2P Trimer (Acro Biosystems); HA Proteins (Sino Biological)
Single-Cell BCR Amplification Kit Amplify paired heavy/light chains from single sorted B cells. SMARTer Human BCR Profiling Kit (Takara Bio)
10x Genomics 5' V(D)J + GEX Kit Integrated single-cell gene expression and V(D)J sequencing from tissue. 10x Genomics Chromium Next GEM Single Cell 5'
Human IgG Expression Vector For recombinant expression of candidate pathogenic or vaccine-derived antibodies. pFUSEss-CHIg-hG1, pFUSE2ss-CLIg-hk (Invivogen)
Expi293 Expression System High-yield transient expression of recombinant antibodies for validation. Expi293F Cells & Expifectamine (Thermo Fisher)
HILARy-Compatible Software Bioinformatic pipeline for robust clonal family inference. Custom R/Python scripts implementing HILARy algorithm (available on GitHub)
IgPhyML Phylogenetic software designed for modeling B cell lineage trees with SHM. IgPhyML (open source)

Overcoming Common Pitfalls: Best Practices for Optimizing HILARy Analysis and Data Quality

Within the broader thesis on High-Integrity Lymphocyte Antigen Receptor (HILARy) clonal family inference from adaptive immune receptor repertoire sequencing (AIRR-Seq), accurate delineation of clonally related B or T cell sequences is paramount. This process fundamentally relies on clustering nucleotide sequences derived from common progenitor lymphocytes. Two major technical artifacts—sequencing errors and PCR duplicates—severely distort the biological signal, leading to either over-fragmentation (false clusters due to errors) or over-merging (inflated clusters due to duplicates) of clonal families. This application note details their impacts and provides corrected, implementable protocols to ensure high-fidelity HILARy clonal inference for research and therapeutic discovery.

Quantitative Impact on Clustering Accuracy

Table 1: Impact of Artifacts on Clonal Clustering Metrics

Artifact Primary Effect on Clustering Typical Error Rate/Effect Size Impact on Inferred Clonal Frequency
PCR Duplicates Over-merging; reduces unique molecular count. Can constitute 20-80% of raw reads, depending on protocol. Can inflate frequency of dominant clones by >10-fold, skewing diversity indices.
Sequencing Errors (Substitutions) Over-fragmentation; creates artificial diversity. ~0.1-1% per base (NGS platforms). Creates low-frequency "phantom" clones, artificially increases richness.
Indel Errors (especially in CDR3) Severe over-fragmentation; disrupts reading frame. ~0.01-0.1% per base, but impact is catastrophic. Splits true clones into multiple, erroneous small families.
Chimeric PCR Products Creates false, hybrid sequences. Typically 0.5-2% of reads in multiplex PCR. Generates biologically implausible clusters, confounding lineage analysis.

Table 2: Comparative Performance of Correction Strategies

Strategy/Method Key Principle Duplex Consensus Required? Estimated Clustering Accuracy Recovery Computational Demand
Unique Molecular Identifiers (UMI) with network-based correction Deduplication via UMI sequence tags. Yes (optimal) >95% (for duplicates) High
UMI with simple clustering Basic UMI group consensus. No ~85-90% Medium
Read-based deduplication Identical nucleotide sequence merging. No Handles duplicates only; 0% error correction Low
Statistical error correction (e.g., Martin's Algorithm) Expectation-maximization on aligned reads. No ~80-90% (for errors) Medium-High
Hybrid: UMI + Statistical Correction Combines both approaches. Yes >95% (for both artifacts) Very High

Detailed Experimental Protocols

Protocol 2.1: UMI-Based Duplicate Removal and Error Correction for HILARy Inference

Objective: To generate high-fidelity, error-corrected consensus sequences for each original cDNA molecule prior to clonal clustering. Materials: See "Research Reagent Solutions" table. Workflow:

  • Library Preparation: Use a 5' RACE-based AIRR-Seq kit that incorporates double-stranded UMIs (e.g., 12bp randomers) during cDNA synthesis.
  • Sequencing: Perform paired-end sequencing (2x300bp MiSeq/2x150bp NextSeq) to ensure full coverage of the UMI, V-region primers, and the full CDR3.
  • Pre-processing:
    • Truncate reads at first base with Q<30. Trim primer/constant region sequences.
    • Extract and record UMI sequences from the read header or initial bases.
  • UMI Clustering & Consensus Building (Critical Step):
    • Align all reads to a reference IG or TR locus using a lightweight aligner (e.g, minimap2).
    • Group reads by (a) sample barcode, (b) gene primer ID, and (c) UMI sequence (allowing for a Hamming distance of 1-2 to account for UMI synthesis/PCR errors).
    • For each UMI-group, perform a multiple sequence alignment of the variable region.
    • Generate a duplex consensus: Require that mutations (relative to the majority) be present on both forward and reverse strands from different PCR products (evidenced by different paired-end UMIs) to be retained as a true variant. Otherwise, revert to the majority base.
    • Output one consensus sequence per original UMI group. The count of unique UMI groups represents the corrected molecular count.
  • Output: A FASTA/FASTQ file of consensus sequences for downstream V(D)J assignment and clonal clustering.

Protocol 2.2: Post-Sequencing Statistical Error Correction for Legacy Data

Objective: To correct sequencing errors in datasets lacking UMIs, enabling more accurate clustering. Materials: pRESTO or USEARCH suite, high-performance computing node. Workflow:

  • Initial Clustering: Cluster pre-processed reads at a permissive identity threshold (e.g., 96-98% nucleotide identity) using a greedy clustering algorithm (e.g., USEARCH -cluster_fast).
  • Build Multiple Sequence Alignment (MSA): For each cluster, perform a MSA (Muscle or MAFFT).
  • Error Model Application: Apply a probabilistic error model (e.g., within pRESTO's ClusterSets):
    • For each column in the MSA, the base with the highest quality-score-weighted frequency is identified as the "true" base.
    • Bases with low quality scores and low frequency in the column are deemed errors.
  • Generate Corrected Sequence: For each original sequence in the cluster, correct erroneous bases to the consensus of the cluster. Sequences that are predominantly error-filled may be discarded.
  • Re-cluster for Analysis: Use the corrected sequences for definitive clonal clustering at the operational threshold (e.g., 99% for IgH).

Visualization of Workflows

G A Raw AIRR-Seq Reads (With UMIs) B Pre-processing: Trim, Q30, Extract UMI A->B C Group Reads by Sample + Gene + UMI B->C D Align Reads Within UMI Group C->D E Apply Duplex Consensus Rule D->E F Generate Corrected Consensus Sequence E->F G Output for HILARy Clonal Clustering F->G H Legacy Reads (No UMIs) I Pre-processing: Trim, Quality Filter H->I J Permissive Clustering (96-98% Identity) I->J K Multiple Sequence Alignment per Cluster J->K L Apply Probabilistic Error Model K->L M Correct Sequences & Re-cluster at 99% L->M

Diagram 1: Two Pathways for Error/Duplicate Correction

H cluster_PCR PCR Amplification cluster_Seq Sequencing Start True Biological Clone (One B Cell) P1 Generation of PCR Duplicates Start->P1 S1 Introduction of Sequencing Errors P1->S1 Copies are sequenced Obs Observed Read Pool: Variants from One Clone S1->Obs Impact Impact on Clustering Obs->Impact Frag Over-Fragmentation (False Clusters) Impact->Frag Errors dominate Merge Over-Merging (Inflated Clone) Impact->Merge Duplicates dominate

Diagram 2: Artifact Origin & Impact on Clustering

The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Reagents and Tools for High-Fidelity HILARy Prep

Item Function/Principle Example Product/Kit
UMI-Integrated cDNA Synthesis Kit Incorporates unique molecular identifiers at the earliest step to tag each original mRNA molecule. Takara Bio SMARTer Human BCR/Ig Profiling Kit; 10x Genomics 5' Immune Profiling.
High-Fidelity PCR Enzyme Mix Minimizes polymerase-induced errors during library amplification, preserving sequence integrity. KAPA HiFi HotStart ReadyMix; Q5 High-Fidelity DNA Polymerase.
Dual-Indexed UMI-Compatible Adapters Enables multiplexing and accurate pairing of reads back to sample and original molecule. Illumina TruSeq UD Indexes; IDT for Illumina UMI Adapters.
Specialized Analysis Suites Software toolkits designed for UMI processing, error correction, and AIRR-Seq analysis. pRESTO, Immcantation framework, MIXCR.
Spike-in Control Libraries Artificial sequences of known diversity and frequency to quantify duplication and error rates. ERCC RNA Spike-In Mix; Sequins synthetic genomes.

Within the thesis on HILARy (High-resolution Inference of Lymphocyte Antibody Repertoires) clonal family inference, a primary challenge is ensuring robust analysis from suboptimal input data. Low-quality samples, characterized by low read counts, high PCR error rates, or poor template integrity, and sparse repertoires, with limited clonal diversity or depth, can significantly skew clonal clustering, lineage tree construction, and somatic hypermutation analysis. This document provides application notes and protocols for diagnosing these issues and systematically adjusting analytical parameters in repertoire sequencing (RepSeq) pipelines to maintain biological fidelity.

Diagnostic Criteria and Decision Framework

Before parameter adjustment, accurate diagnosis of data issues is crucial. The following thresholds, derived from current literature and benchmark studies (2023-2024), guide initial assessment.

Table 1: Diagnostic Criteria for Low-Quality and Sparse Repertoire Data

Metric Optimal Range Warning Zone Action Required Zone Primary Impact on HILARy Inference
Total Sequencing Reads > 100,000 50,000 - 100,000 < 50,000 Reduced power for rare clone detection; unstable diversity metrics.
Reads per Unique Barcode > 10 5 - 10 < 5 Inability to confidently correct PCR/sequencing errors via consensus.
Inferred Template Count > 80% of reads 60% - 80% < 60% High noise-to-signal ratio; false unique variants inflate diversity.
Mean Phred Quality Score (Q30) ≥ 30 25 - 29 < 25 Increased base-calling errors, misassignment of SHM.
Clonal Richness (Chao1 Estimator) Study-dependent 50% below control 70% below control Sparse repertoire; clonal families may be artificially merged.
Minimum Spanning Tree (MST) Connectivity Well-connected, single component Multiple fragments Highly fragmented Lineage inference fails; SHM pathways are interrupted.

The decision to adjust parameters should follow a logical workflow.

G start Raw Repertoire Data qc QC Metrics in Action Zone? start->qc sparse Sparse Repertoire Diagnosed? qc->sparse Yes proceed Proceed to HILARy Inference qc->proceed No adj Apply Parameter Adjustments sparse->adj Yes sparse->proceed No validate Validate on Control/Hold-Out adj->validate validate->qc Metrics Still Failed validate->proceed Metrics Pass

Diagram Title: Decision Workflow for Parameter Adjustment

Protocol: Parameter Adjustment for Clustering and Error Correction

This protocol details steps for the immunoClust and Change-O pipelines, commonly used in HILARy frameworks.

Materials & Reagents

Table 2: Research Reagent Solutions & Computational Tools

Item Function/Description
UMI (Unique Molecular Identifier)-linked RepSeq Library Enables consensus-based error correction. Critical for low-quality inputs.
PhiX Control V3 (Illumina) Spiked-in during sequencing for quality monitoring and error rate calibration.
Synthetic Immune Repertoire Spike-ins (e.g., AIRRscape Control Set) External multiplex PCR controls for quantifying sensitivity and specificity of recovery.
High-Fidelity DNA Polymerase (e.g., Q5, KAPA HiFi) Used in library amplification steps to minimize PCR errors pre-sequencing.
immunoClust (v2.0+) Adaptive clustering algorithm; key for adjusting distance thresholds.
Change-O/Alakazam (v1.3.0+) Suite for calculating SHM, building lineages, and assigning clonal groups.
scRepertoire (v1.10.0) Useful for comparative visualization of sparse vs. dense repertoires.

Step-by-Step Protocol

A. Pre-processing and Quality Control Enhancement

  • Demultiplexing & UMI Assignment: Use pRESTO (v0.7.0+) with --align set to core (less stringent) for low-quality FASTQs.
  • Consensus Building: For samples with <5 reads/UMI, lower the --minqual threshold from default 20 to 15. Increase --minreads for forming a consensus from 2 to 3 to reduce spurious UMI groups.
  • Gene Assignment: In IgBLAST, for warning-zone samples, consider using the -num_alignments_V flag to report more germline V gene candidates (e.g., from 3 to 5) for ambiguous reads.

B. Adjusting Clonal Grouping (Clonal Family Inference) The core step for handling sparsity. Default nucleotide distance thresholds may be too stringent.

  • Calculate distance models: Run CreateGermlines (Change-O) to reconstruct germline sequences.
  • Run DefineClones.py with modified parameters:
    • For sparse repertoires, relax the distance threshold. If the default is a normalized Hamming distance of 0.15, incrementally increase to 0.18 or 0.20.
    • Use the --model ham (Hamming) instead of hs5f (5-mer substitution model) for shorter, noisier sequences.
    • Implement the --act set criterion (allelic clustering threshold) if V gene assignment is poor.
    • Critical: Always run a synthetic spike-in control with the same parameters to measure False Discovery Rate (FDR).

Table 3: Adjusted Clonal Grouping Parameters for Sparse Data

Parameter Default Value Adjusted Value (Sparse) Rationale
Distance Threshold 0.15 (Normalized) 0.18 - 0.22 Prevents over-fragmentation of related sequences with higher error load.
Linkage Method single average Reduces chaining effects in low-diversity samples.
Minimum Cluster Size 2 1 In very sparse data, singletons may be true, rare clones. Flag for later review.

C. Somatic Hypermutation (SHM) and Lineage Inference

  • SHM Calculation: In Alakazam, use observedMutations with sequenceColumn=sequence_alignment and germlineColumn=germline_alignment_d_mask. For low-quality data, apply a frequency=TRUE filter to ignore mutations seen in only one read.
  • Lineage Tree Building: Use Dowser with start==germline and min_seqs_per_node=1 (instead of 2) for fragmented families. Prioritize tree building via igraph layout for visualization.

Validation and Reporting

After adjustment, validation is mandatory.

  • Internal Validation: Compare clonal cluster size distribution (log-log plot) before and after adjustment. A persistent "bulge" at size=1 indicates unresolved sparsity.
  • External Validation: Use recovery metrics from synthetic spike-ins. Report both sensitivity (true positive rate) and precision (1 - FDR).
  • Reporting: Document all parameter changes in metadata. Use the AIRR (Adaptive Immune Receptor Repertoire) Community MinSEA standards for reporting.

H adj_params Adjusted Parameters hilar_inf HILARy Inference (Clustering, Lineages) adj_params->hilar_inf val_int Internal Validation (Size Distribution, Connectivity) hilar_inf->val_int val_ext External Validation (Spike-in Recovery, FDR) hilar_inf->val_ext pass Analysis Valid val_int->pass Meets Criteria fail Re-evaluate or Flag Data val_int->fail Fails val_ext->pass FDR < 0.05 Sensitivity > 0.8 val_ext->fail Fails

Diagram Title: Post-Adjustment Validation Pathway

Handling low-quality and sparse repertoires requires a disciplined, diagnostic approach. Adjusting analytical parameters—specifically relaxing clonal distance thresholds, modifying UMI consensus rules, and validating with external controls—allows for biologically plausible HILARy inference from suboptimal data. These protocols ensure that conclusions drawn about clonal dynamics, vaccine response, or biomarker discovery remain robust despite technical data limitations. All adjustments must be transparently reported to maintain reproducibility.

Thesis Context: This document, part of a broader thesis on High-throughput Lymphocyte Receptor Analysis (HILARy) for clonal family inference from repertoire sequencing (RepSeq) data, addresses the critical challenge of ambiguous cluster assignments. These ambiguities arise when sequences, particularly those from converging or diverging lineages, exhibit distances that place them at the boundary of defined clonal clusters, complicating accurate lineage reconstruction and clonal tracking.

Quantification of Ambiguity in Clustering

Ambiguity typically manifests when the normalized Hamming or Levenshtein distance between a candidate sequence and two or more pre-defined clonal clusters falls within a poorly discriminant range. The following table summarizes key metrics and thresholds identified from recent literature for defining this "boundary region."

Table 1: Quantitative Boundaries for Ambiguous Cluster Assignment in B/T Cell Receptor Sequencing

Metric Typical Clonal Threshold Ambiguous Zone (Boundary) Common Cause & Implication
Nucleotide Hamming Distance ≤ 0.10 (10% divergence) 0.10 – 0.15 Convergent evolution or shared V-gene motifs; may indicate separate lineages with common ancestors.
Amino Acid Levenshtein Distance ≤ 0.20 0.20 – 0.25 Selection pressure leading to phenotypic convergence; risks merging functionally distinct clones.
SHM (Somatic Hypermutation) Load Clonal members: Similar SHM patterns Mismatch in SHM "hotspots" > 30% Sequences may be from temporally distinct responses (early vs. late germinal center); phylogenetic placement uncertain.
V/J Gene Identity Must be identical for same clone Same V gene, different J gene (or vice versa) Possible lineage relationship vs. independent recombination event.
Cluster Size (No. of Unique Sequences) Well-defined: > 5 members Singleton or doubleton sequences Could be technical artifact, highly expanded low-diversity clone, or true boundary case.

Strategic Framework for Resolution

A multi-algorithmic, evidence-weighted approach is required to resolve boundary cases. The logical workflow for this decision process is outlined below.

G Start Input: Boundary Sequence & Candidate Clusters Step1 1. Phylogenetic Context Check (Build NJ/ML tree with cluster members) Start->Step1 Dec1 Monophyletic within single cluster? Step1->Dec1 Step2 2. Ancestral State Reconstruction (Infer germline sequence) Dec2 Inferred germline identical to cluster GL? Step2->Dec2 Step3 3. Silent vs. Replacement Mutation Analysis (Calculate R/S ratio) Dec3 R/S ratio consistent with cluster selection pressure? Step3->Dec3 Step4 4. Multiparameter Evidence Weighting Out2 Assign to Cluster B (Ancestral & selective evidence) Step4->Out2 Dec1->Step2 No (Basal/Unclear) Out1 Assign to Cluster A (Strong phylogenetic support) Dec1->Out1 Yes Dec2->Step3 Yes Out3 Designate as Novel Singleton (Monitor for future related sequences) Dec2->Out3 No Dec3->Step4 Yes Out4 Flag for Manual Curation (Complex case) Dec3->Out4 No

Diagram Title: Decision Workflow for Boundary Sequence Assignment

Detailed Experimental Protocols

Protocol 3.1: Phylogenetic Context Validation for Boundary Sequences

Objective: To determine if a boundary sequence nests monophyletically within an existing clonal cluster or sits basally between clusters.

Materials: See Scientist's Toolkit (Section 5). Procedure:

  • Sequence Alignment: For each candidate cluster (A, B, etc.) and the boundary sequence, perform multiple sequence alignment (MSA) of the CDR3 nucleotide region plus 50 flanking bases using MAFFT (--auto preset).
  • Tree Building: Generate a Neighbor-Joining (NJ) tree from the MSA using IgPhyML (or FastTree for rapid assessment) with the HKY85 substitution model.
  • Clade Assessment: Visualize the tree (e.g., with FigTree). Assess if the boundary sequence:
    • Forms a distinct branch sister to a well-defined clade (cluster) with high bootstrap support (>70%).
    • Falls within a clade with strong support.
    • Sits on a long branch between clusters with low support.
  • Monophyly Test: Use the ape package in R to test if the boundary sequence and a candidate cluster form a monophyletic group to the exclusion of other clusters.

Protocol 3.2: Silent vs. Replacement Mutation Profiling

Objective: To compare the selection pressure profile of the boundary sequence with candidate clusters, as convergent selection can mimic relatedness.

Procedure:

  • Translate & Align: Translate nucleotide sequences to amino acids. Back-map the nucleotide sequences onto the amino acid alignment.
  • Identify Mutations: For the boundary sequence and the consensus of each candidate cluster, compare each to the inferred germline sequence (from partis or IgBLAST).
  • Categorize: For the FR and CDR regions separately, count:
    • Replacement (R) mutations: Nucleotide change alters the amino acid.
    • Silent (S) mutations: Nucleotide change does not alter the amino acid.
  • Calculate & Compare: Compute the R/S ratio for the boundary sequence. Perform a binomial test (using BASELINe or custom script) to determine if the observed R/S deviation from the expected neutral baseline (~3.0 for FRs, ~0.8-1.0 for CDRs) is consistent with the R/S profile of each candidate cluster.

Integrated Analysis Pipeline Diagram

G RawSeq Raw RepSeq Data (Boundary Reads) PreProc Pre-processing (Alignment, Error Correction) RawSeq->PreProc Alg1 Distance-Based Clustering (e.g., CD-HIT) PreProc->Alg1 Alg2 Graph-Based Clustering (e.g., SCOPe) PreProc->Alg2 Alg3 Model-Based Assignment (e.g., partis) PreProc->Alg3 Conflict Conflict Detection Module (Identifies Boundary Sequences) Alg1->Conflict Alg2->Conflict Alg3->Conflict Mod1 Phylogenetic Validation Module Conflict->Mod1 Mod2 Selection Pressure Analysis Module Conflict->Mod2 Mod3 Evidence Integration Engine Mod1->Mod3 Mod2->Mod3 Output Resolved Assignment (Curated Clonal Catalog) Mod3->Output

Diagram Title: HILARy Boundary Resolution Pipeline

The Scientist's Toolkit: Research Reagent Solutions

Table 2: Essential Tools for Ambiguity Resolution in Clonal Inference

Item / Solution Function in Protocol Key Consideration
IgBLAST (NCBI) Initial sequence annotation (V/D/J genes, SHM). Provides the foundational annotation for all downstream analysis; requires curated germline databases.
partis (https://github.com/psathyrella/partis) Probabilistic clustering, germline inference, and lineage modeling. Gold-standard for model-based assignment; computationally intensive but highly accurate.
Change-O Suite / IgPhyML Phylogenetic tree construction & selection pressure analysis. Specialized for immune repertoire data with models of SHM.
SCOPe (Single Cell Operator) Graph-based clustering using network analysis. Effective for identifying rare intermediates that bridge clusters.
ImmunoSEQ Analyzer (Adaptive Biotech) or VDJtools Commercial/open-source suite for clustering & diversity analysis. Provides standardized, reproducible pipelines for initial clustering and ambiguity flagging.
R/Bioconductor (alakazam, shazam) Calculation of distances, R/S ratios, and statistical testing. Essential for custom evidence weighting and visualization.
Synthetic Spiked-in Control Libraries (e.g., from iRepertoire) Distinguishing technical PCR/sequencing error from true biological variation. Critical for calibrating distance thresholds in the specific wet-lab protocol used.
Long-Read Sequencing (PacBio HiFi, Oxford Nanopore) Resolving complex haplotypes and phasing mutations. Ultimate empirical check for suspected boundary cases by providing full-length, phased sequences.

Application Notes

Efficient computational resource management is critical for HILARy (Hierarchical Inference of Lymphocyte Antigen Receptor families) clonal family inference from large-scale repertoire sequencing (Rep-Seq) datasets. The exponential growth in sequencing depth, often exceeding 1-10 million sequences per sample, presents significant challenges in runtime and memory footprint, directly impacting the scalability and feasibility of large cohort studies in vaccine and therapeutic antibody development.

Core Computational Challenges in HILARy Workflows

HILARy inference involves multiple computationally intensive steps: sequence quality filtering, V(D)J gene annotation, duplicate/error-aware clustering, lineage tree construction, and selection pressure analysis. Each stage has distinct resource profiles.

Table 1: Typical Computational Resource Requirements for Key HILARy Workflow Steps (Per 1 Million Sequences)

Workflow Step Approx. Runtime (CPU hrs) Peak Memory (GB) Primary Bottleneck
Preprocessing & QC 0.5 - 2 4 - 8 I/O, Compression
V(D)J Alignment 5 - 20 16 - 32 Heuristic Search
Clustering (Naive) 10 - 40 30 - 100+ All-vs-All Comparison
Lineage Tree Building 2 - 10 8 - 64 Graph Traversal
Selection Analysis 1 - 5 4 - 16 Statistical Computation

Optimization Strategies and Their Impact

Recent advances in algorithms and data structures offer substantial improvements.

Table 2: Impact of Optimization Strategies on Runtime and Memory

Optimization Strategy Implementation Example Typical Runtime Reduction Typical Memory Reduction
K-mer based pre-clustering Use of CDR3 k-mer sketches 40-70% 50-80%
Parallelized Alignment Multi-threaded IgBLAST/MMseqs2 60-85% (on 16 cores) +10-20% (per thread overhead)
Probabilistic Data Structures Bloom filters for unique sequence tracking ~30% 60-90%
Streaming Algorithms Single-pass clustering (e.g., Alignment-Free) 50-80% 70-95%
Sparse Matrix Operations For distance calculations in clustering 20-40% 70-85%

Detailed Experimental Protocols

Protocol A: Memory-Efficient Clonal Grouping for HILARy Inference

This protocol details a two-stage clustering approach designed to minimize memory use while maintaining accuracy for clonal family inference.

Materials & Reagents: See "The Scientist's Toolkit" below. Software: Python 3.9+, SciPy, NumPy, parasail library, khmer toolkit.

Procedure:

  • Input Preparation:
    • Start with error-corrected, V(D)J-aligned nucleotide sequences in AIRR-compliant TSV format.
    • Extract required fields: sequence_id, sequence_alignment, v_call, j_call, junction.
  • Stage 1 - K-mer Sketching & Partitioning (Memory Reduction):

    • For each unique junction sequence, generate a minimal perfect hash (e.g., using BBhash).
    • Encode each junction into a 4-bit-per-base integer array.
    • Apply a streaming k-mer counting algorithm (K=5) using a Count-Min Sketch data structure (depth=5, width=100000).
    • Partition sequences into "super-groups" based on shared V gene, J gene length, and presence of ≥2 high-frequency (count>3) k-mers. This step reduces the problem space by 80-90%.
  • Stage 2 - Exact Distance Clustering (Within Partitions):

    • For each super-group, load sequences into memory sequentially.
    • Calculate pairwise Levenshtein distances only within the super-group using a banded dynamic programming algorithm (bandwidth=7), optimized via SIMD instructions (parasail.nw_banded).
    • Perform hierarchical clustering using a single-linkage criterion with a Hamming distance threshold (typically 0.10-0.15 of junction length).
    • Output clonal cluster assignments to a new AIRR-formatted file.
  • Validation & Merge (Optional):

    • Perform a lightweight, all-vs-all comparison of cluster centroids across super-groups to merge rare cross-partition families.
    • This step uses a pre-computed index of centroids and is typically <5% of total runtime.

Expected Outcomes: This protocol processes 10 million sequences in under 6 hours using <32 GB RAM on a standard 16-core server, compared to >72 hours and >100 GB RAM for a naive all-vs-all approach.

Protocol B: Runtime-Optimized Parallel Alignment for Large Batches

This protocol leverages distributed computing for the V(D)J alignment step, often the initial bottleneck.

Procedure:

  • Data Chunking:
    • Split the raw FASTQ/FASTA file into smaller chunks of 100,000 sequences each using fastp or a custom Python script with gzip compression.
    • Record chunk boundaries and sequence IDs in a manifest file.
  • Distributed Alignment Job Submission:

    • Using a workload manager (e.g., SLURM, SGE), submit one array job per chunk.
    • Each job runs an alignment tool (e.g., IgBLAST, IMGT/HighV-QUEST) with a pre-built, localized reference database.
    • Critical: Ensure all jobs write temporary files to local node SSD storage, not network drives.
  • Result Aggregation & Deduplication:

    • As jobs complete, a master process aggregates results into a single AIRR TSV.
    • Use a persistent disk-based hash table (e.g., Redis or sqlite3) to identify and merge duplicate alignments for identical sequences across chunks during aggregation, preventing a final O(n²) step.

Diagrams

HILARy Optimization Workflow

hilary_opt cluster_c Clustering Detail Start Raw Rep-Seq Reads (FASTQ) Sub1 A. Parallel Chunking & Preprocessing Start->Sub1 1. Split & QC Sub2 B. Distributed V(D)J Alignment Sub1->Sub2 2. Align Chunks Sub3 C. Memory-Efficient Clustering Sub2->Sub3 3. Merge & Cluster Sub4 D. Streamlined Lineage Inference Sub3->Sub4 4. Build Trees C1 K-mer Sketch & Partition End Annotated Clonal Families Sub4->End C2 In-Partition Exact Comparison C1->C2 C3 Cross-Partition Merge Check C2->C3

Title: HILARy Optimization Pipeline

Memory vs Runtime Trade-off in Strategies

tradeoff cluster_0 Optimization Strategy Landscape Y1 High Memory Use Y2 Low Memory Use X1 Slow Runtime X2 Fast Runtime Naive Naive All-vs-All SimplePar Basic Parallelization Kmer K-mer Sketching Stream Streaming Algorithm Ideal Target Zone

Title: Strategy Trade-off Space

The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Computational Tools & Resources for Rep-Seq Optimization

Item Name Primary Function Key Application in HILARy Optimization
MMseqs2 Ultra-fast protein/nt search & clustering Enables fast, sensitive pre-clustering of sequences before detailed alignment, reducing load on IgBLAST.
IgBLAST (w/ MPI) V(D)J sequence alignment The gold-standard aligner, parallelized with MPI to distribute queries across cores/nodes.
Bloom-filter Libraries (pybloom) Probabilistic membership testing Tracks seen sequences/patterns in constant memory, eliminating redundant comparisons.
UCX & OpenMPI High-performance inter-process communication Critical for low-latency data transfer in distributed alignment workflows on HPC clusters.
Zarr / HDF5 Formats Chunked, compressed array storage Stores massive sequence distance matrices on disk with efficient partial I/O, avoiding RAM limits.
Snakemake / Nextflow Workflow management Orchestrates complex, multi-step pipelines with automatic resource request and checkpointing.
Intel ISA-L / SSE/AVX Hardware-accelerated string kernels Optimizes core edit-distance and hashing calculations via CPU SIMD instructions.
NumPy / SciPy Sparse Sparse matrix operations Efficiently represents and computes on sparse sequence similarity graphs.

Within the broader thesis on HILARy (Heavy-paiR Lineage ReconstRuction) for B-cell clonal family inference from Rep-Seq data, successful multi-omics integration is paramount. These application notes provide detailed protocols for converting, validating, and analyzing HILARy's clonal lineage outputs within established immunological and genomic pipelines, enabling systems-level investigation of adaptive immune responses.

HILARy Output Formats and Conversion Protocols

HILARy generates primary outputs detailing inferred clonal families, phylogenetic trees, and mutation annotations. Direct compatibility with downstream tools requires structured conversion.

Primary Output Structure

Table 1: Core HILARy Output Files and Descriptions

File Name Format Key Content Primary Use
clonal_families.tsv TSV Clone ID, Sequence ID, Isotype, V/J gene, CDR3 Core clonal grouping
lineage_trees.nwk Newick Phylogenetic tree per clone Lineage visualization & evolution
mutations.json JSON Nucleotide/AA substitutions per branch Somatic hypermutation analysis
convergence_groups.txt Text Groups of clones with similar CDR3s Repertoire convergence detection

Protocol: Conversion to AIRR-Compliant Format

The Adaptive Immune Receptor Repertoire (AIRR) Community standards ensure cross-tool compatibility.

Materials:

  • Input: HILARy clonal_families.tsv, original sequence alignment files.
  • Software: Python 3.9+, pandas library, airr standards library.
  • Reference: AIRR Rearrangement schema (v1.4).

Procedure:

  • Load the HILARy clonal assignments and the original sequencing data (e.g., in changeo or immunarch compatible format).
  • Map the sequence_id column from HILARy to the sequence_id in the AIRR-formatted TSV.
  • Create a new column clone_id in the AIRR TSV, populating it with HILARy's assignments. Use -1 for singletons not assigned to a clone.
  • For clones with lineage trees, create a separate airr-trees.json file. Convert each Newick tree to the PhyloXML-based AIRR Tree schema using the python biopython library.
  • Validate the output files using the airr-tools validate command-line utility or the Airr R package validation functions.

HILARy_AIRR_Conversion HILARy_TSV HILARy clonal_families.tsv Python_Script Conversion Script (Python/pandas) HILARy_TSV->Python_Script Seq_Data Original Sequence Data (AIRR-compliant TSV) Seq_Data->Python_Script AIRR_Rearr AIRR Rearrangement.tsv (with clone_id) Python_Script->AIRR_Rearr AIRR_Trees AIRR Trees.json (PhyloXML) Python_Script->AIRR_Trees Validator AIRR Tools Validator AIRR_Rearr->Validator AIRR_Trees->Validator

HILARy to AIRR Standards Conversion Workflow

Protocol for Multi-Omic Integration with Transcriptomic Data

Correlating clonal expansion with gene expression profiles from single-cell RNA sequencing (scRNA-seq) reveals functional states of expanded B-cell clones.

Experimental Design & Reagent Solutions

Table 2: Key Reagents for Linked BCR-seq & scRNA-seq

Reagent / Solution Vendor (Example) Function in Multi-Omics Workflow
10x Genomics Chromium Next GEM Single Cell 5' v2 10x Genomics Partitions single cells for co-encapsulation of mRNA and V(D)J transcripts.
Feature Barcoding technology (CellPlex or Antibody) 10x Genomics Allows sample multiplexing, critical for pooling patients/conditions pre-assay.
BD Rhapsody BCR Single-Cell Analysis System BD Biosciences Alternative platform for coupled whole transcriptome and targeted BCR amplification.
SMARTer V(D)J Reagents for T and B Cells Takara Bio Provides template-switching for full-length V(D)J enrichment in plate-based protocols.
Cell Hashing Antibodies (TotalSeq-B) BioLegend Antibodies conjugated to oligonucleotide barcodes for sample multiplexing prior to 10x runs.

Protocol: Clonal Tracing in scRNA-seq Data

This protocol assumes scRNA-seq data with paired BCR amplification (e.g., from 10x Genomics Cell Ranger) has been generated.

Materials:

  • Input: filtered_contig_annotations.csv (from Cell Ranger VDJ), gene_expression_matrix (from Cell Ranger Count).
  • Software: R (v4.2+), Seurat (v5.0), scRepertoire (v1.10), immunarch.
  • Reference: HILARy-derived clone list for the same donor.

Procedure:

  • Load Data: Create a Seurat object from the gene expression matrix. Separately, load the VDJ contig data using scRepertoire::combineTCR().
  • Merge Assays: Add the clonal information as a new assay or metadata to the Seurat object using scRepertoire::combineExpression().
  • Cross-Reference with HILARy:
    • Extract the CDR3 nucleotide sequences and associated V/J genes for clones of interest from HILARy output.
    • Query the single-cell VDJ data for cells containing matching CDR3 sequences and V/J genes. Use a fuzzy matching algorithm (allowing 1-2 nucleotide mismatches) to account for sequencing errors.
    • Create a new metadata column (e.g., hilar_clone) in the Seurat object, labeling cells belonging to HILARy-inferred clones.
  • Downstream Analysis: Subset the Seurat object to focus on cells from expanded clones. Perform differential gene expression (DGE) using Seurat::FindMarkers() between cells of an expanded clone versus all other B cells. Conduct pathway enrichment analysis on DGE results using clusterProfiler.

Clonal_ScRNA_Integration HILARy_Clones HILARy Clone List (CDR3, V/J) Matching Fuzzy CDR3/V/J Matching (R/scRepertoire) HILARy_Clones->Matching Sc_VDJ_Data Single-Cell VDJ Data (filtered_contig.csv) Sc_VDJ_Data->Matching Sc_GEX_Data Single-Cell Gene Expression (Seurat Object) Annot_Seurat Annotated Seurat Object (with clone_id) Sc_GEX_Data->Annot_Seurat Matching->Annot_Seurat DGE Differential Expression & Pathway Analysis Annot_Seurat->DGE

Clonal Tracing in scRNA-seq Data Workflow

Protocol for Integrating Clonal Dynamics with Clinical Metadata

Longitudinal analysis links clone expansion/contraction to patient treatment and outcome.

Protocol: Longitudinal Clone Tracking Dashboard

Materials:

  • Input: HILARy outputs across multiple timepoints, Patient clinical data (CSV).
  • Software: R tidyverse, shiny, ggiraph, survival.
  • Database: (Optional) SQLite database for large cohort data.

Procedure:

  • Data Harmonization: For each patient/timepoint, load the HILARy clonal_families.tsv. Calculate clonal metrics: Shannon diversity, clone size distribution, largest clone fraction.
  • Merge with Clinical Data: Create a master data frame linking each sample's clonal metrics to clinical variables (e.g., therapy received, disease activity score, response status).
  • Build Shiny Application:
    • UI: Create input selectors for Patient ID, Clone ID, Time Range. Define outputs: plot for clone size over time, heatmap of top clones across timepoints, Kaplan-Meier plot panel.
    • Server Logic:
      • For a selected clone, retrieve its relative frequency at all timepoints and plot against clinical lab values (e.g., serum IgG titer).
      • Generate a survival plot where patients are stratified by the presence/absence or size of a specific immunodominant clone at baseline, using the survival::survfit() function.
  • Deploy: Deploy the dashboard locally or on a secure server for collaborative clinical review.

Longitudinal_Dashboard HILARy_Timepoints HILARy Outputs (T1, T2, ...Tn) Data_Engine R Data Engine (Calculate Metrics, Merge) HILARy_Timepoints->Data_Engine Clinical_DB Clinical Database (Therapy, Outcomes) Clinical_DB->Data_Engine Shiny_Server Shiny Server Logic (Plots, Stats) Data_Engine->Shiny_Server Dashboard Interactive Dashboard (Plots, Tables) Shiny_Server->Dashboard

Longitudinal Clinical Integration Dashboard

Validation Protocol: Cross-Tool Clonal Concordance

Ensuring HILARy's inferences are robust requires benchmarking against other clonal grouping tools.

Experimental Protocol

Materials:

  • Input: Public benchmark dataset (e.g., from Observed Antibody Space) or in-house Rep-Seq data.
  • Software: HILARy, changeo-clone (DefineClones.py), partis, scoper (R).
  • Compute: High-performance computing cluster recommended.

Procedure:

  • Data Processing: Starting from the same filtered AIRR Rearrangement file, generate clonal groupings using each tool with its default parameters for identical V/J gene and CDR3 nucleotide distance thresholds.
  • Comparison Metric Calculation: Use the clonality R package or custom scripts to compute:
    • Pairwise Adjusted Rand Index (ARI) between tools.
    • Jaccard index for specific large clones.
    • Run-time and memory usage for each tool on identical hardware.
  • Ground Truth Comparison: If using simulated data with known clonal origins, calculate precision, recall, and F1-score for each tool's ability to recover true clones.
  • Visualization: Create a clustered heatmap of ARI scores and a bar chart of performance metrics.

Table 3: Example Cross-Tool Concordance Results (Simulated Data)

Tool Comparison (A vs B) Adjusted Rand Index (ARI) Mean Jaccard of Top 10 Clones Relative Runtime
HILARy vs. changeo-clone 0.92 0.88 1.5x
HILARy vs. partis 0.85 0.79 0.3x
HILARy vs. scoper 0.95 0.91 2.1x

Advanced Pathway: Structural Modeling of Convergent Antibodies

Integrating HILARy-identified convergent responses with structural prediction tools.

Protocol: From Clone to Structure

Materials:

  • Input: HILARy convergence_groups.txt, annotated AIRR file.
  • Software: ANARCI for domain annotation, IgFold or ABodyBuilder for structure prediction, PyMOL for visualization.
  • Web Service: NCBI BLAST for germline gene identification.

Procedure:

  • Select a convergence group from HILARy output containing clones from multiple subjects with similar CDR3s.
  • For each clone representative, use ANARCI to assign IMGT numbering and identify framework/cdr regions.
  • Submit the heavy and light chain Fv sequences to IgFold (via local install or API) to generate a predicted 3D model in PDB format.
  • Superimpose the predicted structures of convergent antibodies in PyMOL to analyze structural commonalities in paratope geometry.
  • Dock the consensus structural model to a target antigen (if known) using ZDOCK or HADDOCK to hypothesize epitope specificity.

These application notes provide a actionable framework for integrating HILARy's precise clonal inferences into multi-omics workflows, thereby amplifying its value within a thesis on B-cell repertoire dynamics and enabling translational discoveries in immunology and drug development.

Benchmarking HILARy: Validation Strategies and Comparative Analysis with Alternative Tools

Within the broader thesis on HILARy (High-accuracy Inference of Lymphocyte Antigen Receptor families) clonal family inference from repertoire sequencing, establishing robust ground truth is paramount. This Application Notes and Protocols document details methodologies for generating and validating synthetic immune receptor repertoires, alongside functional validation using engineered in vitro cell line systems. These approaches provide controlled datasets to benchmark clonal clustering, lineage reconstruction, and diversity estimation algorithms, directly addressing key challenges in therapeutic antibody discovery and immune monitoring.

High-throughput sequencing of B-cell and T-cell receptor repertoires enables insights into adaptive immune responses. However, computational inference of clonal families—groups of lymphocytes descended from a common ancestor—suffers from ambiguous validation due to the lack of known truth sets in biological samples. Synthetic repertoires with pre-defined clonal structures and in vitro cell lines with known antigen specificities provide essential validation frameworks to assess the accuracy, sensitivity, and specificity of tools like HILARy.

Research Reagent Solutions Toolkit

The following table lists essential reagents and resources for conducting ground truth validation experiments.

Item Name Supplier/Catalog Example Function in Validation
Synthetic V(D)J Reference Standards e.g., AIRR-seq Control Library (LegoChem) Provides DNA/RNA mixes with known clonal families, V/D/J usage, and mutation profiles for sequencing platform and pipeline calibration.
gBlock Gene Fragments Integrated DNA Technologies (IDT) Custom double-stranded DNA fragments used to construct synthetic immune receptor genes with specified mutations for clonal lineage simulation.
HEK 293T Cell Line ATCC CRL-3216 Highly transfectable cell line used for in vitro expression of synthetic antibody or TCR libraries for functional screening.
pFUSE Vectors Invivogen Modular antibody expression plasmids (IgG, Fab) for cloning synthetic variable regions into constant domain backbones.
Fluorescent Antigen Probes e.g., MHC Dextramers (Immudex) Multimeric peptide-MHC complexes conjugated to fluorophores for staining and sorting T-cells with known antigen specificity.
Cell Sorting Buffers BD Pharmingen Stain Buffer PBS-based buffers with fetal bovine serum to maintain cell viability during fluorescent-activated cell sorting (FACS) based on antigen binding.
Next-Gen Sequencing Kit Illumina MiSeq v3 (600-cycle) Provides sufficient read length for full-length variable region sequencing of paired heavy and light chains.
UMI Adapter Kit NEBNext Multiplex Oligos for Illumina Adds unique molecular identifiers (UMIs) to cDNA during library prep to correct for PCR amplification bias and sequencing errors.

Protocols

Protocol A: Generation and Sequencing of a Synthetic Repertoire with Defined Clonality

Objective: To create a DNA library mimicking a B-cell receptor repertoire with pre-defined clonal families, somatic hypermutations, and abundances for benchmarking HILARy’s clustering performance.

Materials:

  • gBlock gene fragments (IDT)
  • Q5 High-Fidelity DNA Polymerase (NEB)
  • MiSeq Reagent Kit v3 (Illumina)
  • UMI Adapter Kit

Procedure:

  • Clonal Family Design: Design 10-50 distinct "founder" heavy-chain V(D)J sequences. For each founder, generate 5-20 "descendant" sequences by introducing random point mutations (simulating SHM) at a defined rate (e.g., 2-10%).
  • Sequence Synthesis: Order all designed variable region sequences as gBlock fragments, each flanked by amplification primers and restriction sites.
  • Library Assembly: Amplify each gBlock via PCR. Gel-purify and pool fragments in a stratified manner to simulate realistic clonal frequency distributions (e.g., a few dominant clones, many rare clones). Ligate into a linearized vector backbone.
  • Next-Generation Sequencing: Prepare sequencing libraries from the plasmid pool using a kit that incorporates UMIs. Sequence on an Illumina MiSeq platform with 2x300bp paired-end reads to ensure full coverage of the variable region.
  • Ground Truth Table Generation: Create a comprehensive table mapping every synthesized DNA sequence to its assigned clonal family founder.

Table 1: Synthetic Repertoire Ground Truth Summary

Clonal Family ID Founder V/J Genes Number of Unique Sequences Avg. Mutation Rate (%) Designed Frequency in Pool (%)
CF_01 IGHV1-201, IGHJ401 15 5.2 12.5
CF_02 IGHV3-2304, IGHJ602 8 3.7 8.1
CF_03 IGHV4-3401, IGHJ501 22 8.9 5.4
... ... ... ... ...
CF_48 IGHV5-5103, IGHJ302 5 2.1 0.1

Protocol B: In Vitro Validation Using an Engineered Antigen-Specific B-Cell Line

Objective: To functionally validate clonal families inferred by HILARy by expressing paired heavy and light chains from a putative family and testing for shared antigen specificity.

Materials:

  • HEK 293T cells
  • pFUSE IgG expression vectors
  • Recombinant target antigen (e.g., HIV gp120)
  • FACS buffer, anti-human IgG Fc detection antibody

Procedure:

  • Candidate Selection: From a biological repertoire sequenced and analyzed by HILARy, select 3-5 representative sequences from a computationally inferred clonal family.
  • Antibody Expression: Clone the heavy and light chain variable regions for each selected sequence into pFUSE-IgG1 vectors. Co-transfect HEK 293T cells in separate wells with heavy/light chain plasmid pairs.
  • Supernatant Harvest: Culture transfected cells for 5-7 days. Harvest cell culture supernatant containing secreted IgG.
  • Antigen Binding Assay (ELISA): Coat ELISA plates with target antigen. Add supernatants. Detect bound IgG using an enzyme-conjugated anti-human Fc antibody. Include positive (known binder) and negative (untransfected supernatant) controls.
  • Data Interpretation: Consistent antigen binding across antibodies derived from the same HILARy-inferred family provides strong functional validation of the clonal grouping. Discrepancies indicate potential inference errors.

Table 2: In Vitro Binding Results for HILARy-Inferred Clonal Family #7

Test Antibody (Sequence ID) Clonal Family Assignment (HILARy) ELISA OD450 (Mean ± SD) Antigen Binding Positive?
BioSeq145 CF_07 2.34 ± 0.21 Yes
BioSeq149 CF_07 1.89 ± 0.15 Yes
BioSeq152 CF_07 0.08 ± 0.02 No
BioSeq160 CF_07 2.01 ± 0.18 Yes
Positive Control N/A 2.50 ± 0.10 Yes
Negative Control N/A 0.05 ± 0.01 No

Visualizations

G Start Define Validation Objective A Synthetic Data Track Start->A B In Vitro Validation Track Start->B A1 Design Clonal Families (Founders & Mutants) A->A1 A2 Synthesize gBlocks & Assemble Library A1->A2 A3 Sequence Library with UMIs A2->A3 A4 Generate Ground Truth Table A3->A4 Benchmark Benchmark & Refine HILARy Algorithm A4->Benchmark B1 Select Candidates from HILARy-Inferred Families B->B1 B2 Clone & Express Antibodies in 293T Cells B1->B2 B3 Perform Functional Assay (e.g., Antigen ELISA) B2->B3 B4 Compare Function to Computational Inference B3->B4 B4->Benchmark

Title: Ground Truth Validation Workflow for HILARy

G cluster_0 Validation Points Input Raw AIRR-seq Reads UMI UMI Consensus & Error Correction Input->UMI VDJ V(D)J Alignment & Annotation UMI->VDJ Cluster Clonal Clustering (Initial Graph) VDJ->Cluster Family Clonal Family Output Cluster->Family InVitroVal In Vitro Functional Assay Family->InVitroVal SynthVal Synthetic Repertoire Benchmarking SynthVal->Cluster

Title: HILARy Inference Pipeline with Validation Points

Within the HILARy (High-throughput Immune-repertoire Lineage and Repertoire) clonal family inference framework, the accurate evaluation of algorithm performance is paramount for advancing repertoire sequencing research and its applications in immunology and therapeutic discovery. This protocol details the standardized application of the core metrics—Precision, Recall, and Computational Efficiency—to assess and compare clonal inference tools.

Core Performance Metrics: Definitions & Quantitative Benchmarks

Table 1: Core Metric Definitions and Formulas

Metric Definition Formula
Precision The fraction of inferred clonal relationships that are correct (True Positives) out of all inferred relationships. Measures correctness. Precision = TP / (TP + FP)
Recall (Sensitivity) The fraction of all true clonal relationships that are correctly identified by the inference algorithm. Measures completeness. Recall = TP / (TP + FN)
F1-Score The harmonic mean of Precision and Recall, providing a single balanced metric. F1 = 2 * (Precision * Recall) / (Precision + Recall)
Computational Efficiency The computational resources required for analysis, typically measured as wall-clock time and peak memory (RAM) usage. Time (seconds), Memory (GB)

Table 2: Example Benchmark Results for Select Inference Tools Data sourced from recent benchmarking studies (e.g., Immcantation framework, DANGER comparisons).

Tool / Algorithm Precision Recall F1-Score Time (min) Memory (GB)
Partis 0.95 0.85 0.90 120 8.2
SCOPer 0.92 0.88 0.90 95 6.5
Hierarchical Clustering 0.80 0.95 0.87 45 4.0
IGH-DATA 0.98 0.75 0.85 180 12.0

Experimental Protocols

Protocol 1: Generating a Ground Truth Dataset for Metric Calculation

Objective: To create a validated set of clonal families from synthetic or spike-in control data to serve as the benchmark for calculating Precision and Recall.

Materials: See "The Scientist's Toolkit" below. Procedure:

  • Synthetic Repertoire Generation: Use a tool like IGH-SIM or SONAR to generate a synthetic adaptive immune receptor repertoire (AIRR) dataset.
    • Input parameters must include a known, predefined number of distinct clonal families (germline sequences).
    • Introduce realistic levels of somatic hypermutation (SHM) and sequencing errors per experimental design.
  • Data Processing: Process the raw synthetic reads through a standardized pipeline (e.g., pRESTO, IgBLAST) for quality control, V(D)J alignment, and generation of Change-O formatted tables.
  • Ground Truth Annotation: Using the simulation metadata, annotate each sequence in the processed dataset with its known clone of origin. This annotated file is the ground truth.
  • Clonal Inference: Run the clonal inference algorithm(s) under test (e.g., using Change-O's DefineClones.py) on the processed, but un-annotated, synthetic data to generate group assignments for each sequence.
  • Metric Calculation: Use a script (e.g., in R with shazam and dplyr) to compare algorithm assignments against the ground truth.
    • A True Positive (TP) is a pair of sequences placed in the same inferred clone that share the same ground truth label.
    • A False Positive (FP) is a pair placed in the same inferred clone but with different ground truth labels.
    • A False Negative (FN) is a pair placed in different inferred clones but sharing the same ground truth label.
    • Calculate Precision, Recall, and F1-Score using the formulas in Table 1.

Protocol 2: Benchmarking Computational Efficiency

Objective: To reproducibly measure the runtime and memory consumption of a clonal inference tool.

Materials: Computing infrastructure (HPC, cloud, or local server), containerization software (Docker/Singularity), system monitoring tool (/usr/bin/time, psrecord). Procedure:

  • Environment Standardization: Containerize the analysis pipeline using Docker to ensure consistent dependency versions across runs.
  • Dataset Curation: Prepare a series of input files (in AIRR .tsv format) of increasing size (e.g., 10^3, 10^4, 10^5, 10^6 sequences).
  • Execution & Profiling: For each input size:
    • Use the time command (e.g., /usr/bin/time -v) to execute the core clonal inference command.
    • Record the "Elapsed (wall clock) time" and "Maximum resident set size (kbytes)".
    • Optionally, use a profiler like psrecord to graph CPU and memory usage over time.
  • Data Aggregation: Plot runtime and peak memory usage against input size. The slope indicates scalability—a critical factor for large-scale repertoire studies.

Visualizations

workflow SyntheticData Synthetic/Spike-in AIRR-Seq Data Processing Data Processing (QC, VDJ Alignment) SyntheticData->Processing GroundTruth Annotated Ground Truth Processing->GroundTruth TestAlgorithm Clonal Inference Algorithm (Test) Processing->TestAlgorithm Comparison Pairwise Comparison (TP, FP, FN) GroundTruth->Comparison InferredClones Inferred Clonal Groups TestAlgorithm->InferredClones InferredClones->Comparison Metrics Calculate Metrics (Precision, Recall, F1) Comparison->Metrics

Title: Performance Evaluation Workflow

metrics P Precision (PPV) F1 F1-Score (Harmonic Mean) P->F1 R Recall (Sensitivity) R->F1 a a->P TP/(TP+FP) a->R TP/(TP+FN) b

Title: Relationship Between Core Metrics

The Scientist's Toolkit

Table 3: Essential Research Reagents & Solutions for Performance Benchmarking

Item Function/Description
Synthetic Repertoire Simulators (e.g., IGH-SIM, SONAR) Generates ground truth AIRR-seq data with known clonal relationships for controlled benchmarking.
AIRR-Compliant Data Files (.tsv) Standardized input/output format (via AIRR Community) ensuring interoperability between tools.
Container Images (Docker/Singularity) Provides reproducible, version-controlled computational environments (e.g., Immcantation, VDJServer images).
Benchmarking Suites (e.g., DANGER, ImmBench) Curated scripts and datasets for standardized performance comparison across multiple algorithms.
High-Performance Computing (HPC) Resources Essential for running efficiency benchmarks on large datasets, measuring scalability.
AIRR Tools (pRESTO, IgBLAST, Change-O) Core software suite for processing raw reads, performing V(D)J alignment, and basic clonal grouping.
R/Python Packages (shazam, dplyr, scipy, pandas) Libraries for calculating metrics, statistical analysis, and visualizing benchmarking results.

Application Notes

Clonal family inference from B-cell receptor (BCR) repertoire sequencing is a foundational step in immunoinformatics, enabling the study of adaptive immune responses, antibody discovery, and lymphoid cancer phylogenetics. This analysis compares four prominent tools—HILARy, partis, Change-O, and SCOPer—within the context of a thesis focused on HILARy's methodology and performance.

  • HILARy (Hierarchical clustering for Lineage Analysis of Repertoires): A method leveraging hierarchical clustering based on sequence similarity thresholds and V/J gene annotations. It is designed for high-throughput analysis of large-scale repertoire data, balancing computational efficiency with accurate lineage grouping.
  • partis: A probabilistic framework that uses hidden Markov models (HMMs) for BCR annotation and clustering. It estimates parameters from data for germline inference and somatic hypermutation (SHM) modeling, offering high accuracy at increased computational cost.
  • Change-O: A suite of tools for advanced BCR repertoire analysis. Its DefineClones.py script performs single-linkage clustering based on nucleotide or amino acid distance thresholds, often requiring prior annotation from tools like IMGT/HighV-QUEST.
  • SCOPer (Spectral Clustering Of Paired-end Reads): A spectral clustering algorithm designed specifically to handle the complexities of single-cell paired heavy and light chain data, preserving natural pairings while clustering sequences.

Table 1: Core Algorithm & Quantitative Performance Comparison

Tool Core Algorithm Primary Input Key Strengths Reported Accuracy* (F1-score/Precision) Computational Demand
HILARy Hierarchical clustering with adaptive thresholds Annotated sequences (V/J, CDR3) Speed, scalability for bulk data, intuitive thresholds ~0.92-0.95 (on simulated bulk data) Low-Medium
partis HMM-based probabilistic clustering Raw FASTQ reads High accuracy, integrated annotation/germline inference, SHM modeling ~0.96-0.98 (on simulated data) High
Change-O Single-linkage clustering Annotated sequences (e.g., from IMGT) Flexibility, integrates with extensive downstream analysis pipeline ~0.90-0.94 (depends on annotation source) Low
SCOPer Spectral clustering Paired heavy-light chain sequences Preserves natural pairings, effective for complex single-cell data ~0.94-0.97 (on paired-cell simulations) Medium-High

*Accuracy metrics are approximate and dataset-dependent. Benchmarks typically use simulated repertoires with known ground truth.

Table 2: Contextual Application Suitability

Feature HILARy partis Change-O SCOPer
Optimal Data Type Bulk Ig-seq (e.g., RNA) Bulk Ig-seq from raw reads Pre-annotated bulk sequences Single-cell BCR-seq (paired)
Germline Inference Requires external tool Integrated, sophisticated Requires external tool (e.g., IgBLAST) Limited, often uses external
SHM Modeling No Yes, detailed Post-hoc analysis (e.g., BASELINe) Within inferred clones
Output Integration Clonal tables Clonal tables, annotated FASTA Comprehensive Change-O/Immcatation formats Clonal networks, pairings

Experimental Protocols

Protocol 1: Benchmarking Clonal Inference Accuracy Using Simulated Data Objective: To quantitatively compare the clonal grouping performance of HILARy, partis, Change-O, and SCOPer.

  • Data Simulation: Use IGoR or AbSim to generate a synthetic BCR repertoire dataset with known, true clonal families. Include parameters for SHM frequency (~5-15%), diverse V/J gene usage, and varying clone sizes.
  • Tool Execution:
    • HILARy: Run HILARy on the simulated sequences (pre-annotated with V/J genes and CDR3 regions using IgBLAST). Use default distance thresholds (e.g., nucleotide Hamming distance = 0.1).
    • partis: Run partis partition directly on the raw simulated FASTQ reads.
    • Change-O: Annotate simulated sequences with IgBLAST. Run DefineClones.py with distance threshold = 0.1 (nucleotide).
    • SCOPer: For paired data simulation, run SCOPer spectral clustering on the concatenated heavy-light chain sequences.
  • Validation: Compare the inferred clusters from each tool to the ground truth simulation labels. Calculate precision, recall, and F1-score for clonal assignment.

Protocol 2: Processing Human PBMC Bulk BCR-seq Data Objective: To apply each tool to real-world human peripheral blood mononuclear cell (PBMC) repertoire data.

  • Sample Prep & Sequencing: Isolate RNA from PBMCs, perform reverse transcription with gene-specific primers for IGH genes, and sequence on an Illumina platform (2x300 bp).
  • Preprocessing: Use pRESTO for read quality control, masking of primers/adapters, merging paired-end reads, and filtering out non-functional sequences.
  • Clonal Grouping Paths:
    • Path A (HILARy/Change-O): Annotate merged FASTA with IgBLAST against IMGT reference. Run HILARy or Change-O's DefineClones.py on the output.
    • Path B (partis): Provide the preprocessed (but unannotated) FASTA directly to partis partition.
  • Analysis: Compare the number of clones, clone size distributions, and SHM levels per clone across tools.

Protocol 3: Single-Cell BCR-seq Analysis with SCOPer and HILARy Adaptation Objective: To evaluate clustering on paired heavy-light chain data.

  • Single-Cell Library: Generate data using 10x Genomics Chromium or similar platform yielding linked V(D)J sequences.
  • Cellranger: Use cellranger vdj for initial cell calling, assembly, and annotation of paired contigs.
  • Clonal Grouping: Feed the paired heavy-light chain FASTA files to SCOPer for spectral clustering that respects natural pairings.
  • Comparative Analysis: As a contrast, separate heavy chains from the pairs and run them through HILARy using standard bulk settings. Compare the resulting clone assignments and the preservation of light chain information.

Visualizations

G Start Input: Pre-annotated BCR Sequences HILARy HILARy Hierarchical Clustering (Adaptive Threshold) Start->HILARy Bulk Data partis partis HMM Probabilistic Clustering Start->partis Raw Reads ChangeO Change-O Single-Linkage Clustering Start->ChangeO Annotated Data SCOPer SCOPer Spectral Clustering Start->SCOPer Paired H-L Data Output Output: Clonal Families HILARy->Output partis->Output ChangeO->Output SCOPer->Output

Clonal Inference Tool Selection Workflow

H Seq BCR Sequence Reads VJAnnot V/J Gene & CDR3 Annotation (e.g., IgBLAST) Seq->VJAnnot DistMat Calculate Pairwise Distance Matrix VJAnnot->DistMat Hierarch Agglomerative Hierarchical Clustering DistMat->Hierarch Thresh Apply Distance Threshold (Cut Tree) Hierarch->Thresh Clusters Clonal Families Thresh->Clusters

HILARy Hierarchical Clustering Process

The Scientist's Toolkit: Key Research Reagent Solutions

Table 3: Essential Materials for BCR Clonal Inference Workflows

Item Function & Application
IMGT/GENE-DB Reference Gold-standard database of immunoglobulin gene alleles. Essential for accurate V(D)J gene annotation.
IgBLAST Command-line tool from NCBI for aligning BCR sequences to germline references. Provides detailed annotation.
pRESTO Toolkit Suite of Python scripts for preprocessing raw sequencing reads: quality filtering, merging, deduplication.
Synthetic BCR Libraries (e.g., from IGoR) Generate ground-truth simulated repertoire data for benchmarking algorithm accuracy.
10x Genomics Chromium Single Cell V(D)J Kit Commercial solution for generating linked heavy-light chain BCR sequences from single cells.
MiXCR Alternative integrated software for end-to-end analysis (alignment, assembly, clustering). Useful for cross-validation.
Immcatation Database Online resource and database schema for storing, sharing, and analyzing annotated immune repertoire data.

Within repertoire sequencing research, inferring B-cell or T-cell clonal families from high-throughput sequencing data is a foundational step. A variety of clustering methods exist, each with distinct algorithmic approaches. This Application Note provides a detailed analysis of the HILARy (High-throughput lymphocyte Analysis by Reconstruction) method, contrasting it with other prevalent techniques to guide researchers and drug development professionals in selecting the optimal tool for their experimental goals.

Comparison of Clonal Family Inference Methods

The following table summarizes the core characteristics, strengths, and limitations of HILARy against other common methods.

Table 1: Quantitative and Qualitative Comparison of Clustering Methods

Method Core Algorithm Primary Strength Key Limitation Optimal Use Case
HILARy Expectation-Maximization on V(D)J junctions + Phylogeny Integrates lineage tree likelihood; models hypermutation. Computationally intensive. Best for somatic hypermutation (SHM)-rich repertoires (e.g., antigen-experienced B cells).
Change-O (DEFINE) / GLIPH2 Hierarchical clustering on Hamming distance / TCR motif Fast, highly sensitive to small clones. Ignores SHM; may split clones with high mutation. Initial broad screening; TCR specificity groups.
Partis Hidden Markov Model (HMM) on full V(D)J High accuracy annotating V/D/J and inferring naive ancestor. High resource demand for large datasets. Detailed annotation and naive sequence reconstruction.
Decombinator / mixcr Rule-based CDR3 identification + clustering Extremely fast, standardized pipeline. Less accurate for highly mutated sequences. High-volume initial processing and annotation.

Table 2: Performance Metrics on Benchmark Datasets*

Method Precision (Mean) Recall (Mean) F1-Score (Mean) Avg. Runtime (10^5 seqs)
HILARy 0.95 0.88 0.91 ~8 hours
Change-O (DEFINE) 0.91 0.85 0.88 ~15 minutes
Partis 0.97 0.90 0.93 ~6 hours
mixcr 0.89 0.92 0.90 ~10 minutes

*Synthetic benchmark data simulating human B-cell repertoires with varying SHM levels (0-15%). Runtime is approximate and system-dependent.

Detailed Protocol for HILARy Clonal Inference

This protocol is designed for B-cell receptor (BCR) heavy chain repertoire sequencing data.

I. Preprocessing and Input Preparation

  • Sequence Annotation: Use IgBLAST or mixcr to align sequences to IMGT reference genes. Output must include: (a) V, D, J gene calls, (b) nucleotide CDR3 sequence, (c) alignment details.
  • Data Formatting: Convert annotations to HILARy's required format (FASTA with IMGT-gapped sequences). The CreateGermlines.py tool (from Change-O suite) can infer the germline V segment sequence for each read.
  • Quality Filtering: Remove sequences with stop codons in CDR3, low alignment scores, or non-productive rearrangements.

II. Running HILARy Clustering

Critical Parameters:

  • --dist: Initial Hamming distance threshold for pre-clustering (CDR3 nucleotide). Adjust based on error rate.
  • --iter: Maximum number of EM iterations. Increase for complex datasets.
  • --collapse: Collapse unique sequences while preserving duplication counts.

III. Post-processing and Output Interpretation

  • Output Files: Key outputs include *_clones.txt (clone assignments) and *_trees.json (lineage trees per clone).
  • Validation: Assess clone size distribution. Unexpectedly many singletons may indicate overly stringent clustering.
  • Downstream Analysis: Use Dowser (compatible toolkit) to analyze and visualize the inferred phylogenetic trees for SHM patterns and selection pressure.

Visualization of Method Selection and Workflow

G Start Input: BCR/TCR Seq Data Q1 Is the repertoire highly somatically hypermutated? Start->Q1 Q2 Is computational speed the primary constraint? Q1->Q2 No A_HIL Choose HILARy Q1->A_HIL Yes Q3 Is naive ancestor inference a key goal? Q2->Q3 No A_Mix Choose mixcr Q2->A_Mix Yes A_ChO Choose Change-O/DEFINE Q3->A_ChO No A_Par Choose Partis Q3->A_Par Yes

Selection Logic for Clustering Methods

G Raw Raw FASTQ Reads Ann Alignment & Annotation (tool: IgBLAST/mixcr) Raw->Ann Filt Quality Filtering Ann->Filt PreC Pre-cluster by V/J & CDR3 (tool: DefineClones.py) Filt->PreC HIL HILARy Core EM Algorithm - Model SHM - Infer Lineage Trees PreC->HIL Out Clone Tables & Lineage Trees HIL->Out Down Downstream Analysis (Clonal Dynamics, Selection) Out->Down

HILARy Experimental Workflow

The Scientist's Toolkit: Essential Research Reagents & Software

Table 3: Key Resources for HILARy-based Repertoire Analysis

Item Function & Relevance
IMGT/GENE-DB Reference Database Gold-standard reference for V, D, J gene alleles. Essential for accurate initial sequence alignment.
IgBLAST or mixcr Software for performing the initial V(D)J alignment and annotation. Creates the necessary input for HILARy.
Change-O Toolkit Suite Provides essential utilities (DefineClones.py, CreateGermlines.py) for data reformatting and pre-clustering.
HILARy Software Package Core software implementing the expectation-maximization and phylogenetic inference algorithm.
Dowser Package Specialized R package for analyzing and visualizing phylogenetic trees output by HILARy.
Synthetic Benchmark Datasets Known-truth datasets (e.g., from AbSynth) for validating pipeline performance and tuning parameters.
High-Memory Compute Node HILARy's EM algorithm is memory and CPU intensive; >32GB RAM and multiple cores are recommended.

HILARy is the method of choice when the research question centers on the phylogenetic history and somatic hypermutation patterns of B-cell clonal families, such as in studies of affinity maturation, vaccine response, or chronic infection. Its primary strength is integrating clonal partitioning with lineage tree inference, offering a biologically nuanced model at the cost of computational speed. For rapid, large-scale screening or analysis of minimally mutated repertoires (e.g., naive B cells or most TCR studies), faster methods like Change-O or mixcr are more appropriate. The selection framework and protocols provided here enable informed methodological decisions in repertoire sequencing research.

1. Introduction and Thesis Context This Application Note details a methodology for evaluating the consistency of clonal family inference tools, a central challenge in B-cell repertoire sequencing (Rep-Seq) analysis. The study is framed within a broader thesis on HIgh-throughput Lymphocyte Antigen Receptor (HILARy) clonal family inference, which posits that methodological discrepancies in clonotyping significantly impact downstream biological interpretations, such as tracking vaccine-induced B-cell lineages or identifying therapeutic antibody candidates. We apply multiple publicly available clonotyping tools to a standard public COVID-19 Rep-Seq dataset to assess concordance.

2. Data Source and Pre-processing

  • Dataset: ARchive of B-cell Immunoglobulin Sequences (AbSeq) from COVID-19 convalescent patients (e.g., Study PRJNA629089). Raw FASTQ files for IgG+ memory B-cell repertoires were downloaded.
  • Pre-processing Protocol:
    • Quality Control & Merging: Use fastp (v0.23.2) with parameters --detect_adapter_for_pe --merge --merged_out to trim adapters, remove low-quality bases (Q<20), and merge paired-end reads.
    • Alignment & Assembly: Align merged reads to IMGT reference V, D, J genes using IgBLAST (v1.19.0) with the -organism human flag. Generate AIRR-compliant Rearrangement tables (.tsv).
    • Data Curation: Filter productive sequences only (sequence_alignment starts with 'C', no stop codons). Remove sequences with low confidence V gene assignment (v_identity < 0.95).

3. Clonal Inference Tool Application Protocol Four tools representing different algorithmic approaches were applied to the same pre-processed AIRR.tsv file.

  • Tool 1: Change-O (DefineClones.py) - Single-linkage hierarchical clustering.

    • Command: DefineClones.py -d <input.tsv> --act set --model ham --norm len --dist 0.10
    • Key Parameter: Hamming distance threshold (0.10 for nucleotide).
  • Tool 2: scoper (spectralClustering) - K-means-like clustering on phylogenetic distance.

    • R Script:

  • Tool 3: immuneSIM (for synthetic ground truth comparison) - In silico repertoire generation.

    • R Script:

  • Tool 4: partis (v0.17.0) - HMM-based annotation and clustering.

    • Command: partis annotate --infname input.fasta --outfname partis_output.yaml --all-annotations

4. Results and Quantitative Comparison Key metrics were extracted from each tool's output for the top 10 most expanded clones (by read count) in a sample.

Table 1: Clonal Assignment Concordance Across Tools

Clone Rank (by Tool1 Count) Tool1 (Change-O) Clone ID Tool2 (scoper) Clone ID Tool3 (partis) Clone ID Sequences in Intersection % Agreement (Pairwise, Tool1 vs.)
1 Clone_1 Cluster_A Group_alpha 1250 89% (vs. Tool2), 78% (vs. Tool3)
2 Clone_2 Cluster_B Group_beta 980 95% (vs. Tool2), 82% (vs. Tool3)
3 Clone_3 Cluster_C Group_gamma 450 75% (vs. Tool2), 65% (vs. Tool3)
... ... ... ... ... ...
Aggregate (Top 10) 10 distinct 10 distinct 14 distinct - Avg: 86% (T1vT2), 75% (T1vT3)

Table 2: Tool Performance and Runtime Metrics

Tool Algorithm Type Key Distance Metric Computational Time (per 10k seq) Memory Peak (GB) Outputs AIRR Format?
Change-O Hierarchical Clustering Hamming (nucleotide) ~2 min 1.2 Yes
scoper Spectral Clustering Hamming (AA) ~5 min 2.5 Yes
partis HMM Gluing Phylogenetic ~45 min 8.0 No (custom YAML)

5. Visualization of Analysis Workflow

G Start Public Dataset (COVID-19 Rep-Seq FASTQ) P1 Pre-processing (fastp, IgBLAST) Start->P1 P2 AIRR-formatted Sequence Table (.tsv) P1->P2 T1 Tool 1: Change-O P2->T1 T2 Tool 2: scoper P2->T2 T3 Tool 3: partis P2->T3 C1 Clonal Assignment Results (per tool) T1->C1 T2->C1 T3->C1 C2 Concordance Analysis & Metric Calculation C1->C2 End Consistency Report & HILARy Thesis Input C2->End

Title: Workflow for Multi-Tool Clonotype Consistency Study

6. The Scientist's Toolkit: Research Reagent Solutions

Item/Category Example Product/Software Function in HILARy Clonal Inference
Rep-Seq Wet-Lab Kit 10x Genomics Chromium Next GEM Single Cell 5' v2 Enables linked V(D)J and gene expression profiling from single B cells.
Sequence Annotation IMGT/HighV-QUEST, IgBLAST Provides standardized germline V/D/J gene assignment and sequence annotation.
Clonal Grouping Tool Change-O, scoper, partis, DADA2 (for denoising) Algorithms to cluster sequences originating from the same progenitor B cell.
Analysis Suite Immcantation Portal (pRESTO, Change-O, alakazam) A standardized pipeline suite for Rep-Seq data from raw reads to statistical analysis.
Synthetic Control immuneSIM, OLGA Generates in silico repertoires with known clonal relationships to benchmark tools.
Visualization & Reporting Dowser (for lineage trees), ggplot2 (R), AIRR Community Python libs Enables visualization of clonal lineages, diversity metrics, and publication-quality figures.
Data Standard AIRR Data Representation Standard Critical schema for data sharing and ensuring interoperability between different tools.

Conclusion

HILARy provides a robust and conceptually clear framework for inferring B cell clonal families from repertoire sequencing data, essential for deciphering adaptive immune responses. This guide has traversed from foundational biology through practical implementation, optimization, and validation. The key takeaway is that successful clonal inference requires a synergistic approach: pairing a well-understood algorithm like HILARy with rigorous data preprocessing, parameter optimization tailored to the biological question, and validation against benchmarks. For biomedical research, accurate clonal tracing is no longer a niche bioinformatics task but a critical component for discovering broad-neutralizing antibodies, understanding dysregulation in cancer and autoimmunity, and evaluating vaccine efficacy at a clonal level. Future directions point towards integrating single-cell multi-omics data, applying machine learning to refine lineage relationships, and developing standardized benchmarking platforms to propel the field towards more reproducible and clinically actionable insights.