Decoding B Cell Lineages: A Comprehensive Guide to HILARy Clonal Family Inference from Repertoire Sequencing Data

Layla Richardson Jan 12, 2026

Abstract

This article provides a detailed guide to HILARy (High-precision Inference of Lineages in Antibody Repertoires), a computational method for inferring B cell clonal families from high-throughput immune repertoire sequencing (Rep-Seq) data. Aimed at researchers and drug development professionals, we cover the foundational principles of B cell receptor diversification and the necessity of clonal inference. We then detail HILARy's methodological workflow, from data preprocessing to phylogenetic tree construction, and its applications in vaccine response tracking and autoimmune disease research. A dedicated troubleshooting section addresses common data quality and algorithmic challenges, while a comparative analysis validates HILARy against tools like partis and Change-O. The conclusion synthesizes best practices, highlights the translational impact on therapeutic antibody discovery and personalized medicine, and outlines future computational and experimental directions.

Understanding B Cell Clonality: Why HILARy Inference is Fundamental to Immunology Research

Application Notes

This document details experimental frameworks for studying B cell clonal dynamics, with data integrated into the HILARy computational pipeline. The goal is to infer clonal families from B cell receptor (BCR) repertoire sequencing (Rep-Seq) data to elucidate the biological processes of somatic hypermutation (SHM), clonal expansion, and affinity maturation, all of which are critical for vaccine and therapeutic antibody development.

Key Quantitative Benchmarks in Affinity Maturation Studies

The following table summarizes typical quantitative outcomes from germinal center reactions and in vitro maturation experiments.

Parameter	Typical Range/Value	Experimental Context	Relevance to HILARy Inference
SHM Rate (per bp per division)	~10⁻³ to 10⁻⁵	In vivo GC B cells	Basis for phylogenetic tree construction.
Average Mutation Load (IgG)	10-30 nucleotides	Memory B cells post-immunization	Distinguishes naive from expanded clones.
Clonal Expansion Factor	1x to >10,000x	Antigen-specific B cell numbers	Inferred from read depth and unique sequences.
Affinity Increase (Kd)	10 nM to 10 pM (100-10,000x)	After 4-6 rounds of in vitro maturation	Validates functional outcome of inferred lineages.
Productive Rearrangement %	~33% (1 in 3)	From bulk BCR-Seq	Filters non-functional sequences pre-inference.
Dominant Clone Frequency	Can exceed 50% of antigen-specific response	Response to protein antigens	Identifies key lineages for therapeutic ablation.

HILARy Integration Context: The quantitative data above provides prior expectations for the algorithm. For instance, mutation rates inform the nucleotide substitution model, while expansion factors help differentiate true clonal expansion from PCR duplication artifacts.

Experimental Protocols

Protocol 1: Antigen-Specific B Cell Isolation and BCR Repertoire Sequencing

Objective: Generate high-fidelity BCR heavy-chain (IGH) repertoire data from antigen-specific B cells for HILARy lineage inference.

Materials:

Antigen Conjugates: Biotinylated antigen of interest for fluorescent tagging.
Magnetic/Cell Sorting: Streptavidin-coated magnetic beads or FACS Aria.
Cell Lysis Buffer: For single-cell RNA/DNA extraction.
Reverse Transcription Primers: Oligo-dT or gene-specific primers for Ig constant regions.
Multiplex PCR Primers: V-gene family-specific forward primers and isotype-specific reverse primers.
High-Fidelity DNA Polymerase: e.g., Q5 or KAPA HiFi to minimize PCR errors.
Next-Generation Sequencing Platform: Illumina MiSeq with a 2x300 bp kit (or another platform offering comparable paired-end read length) for full-length VDJ.

Methodology:

Cell Staining & Sorting: a. Suspend PBMCs or splenocytes in FACS buffer. b. Stain with fluorescently labeled (e.g., PE) antigen-biotin-streptavidin complex. c. Include antibodies for B cell markers (CD19+, CD20+) and exclusion markers (CD3-, CD14-). d. Sort double-positive (antigen+, CD19+) single B cells into 96-well plates containing lysis buffer. Sort an equivalent number of antigen-negative B cells as a control.

Single-Cell BCR Amplification: a. Perform reverse transcription using a primer that targets the Ig constant region and carries a unique molecular identifier (UMI), so that each cDNA molecule is tagged before any amplification. b. Conduct a first-round multiplex PCR using V-gene family primers. c. Perform a second-round nested PCR to add barcoded Illumina adapters. d. Purify amplicons.
Library Preparation & Sequencing: a. Quantify purified PCR products and pool equimolarly. b. Prepare sequencing library following platform-specific protocols. c. Sequence on an Illumina platform to achieve >1000x coverage per cell.
Data Processing for HILARy: a. Use tools like pRESTO or MiXCR for UMI-aware consensus assembly, V(D)J alignment, and error correction. b. Output a filtered, high-quality FASTA file of IGH VDJ nucleotide sequences with associated metadata (isotype, UMI count). c. This curated FASTA is the direct input for the HILARy inference pipeline.

Protocol 2: In Vitro Affinity Maturation & Kinetic Analysis

Objective: Validate the functional significance of inferred clonal lineages by expressing antibodies and measuring affinity improvements correlating with SHM patterns.

Materials:

Expression Vectors: Mammalian (e.g., HEK293) or prokaryotic (e.g., scFv phage display) systems.
Site-Directed Mutagenesis Kits: To revert or introduce specific mutations from lineage.
Surface Plasmon Resonance (SPR) Chip: CM5 sensor chip for antigen immobilization.
BLI (Bio-Layer Interferometry) System: Alternative to SPR for kinetic measurements.

Methodology:

Antibody Expression: a. Synthesize and clone the VDJ sequences of putative progenitor and descendant antibodies from the HILARy-inferred tree into IgG1 expression vectors. b. Co-transfect heavy and light chain plasmids into Expi293F cells using a transfection reagent. c. Harvest supernatant after 5-7 days, purify antibodies using Protein A affinity chromatography.

Affinity Measurement via SPR: a. Dilute antigen to 10-50 µg/mL in sodium acetate buffer (pH 4.5-5.5) and immobilize on a CM5 chip via amine coupling to reach ~100-200 Response Units (RU). b. Use HBS-EP+ (10 mM HEPES, 150 mM NaCl, 3 mM EDTA, 0.05% v/v Surfactant P20, pH 7.4) as running buffer. c. Inject purified antibodies at 5 concentrations (e.g., 0.8 nM to 100 nM) over the antigen surface at a flow rate of 30 µL/min. d. Regenerate the surface with 10 mM Glycine-HCl (pH 2.0). e. Fit the resulting sensorgrams to a 1:1 Langmuir binding model using the Biacore Evaluation Software to calculate association rate (k_a), dissociation rate (k_d), and equilibrium dissociation constant (K_D = k_d/k_a).
Data Correlation: a. Plot the calculated K_D values against the mutational distance from the inferred germline sequence for each antibody. b. Statistically test (e.g., linear regression) for a correlation between increased affinity (lower K_D) and branch length in the HILARy phylogenetic tree.

Visualizations

Title: HILARy Repertoire Analysis and Validation Workflow

Title: Germinal Center Pathway for Affinity Maturation

The Scientist's Toolkit: Research Reagent Solutions

Item	Function in SHM/Clonal Expansion Research
Biotinylated Antigen	Critical for isolating antigen-specific B cells via streptavidin beads or FACS.
Anti-Human CD19/20 & Isotype Antibodies	For positive selection of B lymphocytes and isotype switching analysis.
Single-Cell Lysis & RT Kit	Preserves RNA from individual sorted B cells for accurate VDJ amplification.
Multiplex Ig Primer Sets	Amplifies the full diversity of V genes from limited template for RepSeq.
Unique Molecular Identifiers (UMIs)	Short random nucleotide tags that distinguish true biological sequences from PCR errors.
High-Fidelity Polymerase	Essential for minimizing polymerase errors during library prep to accurately call SHMs.
IgG Expression Vectors	Mammalian vectors for recombinant expression of inferred lineage antibodies.
Protein A/G Agarose	For purification of recombinant IgG from culture supernatants for binding assays.
SPR/BLI Consumables	Sensor chips and buffers for kinetic analysis of antibody-antigen interactions.
AID Inhibitor	Chemical probe to inhibit AID activity in vitro, validating its role in observed SHM.

Application Notes: Integrating HILARy Clonal Family Inference into the Rep-Seq Pipeline

Repertoire sequencing (Rep-Seq) generates vast datasets of adaptive immune receptor sequences. The core challenge is transforming these raw sequences into biological insights about clonal expansion, diversity, and antigenic drivers. The HILARy framework addresses a critical bottleneck: accurately inferring clonal families (groups of cells descended from a common progenitor) from noisy, high-throughput sequencing data. Accurate clonal grouping is the foundational step for all downstream analyses, including identifying pathogenic or therapeutic clones.

Table 1: Quantitative Comparison of Key Clonal Inference Methods

Method	Core Algorithm	Key Strength	Primary Limitation	Typical Runtime (1M reads)
HILARy	Hierarchical clustering with probabilistic thresholding & phylogenetic refinement	High accuracy in distinguishing true somatic hypermutation from PCR/sequencing error; integrates V/J gene identity.	Computationally intensive for ultra-deep repertoires.	~60 minutes
partis	Hidden Markov Model (HMM) with likelihood-based hierarchical agglomeration	Simultaneously annotates V(D)J segments and clusters clones; models the recombination process.	Can be complex to parameterize for non-standard species.	~45 minutes
Change-O	Single-linkage clustering on Hamming distance	Fast, simple, and highly customizable with user-defined thresholds.	Accuracy highly dependent on user-selected, fixed distance thresholds.	~15 minutes
Decombinator	Tag-based clustering followed by annotation	Extremely fast initial clustering using unique molecular identifiers (UMIs) and core tags.	Less effective for highly mutated sequences where core tags diverge.	~5 minutes

The HILARy protocol emphasizes a two-stage validation: first, in silico validation using simulated datasets with known ground truth; second, experimental validation using spike-in control clones or paired single-cell sequencing data to confirm clonal relationships.

Detailed Protocols

Protocol 1: HILARy-Based Clonal Family Inference from Raw FASTQ Files

Objective: To process raw Rep-Seq reads into annotated, clonally grouped data ready for immune repertoire analysis.

Materials & Reagent Solutions:

Raw Sequencing Data: Paired-end FASTQ files from platforms like Illumina MiSeq/NextSeq.
UMI-tagged Libraries: Essential for error correction and accurate PCR duplicate removal.
HILARy Software Suite: Available via GitHub, requires installation of dependencies (Python >=3.8, R >=4.0).
Reference Databases: IMGT/V-QUEST database for germline V, D, J gene alignment.
High-Performance Computing (HPC) Cluster: Recommended for full-scale analysis.

Procedure:

Preprocessing & Alignment:
- Use pRESTO or MiXCR to perform quality filtering, paired-read assembly, and UMI-based consensus building.
- Align consensus sequences to germline V, D, and J gene segments using IgBLAST or integrated alignment within partis.
Clonal Inference with HILARy:
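- Execute the primary HILARy inference command on the annotated sequences, shown here as hilarity infer --input annotated_seq.json --output clusters.json (the command name and flags are illustrative; confirm them against the documentation of the installed HILARy release).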
- The algorithm performs: a. Initial grouping by identical V and J gene assignments and CDR3 length. b. Hierarchical clustering within groups using a normalized Hamming distance metric on the CDR3 nucleotide sequence. c. Application of a model-based threshold that accounts for sequencing error and somatic hypermutation rate, rather than a fixed distance cutoff. d. Optional phylogenetic refinement to resolve ambiguous edges.
Post-processing:
- Generate a clonal abundance table (counts per clone) and a detailed lineage report.
- Export data in standard formats (.tsv, .json) for downstream tools like Immunarch or VDJtools.
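Steps (a) and (b) of the inference stage above can be illustrated with a short Python sketch. The record layout (keys id, v_gene, j_gene, cdr3_nt) is a hypothetical stand-in for whatever the annotation step produces, and the code is not taken from the HILARy code base; it only shows the binning and distance logic described in the text.

```python
from collections import defaultdict
from itertools import combinations


def bin_by_vj_length(records):
    """Group sequence ids by (V gene, J gene, CDR3 nucleotide length)."""
    bins = defaultdict(list)
    for rec in records:
        key = (rec["v_gene"], rec["j_gene"], len(rec["cdr3_nt"]))
        bins[key].append(rec["id"])
    return dict(bins)


def normalized_hamming(a, b):
    """Fraction of mismatching positions between two equal-length strings."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


# Hypothetical annotated records (field names are assumptions).
records = [
    {"id": "s1", "v_gene": "IGHV1-2", "j_gene": "IGHJ4", "cdr3_nt": "TGTGCGAGAGAT"},
    {"id": "s2", "v_gene": "IGHV1-2", "j_gene": "IGHJ4", "cdr3_nt": "TGTGCGAGAGAC"},
    {"id": "s3", "v_gene": "IGHV3-23", "j_gene": "IGHJ6", "cdr3_nt": "TGTGCGAAAGAT"},
    {"id": "s4", "v_gene": "IGHV1-2", "j_gene": "IGHJ4", "cdr3_nt": "TGTGCGAGA"},
]

bins = bin_by_vj_length(records)
cdr3_of = {rec["id"]: rec["cdr3_nt"] for rec in records}

# Pairwise distances are only computed inside each bin.
for key, ids in bins.items():
    for a, b in combinations(ids, 2):
        print(key, a, b, normalized_hamming(cdr3_of[a], cdr3_of[b]))
```

Restricting comparisons to bins keeps the number of pairwise distance calculations far below the all-vs-all count for the whole repertoire, which is the main reason for grouping first.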

Protocol 2: Experimental Validation of Inferred Clones via Single-Cell Sequencing

Objective: To validate computationally inferred clonal families using paired single-cell RNA-seq (scRNA-seq) with V(D)J enrichment.

Materials & Reagent Solutions:

Cryopreserved PBMCs or Tissue Sample: From the same donor as bulk Rep-Seq.
Chromium Controller & Kit (10x Genomics): For single-cell partitioning and library prep (e.g., Single Cell 5' v2 with Feature Barcoding).
Cell Ranger VDJ Pipeline: For processing single-cell V(D)J data.
Custom R/Python Scripts: For cross-modality data integration.

Procedure:

Single-Cell Library Preparation: Prepare libraries according to the 10x Genomics protocol for Gene Expression and V(D)J enrichment.
Sequencing & Data Processing: Sequence on an Illumina platform and process through the cellranger vdj pipeline to obtain per-cell clonotype calls.
Integration with HILARy Output:
- Map the bulk Rep-Seq clones (from HILARy) to single-cell clonotypes by comparing CDR3 nucleotide sequences and V/J gene usage.
- Validate that sequences HILARy grouped into one clone are found in single cells sharing the same clonotype.
- Calculate validation metrics: Precision (What fraction of computationally inferred clones are confirmed by single-cell data?) and Recall (What fraction of single-cell clonotypes were captured by the bulk inference?).
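The two validation metrics can be made concrete with a small sketch. The definitions used here are one possible operationalization and are assumptions: a bulk clone counts as "confirmed" when all of its sequences that are also seen in single cells map to a single clonotype, precision is computed only over clones with at least one such match, and sequences are matched on a hypothetical (V gene, J gene, CDR3 nucleotide) key.

```python
def validate_against_single_cell(bulk_clone_of, sc_clonotype_of):
    """Compare bulk clone calls with single-cell clonotype calls.

    Both arguments map a sequence key, e.g. (v_gene, j_gene, cdr3_nt),
    to a clone or clonotype identifier.
    Returns (precision, recall) under the assumptions stated above.
    """
    shared = bulk_clone_of.keys() & sc_clonotype_of.keys()

    # Single-cell clonotypes observed within each bulk clone.
    clonotypes_per_clone = {}
    for key in shared:
        clonotypes_per_clone.setdefault(bulk_clone_of[key], set()).add(
            sc_clonotype_of[key]
        )

    evaluable = len(clonotypes_per_clone)
    confirmed = sum(1 for cts in clonotypes_per_clone.values() if len(cts) == 1)
    precision = confirmed / evaluable if evaluable else float("nan")

    captured = {sc_clonotype_of[key] for key in shared}
    all_clonotypes = set(sc_clonotype_of.values())
    recall = len(captured) / len(all_clonotypes) if all_clonotypes else float("nan")
    return precision, recall


# Hypothetical example with short placeholder keys.
bulk = {"k1": "A", "k2": "A", "k3": "B", "k4": "B", "k5": "C"}
single_cell = {"k1": "X", "k2": "X", "k3": "Y", "k4": "Z", "k6": "W"}
precision, recall = validate_against_single_cell(bulk, single_cell)
print(precision, recall)
```

In this example clone A is confirmed, clone B is split across two clonotypes, and clone C cannot be evaluated because none of its sequences appear in the single-cell data.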

Visualizations

Title: Core Rep-Seq Workflow with HILARy

Title: HILARy Clonal Inference Steps

The Scientist's Toolkit: Research Reagent & Software Solutions

Table 2: Essential Resources for Rep-Seq Analysis

Item	Function/Description	Example Vendor/Resource
UMI Adapters	Unique Molecular Identifiers linked to each starting molecule during library prep, enabling accurate error correction and removal of PCR duplicates.	IDT, Twist Bioscience
IMGT Database	The international reference for immunoglobulin and T-cell receptor germline gene sequences. Critical for accurate V(D)J alignment.	IMGT.org
IgBLAST	Standard tool for aligning antigen receptor sequences to germline V, D, and J genes.	NCBI
pRESTO Toolkit	Suite of Python scripts for processing raw Rep-Seq reads (quality control, assembly, UMI handling).	pRESTO on GitHub
MiXCR	Comprehensive, all-in-one software for Rep-Seq data analysis from raw reads to clonal quantification.	MiXCR by Milaboratory
Immunarch R Package	Powerful R package for downstream repertoire analysis, visualization, and diversity estimation.	Immunarch on GitHub
10x Genomics Chromium	Platform for generating paired single-cell gene expression and V(D)J data for experimental validation.	10x Genomics
Cell Ranger	Official software suite for processing data from 10x Genomics single-cell V(D)J experiments.	10x Genomics

In B-cell and T-cell receptor (TCR) repertoire sequencing, a clonal family comprises a set of lymphocyte descendants originating from a single, antigen-naïve progenitor. Accurate clonal family inference is fundamental for studying adaptive immune responses, tracking clonal expansion in disease, and identifying targets for therapeutic development. This protocol details the core concepts and methodologies for defining clonal families within the context of the HILARy framework, integrating V(D)J gene sharing, CDR3 similarity, and probabilistic germline reconstruction.

Key Concepts & Data

Clonal members originate from the same germline V, D, and J gene segments. Allelic differences must be accounted for.

Table 1: Criteria for V(D)J Gene Assignment in Clonal Grouping

Gene Segment	Matching Requirement	Common Allele Handling Method
V Gene	Identical gene, allowing for allelic ambiguity	IMGT database alignment with a 97-100% identity threshold.
D Gene	Identical gene (highly permissive due to trimming)	Required for BCR heavy chains and TCR β/δ chains. Often inferred.
J Gene	Identical gene	Critical for junctional boundary definition.

CDR3 Amino Acid Sequence Similarity

The complementarity-determining region 3 (CDR3) is the hypervariable core of the antigen-binding site. Clonal relatives exhibit highly similar CDR3 sequences.

Table 2: Common CDR3 Similarity Metrics & Thresholds

Metric	Description	Typical Clonal Threshold
Hamming Distance	Count of amino acid substitutions.	≤ 2 for sequences of equal length.
Levenshtein Distance	Count of insertions, deletions, and substitutions.	≤ 3-4, adjusted for sequence length.
Normalized Identity Score	(Identical positions) / (alignment length).	≥ 0.85 (85% identity).
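The three metrics in the table can be written out explicitly. The following is a minimal, dependency-free sketch; for simplicity the normalized identity function assumes equal-length sequences, so the alignment length equals the sequence length, and the example CDR3 strings are hypothetical.

```python
def hamming(a, b):
    """Number of substituted positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions (row-wise DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion from a
                            curr[j - 1] + 1,      # insertion into a
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def normalized_identity(a, b):
    """Identical positions divided by length (equal-length sequences only)."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


print(hamming("CARDYW", "CARDFW"))
print(levenshtein("CARDYW", "CARDW"))
print(normalized_identity("CARDYW", "CARDFW"))
```

Hamming distance is the cheapest of the three and suffices once sequences have been grouped by CDR3 length; Levenshtein distance is only needed when length differences must be tolerated.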

Germline Sequence Reconstruction

Inference of the original, unmutated germline V(D)J sequence of the founding B-cell is essential for studying somatic hypermutation (SHM) in B-cell lineages.

Table 3: Germline Reconstruction Algorithm Comparison

Algorithm/Tool	Core Methodology	Best For
Partis	Hidden Markov Model (HMM) based Bayesian inference.	High-accuracy BCR reconstruction with SHM.
IgPhyML	Phylogenetic model incorporating selection and mutation.	Evolutionary analysis of clonal trees.
SONAR	Combined alignment and phylogenetic approach.	Longitudinal tracing of antibody lineages.

Application Notes & Protocols

Protocol 1: Initial Clonal Grouping via V(D)J and CDR3

Objective: Cluster raw repertoire sequencing reads into preliminary clonal families.

Materials: Pre-processed, annotated sequence data (FASTQ/FASTA with VDJ assignments from IgBLAST, MiXCR, or IMGT/HighV-QUEST).

Procedure:

Gene-based Grouping: Partition all sequences into bins sharing identical V gene and J gene assignments.
CDR3 Length Filter: Within each bin, subgroup sequences by identical CDR3 nucleotide length.
Similarity Clustering: For each length-based subgroup, perform single-linkage clustering based on CDR3 nucleotide Levenshtein distance (e.g., threshold = 1).
Validation: Manually inspect clusters from high-frequency bins for potential over-splitting due to sequencing errors.
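The similarity clustering step can be sketched with a union-find structure, which is a common way to implement single-linkage grouping. Because sequences in a subgroup share the same length, a Levenshtein distance of at most 1 is equivalent to a Hamming distance of at most 1, so the sketch uses the cheaper Hamming comparison; for larger thresholds the two metrics can differ. The function name and example sequences are hypothetical.

```python
def single_linkage(seqs, threshold=1):
    """Single-linkage clustering of equal-length sequences.

    Two sequences are linked when their Hamming distance is <= threshold;
    clusters are the connected components of the resulting graph.
    """
    parent = list(range(len(seqs)))

    def find(i):
        # Walk to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_j] = root_i

    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            distance = sum(a != b for a, b in zip(seqs[i], seqs[j]))
            if distance <= threshold:
                union(i, j)

    clusters = {}
    for i, seq in enumerate(seqs):
        clusters.setdefault(find(i), []).append(seq)
    return sorted(sorted(members) for members in clusters.values())


clusters = single_linkage(["AAAA", "AAAT", "AATT", "GGGG"], threshold=1)
print(clusters)
```

Note the chaining behavior typical of single linkage: "AAAA" and "AATT" differ at two positions but still end up together because each is within one mismatch of "AAAT".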

Protocol 2: Germline Reconstruction and Probabilistic Refinement

Objective: Reconstruct the germline progenitor sequence and refine clonal boundaries using a probabilistic model.

Materials: Preliminary clonal clusters from Protocol 1.

Procedure:

Input Preparation: For each preliminary cluster, extract multiple sequence alignment of V(D)J regions.
Germline Inference: Run the Partis algorithm (partis partition --infname input.csv) to simultaneously infer the most likely germline sequence and reassign sequences to clades based on a joint probability model of SHM and common ancestry.
Tree Construction: For each refined clonal family, build a phylogenetic tree using IgPhyML to visualize somatic evolution and validate lineage relationships.
Output: A final list of clonal families, each with a consensus germline sequence, all member sequences, and a phylogenetic tree.

Protocol 3: Validation by Synthetic Repertoires

Objective: Benchmark clonal inference accuracy using ground-truth synthetic data.

Materials: Synthetic immune repertoire data (e.g., from ImmuneSIM or OLGA).

Procedure:

Data Generation: Generate a synthetic repertoire with known clonal families, incorporating realistic mutation rates and sequencing error models.
Run Inference: Process the synthetic data through Protocols 1 and 2.
Calculate Metrics: Compare inferred families to ground truth using precision, recall, and F1-score.
- Precision: (True Positives) / (All Inferred Family Members)
- Recall: (True Positives) / (All True Family Members)
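One common way to turn these definitions into numbers is to count pairs of sequences: a true positive is a pair placed in the same family by both the ground truth and the inference. The sketch below follows that pairwise convention, which is an assumption rather than something the protocol prescribes; the assignments are hypothetical, and undefined ratios are reported as 0.0.

```python
from itertools import combinations


def co_clustered_pairs(assignment):
    """Return the set of unordered id pairs that share a family label."""
    families = {}
    for seq_id, family in assignment.items():
        families.setdefault(family, []).append(seq_id)
    pairs = set()
    for members in families.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


def pairwise_scores(truth, inferred):
    """Pairwise precision, recall and F1 of inferred families versus truth."""
    true_pairs = co_clustered_pairs(truth)
    inferred_pairs = co_clustered_pairs(inferred)
    true_positives = len(true_pairs & inferred_pairs)
    precision = true_positives / len(inferred_pairs) if inferred_pairs else 0.0
    recall = true_positives / len(true_pairs) if true_pairs else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical ground truth and inference for five sequences.
truth = {"s1": "A", "s2": "A", "s3": "A", "s4": "B", "s5": "B"}
inferred = {"s1": 1, "s2": 1, "s3": 2, "s4": 2, "s5": 2}
print(pairwise_scores(truth, inferred))
```

Pairwise scoring penalizes both over-merging (lower precision) and over-splitting (lower recall) without requiring inferred family labels to be matched to true labels.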

The Scientist's Toolkit

Table 4: Essential Research Reagent Solutions for Repertoire Sequencing & Clonal Analysis

Reagent / Material	Function & Application
5' RACE Primer Systems	Ensures capture of full-length V(D)J transcripts from RNA for unbiased repertoire prep.
Unique Molecular Identifiers (UMIs)	Short random nucleotide tags attached to each cDNA molecule to correct for PCR amplification bias and sequencing errors.
High-Fidelity Polymerase	Critical for accurate amplification of diverse receptor sequences with minimal error introduction.
IMGT Reference Directory	The gold-standard database of germline V, D, and J gene alleles for alignment and annotation.
Spike-in Synthetic Standards	Known sequences added to samples to quantify sequencing depth and calibrate error rates.
Barcode-Compatible Sequencing Kit	Enables multiplexed, high-throughput sequencing of multiple samples on platforms like Illumina MiSeq/NextSeq.

Visualization of Workflows and Relationships

HILARy Clonal Inference Workflow

Clonal Family Evolution from Germline

Logic of Clonal Family Definition

Application Notes

HILARy is a computational tool designed to infer clonal families from Adaptive Immune Receptor Repertoire Sequencing (AIRR-seq) data. Its core innovation lies in a multi-step hierarchical clustering approach that moves beyond single-linkage clustering on V/J gene identity and CDR3 length, integrating sequence similarity to define biologically relevant B-cell or T-cell receptor lineages.

Within the broader thesis of clonal family inference, HILARy occupies a critical niche. It addresses the inherent trade-off between specificity and sensitivity in lineage definition. Simpler methods may over-cluster dissimilar sequences or under-cluster highly mutated relatives. HILARy's hierarchical approach aims to balance these by constructing a tree of potential family relationships, allowing for dynamic cutoff selection based on sequence composition and mutation load.

Table 1: Comparison of Key Clustering Methods in AIRR-seq Analysis

Method	Core Clustering Principle	Key Inputs	Primary Output	Strengths	Limitations
Single-linkage (CDR3-based)	Pairs sequences by exact CDR3 AA identity & V/J gene.	Nucleotide sequences, V/J calls.	Groups of identical CDR3s.	Simple, fast.	Fails to group somatic variants; misses expanded families.
Network-based (e.g., SONAR)	Connects nodes (sequences) based on graph distance thresholds.	Aligned sequences, genetic distance.	Network graphs of related sequences.	Visualizes complex relationships.	Can be computationally heavy; global threshold may not suit all families.
HILARy (Hierarchical)	Agglomerative clustering with multiple, adaptive thresholds.	V/J genes, CDR3 nucleotide, sequence alignment.	Hierarchical tree with defined clonal groups.	Adapts to mutation level; captures nuanced relationships.	Computational cost higher than single-linkage.
Phylogeny-guided	Builds phylogenetic trees from multiple sequence alignments.	High-quality MSA, evolutionary model.	Rooted phylogenetic trees.	Models evolutionary history.	Computationally intensive; requires curated input.

HILARy's algorithm typically proceeds through defined strata: 1) Primary grouping by V gene, J gene, and CDR3 length. 2) Secondary clustering within these groups based on nucleotide sequence similarity of the CDR3 region. 3) Iterative pairwise comparison and tree construction, often using a Hamming distance metric, to merge clusters. The final cluster assignment can be made by cutting the hierarchical tree at a distance threshold that may be informed by the estimated mutation rate.

Experimental Protocols

Protocol 1: Standard HILARy Clonal Family Inference from AIRR-seq Data

Objective: To process raw AIRR-seq data into defined clonal families using the HILARy hierarchical clustering approach.

I. Input Data Preparation

Starting Material: Demultiplexed FASTQ files from B-cell or T-cell receptor sequencing (e.g., IgH, TCRβ).
Sequence Annotation:
- Use tools like IgBLAST, MiXCR, or IMGT/HighV-QUEST to align sequences and assign V, D, J genes, and define the CDR3 region.
- Output must be in standardized AIRR-compliant format (e.g., .tsv) with columns for sequence_id, v_call, j_call, junction (CDR3 nucleotide), and junction_aa.
Data Filtering:
- Remove sequences without a productive V-J assignment.
- Remove sequences with stop codons within the CDR3.
- Optional: Remove sequences with low read counts or perceived PCR errors (using tools like pRESTO).
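The filtering rules can be applied to an AIRR-style table with the standard library alone. In this sketch "productive" is approximated as having both a V and a J call plus an in-frame junction, which is an assumption made for illustration; real AIRR files often carry an explicit productive column that should be preferred when present. The example rows are hypothetical.

```python
import csv
import io


def keep_record(row):
    """Return True if a row passes the V/J, reading-frame and stop-codon checks."""
    has_vj = bool(row.get("v_call")) and bool(row.get("j_call"))
    junction = row.get("junction") or ""
    in_frame = len(junction) > 0 and len(junction) % 3 == 0
    no_stop = "*" not in (row.get("junction_aa") or "")
    return has_vj and in_frame and no_stop


def filter_airr(handle):
    """Read a tab-separated AIRR-style table and return the rows that pass."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row for row in reader if keep_record(row)]


example_tsv = "\n".join([
    "sequence_id\tv_call\tj_call\tjunction\tjunction_aa",
    "s1\tIGHV1-2*02\tIGHJ4*02\tTGTGCGAGATGG\tCARW",     # passes all checks
    "s2\tIGHV1-2*02\t\tTGTGCGAGATGG\tCARW",             # missing J call
    "s3\tIGHV3-23*01\tIGHJ6*02\tTGTGCGTAGTGG\tCA*W",    # stop codon in junction
    "s4\tIGHV3-23*01\tIGHJ6*02\tTGTGCGAGATG\tCAR",      # junction out of frame
])

kept = filter_airr(io.StringIO(example_tsv))
kept_ids = [row["sequence_id"] for row in kept]
print(kept_ids)
```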

II. HILARy Clustering Execution

Primary Clustering (Gene & Length Binning):
- Group all sequences by identical v_call (or major allele), identical j_call, and identical nucleotide length of the junction field.
- Output: Initial bins of sequences presumed related by common ancestry.
Hierarchical Agglomerative Clustering within Bins:
- For each bin: a. Perform all-vs-all pairwise alignment of the junction nucleotide sequences. b. Calculate genetic distance (e.g., Hamming distance for equal-length sequences, normalized by junction length so that it can be compared with a fractional cutoff). c. Construct a distance matrix. d. Apply an agglomerative hierarchical clustering algorithm (e.g., average-linkage) to the distance matrix to build a tree.
Tree Cutting & Cluster Definition:
- Cut the hierarchical tree using a distance threshold (d). Common practice sets d ≤ 0.1 (10% nucleotide difference) for B-cell receptors to account for somatic hypermutation, but this is tunable.
- Sequences within each resulting sub-tree are assigned a shared clone_id.
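The within-bin clustering and tree cutting described above can be sketched as a naive average-linkage procedure in pure Python. It is a didactic illustration under stated assumptions (equal-length junctions, normalized Hamming distance, a hypothetical function name), not the HILARy implementation, and its cubic cost makes it suitable only for small bins. Because average linkage produces monotonically increasing merge heights, stopping once the closest pair of clusters exceeds the cutoff gives the same groups as cutting the full tree at that height.

```python
def normalized_hamming(a, b):
    """Fraction of mismatching positions between equal-length sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def average_linkage_clusters(seqs, cutoff=0.1):
    """Agglomerate sequences by average linkage and cut at `cutoff`."""
    n = len(seqs)
    # Pairwise distance matrix stored as a dict keyed by (i, j) with i < j.
    dist = {(i, j): normalized_hamming(seqs[i], seqs[j])
            for i in range(n) for j in range(i + 1, n)}

    def average_distance(c1, c2):
        total = sum(dist[(min(i, j), max(i, j))] for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    clusters = [[i] for i in range(n)]
    while len(clusters) > 1:
        # Find the closest pair of clusters.
        d, a, b = min((average_distance(clusters[a], clusters[b]), a, b)
                      for a in range(len(clusters))
                      for b in range(a + 1, len(clusters)))
        if d > cutoff:
            break  # every remaining merge would exceed the cut height
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(seqs[i] for i in members) for members in clusters]


# Hypothetical 20-nt junctions: three close variants and one unrelated sequence.
s0 = "A" * 20
s1 = "A" * 19 + "C"
s2 = "A" * 18 + "CC"
s3 = "G" * 20
clone_groups = average_linkage_clusters([s0, s1, s2, s3], cutoff=0.1)
print(clone_groups)
```

Each returned group would then receive its own clone_id. For production-scale bins, a library routine with an optimized linkage algorithm is a more realistic choice than this quadratic-memory sketch.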

III. Post-processing & Validation

Lineage Consolidation: Review clusters for potential merging if sub-clusters share a common ancestor just beyond the strict cutoff, based on biological plausibility.
Output Generation: Create a final table with sequence_id, clone_id, and all annotation fields. Generate summary statistics: number of clones, clone size distribution, etc.
Validation: Perform basic sanity checks:
- Visualize clone size distribution (typically heavy-tailed, often approximately power-law).
- Align sequences within a large clone to confirm shared mutations and common ancestry.
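The summary statistics mentioned above can be derived directly from the clone_id column. The following sketch assumes a plain list of clone identifiers, one per sequence; the identifiers are hypothetical.

```python
from collections import Counter


def clone_size_distribution(clone_ids):
    """Return (sizes, histogram).

    sizes maps clone_id -> number of sequences;
    histogram maps clone size -> number of clones of that size.
    """
    sizes = Counter(clone_ids)
    histogram = Counter(sizes.values())
    return dict(sizes), dict(sorted(histogram.items()))


# One clone_id per sequence, e.g. read from the final output table.
clone_ids = ["c1", "c1", "c1", "c2", "c3", "c3", "c4"]
sizes, histogram = clone_size_distribution(clone_ids)

n_clones = len(sizes)
singleton_fraction = histogram.get(1, 0) / n_clones
largest_clone_fraction = max(sizes.values()) / len(clone_ids)
print(n_clones, histogram, singleton_fraction, largest_clone_fraction)
```

Plotting the histogram on log-log axes is a quick visual check of whether the distribution is heavy-tailed.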

Protocol 2: Validating HILARy Clusters via Phylogenetic Analysis

Objective: To independently validate the biological relevance of a HILARy-inferred clonal family by constructing a maximum-likelihood phylogenetic tree.

Select Clone of Interest: Choose a large or biologically significant clone from HILARy's output.
Multiple Sequence Alignment (MSA): Extract the full V(D)J nucleotide sequences for all members. Perform a high-quality MSA using MAFFT or Clustal Omega.
Model Selection & Tree Building: Use IQ-TREE or RAxML to:
- Find the best-fit nucleotide substitution model.
- Construct a maximum-likelihood phylogenetic tree (with 1000 bootstrap replicates).
Comparison: Overlay the HILARy cluster assignment on the phylogenetic tree leaf nodes. A valid HILARy cluster should form a distinct, well-supported monophyletic clade on the phylogenetic tree, confirming its inference of common ancestry.
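The monophyly check in the comparison step can be expressed as a small recursive function. The sketch assumes a rooted tree already parsed into nested tuples with leaf names as strings, which is a hypothetical representation chosen for brevity; trees written by IQ-TREE or RAxML would first need to be read from Newick with a phylogenetics library, and bootstrap support is not considered here.

```python
def leaves(node):
    """Return the set of leaf names below a node (a str leaf or a tuple of children)."""
    if isinstance(node, str):
        return {node}
    found = set()
    for child in node:
        found |= leaves(child)
    return found


def is_monophyletic(tree, members):
    """True if some subtree contains exactly the given members and nothing else."""
    members = set(members)
    if leaves(tree) == members:
        return True
    if isinstance(tree, str):
        return False
    return any(is_monophyletic(child, members) for child in tree)


# Hypothetical rooted tree over five sequences.
tree = ((("a", "b"), "c"), ("d", "e"))
print(is_monophyletic(tree, {"a", "b", "c"}))  # members form one clade
print(is_monophyletic(tree, {"a", "d"}))       # members are split across clades
```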

Diagrams

HILARy Clustering Workflow

HILARy Hierarchical Binning & Clustering

The Scientist's Toolkit

Table 2: Essential Research Reagent Solutions for HILARy Workflow

Item	Function in HILARy/AIRR-seq Analysis	Example/Note
AIRR-seq Library Prep Kit	Prepares cDNA libraries from RNA/DNA of B/T cells for NGS, incorporating unique molecular identifiers (UMIs).	Kits from 10x Genomics, iRepertoire, or Takara Bio. Essential for reducing PCR amplification bias.
IGH/TCR Reference Databases	Provides germline V, D, J gene sequences for accurate alignment and annotation.	IMGT, VDJServer databases. Critical for the first binning step in HILARy.
Sequence Annotation Pipeline	Software that aligns raw reads to reference genes, identifies CDR3s, and assigns V(D)J genes.	`IgBLAST`, `MiXCR`, `IMGT/HighV-QUEST`. Generates the structured input for HILARy.
HILARy Software Package	The core executable or script that performs the hierarchical clustering algorithm on annotated sequences.	Available via GitHub.
High-Performance Computing (HPC) Environment	Provides the computational resources for pairwise distance calculations and hierarchical clustering on large datasets.	Local server cluster or cloud computing (AWS, Google Cloud). Necessary for scaling analysis.
Phylogenetic Analysis Suite	Independent validation tool to assess the monophyly of inferred clusters.	`IQ-TREE`, `RAxML`, `PhyML`. Used in Protocol 2 for biological validation.
AIRR Data Visualization Tool	Software for visualizing clone size distributions, lineage trees, and sequence alignments post-HILARy.	`Alakazam`, `VDJviz`, `Immunarch`. Helps interpret and present clustering results.

Accurate inference of B-cell and T-cell clonal families from repertoire sequencing (Rep-Seq) data is a cornerstone of modern immunology. Within the broader thesis on HILARy clonal family inference, precise clonal tracking transforms raw sequencing data into biologically and clinically meaningful insights. This document outlines key research questions and provides detailed application notes and protocols that leverage accurate clonal inference.

Enabling Key Research Questions

Table 1: Core Research Questions Enabled by Accurate Clonal Inference

Research Domain	Key Enabling Question	Primary Output Metric	Clinical/Basic Relevance
Basic Immunology	How do antigen-driven selection pressures shape clonal lineage evolution over time?	Normalized Shannon Entropy of clonal tree; Selection strength (dN/dS ratio)	Basic
Vaccine Development	What defines the breadth, potency, and durability of antigen-specific clonal responses?	Clonal Expansion Index; Persistence time (weeks); Somatic Hypermutation (SHM) rate	Translational
Autoimmunity & Cancer	How do autoreactive or tumor-infiltrating lymphocyte (TIL) clones expand, diversify, and correlate with disease activity?	Public clone frequency; Clone size skewness; T-cell receptor (TCR) convergence score	Clinical
Immunotherapy Monitoring	Which T-cell clones expand post-checkpoint blockade or CAR-T therapy and correlate with response/toxicity?	Maximum clone frequency fold-change; Diversity pre/post (1-D Simpson Index)	Clinical Biomarker
Infection & Immune Memory	What is the clonal architecture of long-lived memory B/T cell pools following infection or vaccination?	Memory/Naive clone ratio; Clonal genealogy depth (tree nodes); SHM burden	Basic/Translational
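Two of the output metrics in the table, normalized Shannon entropy and the Simpson diversity index (1 - D), are simple functions of clone sizes. The sketch below applies them to a list of per-clone counts, which is an assumption for illustration; the table also refers to entropy over clonal trees, which would need a tree-based definition not covered here.

```python
import math


def normalized_shannon_entropy(counts):
    """Shannon entropy of clone frequencies divided by its maximum, log(n clones)."""
    total = sum(counts)
    freqs = [c / total for c in counts if c > 0]
    if len(freqs) <= 1:
        return 0.0  # a single clone carries no diversity
    entropy = -sum(p * math.log(p) for p in freqs)
    return entropy / math.log(len(freqs))


def simpson_diversity(counts):
    """1 - D, where D is the sum of squared clone frequencies."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


# Hypothetical clone sizes before and after treatment.
pre_treatment = [10, 10, 10, 10]   # even repertoire
post_treatment = [37, 1, 1, 1]     # one dominant clone
for label, counts in (("pre", pre_treatment), ("post", post_treatment)):
    print(label, normalized_shannon_entropy(counts), simpson_diversity(counts))
```

Both indices fall when a few clones dominate, which is the signature expected after strong clonal expansion.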

Detailed Application Notes & Protocols

Protocol: Longitudinal Tracking of Antigen-Specific Clonal Dynamics

Objective: To quantify the expansion, contraction, and somatic evolution of antigen-specific B-cell clones following immunization.

Workflow Diagram:

Title: Workflow for Tracking B-cell Clonal Dynamics

Materials & Reagents:

Table 2: Research Reagent Solutions for Longitudinal Clonal Tracking

Item	Function	Example Product/Cat. No.
Lymphoprep	Density gradient medium for PBMC isolation	STEMCELL Technologies, 07801
CD19 MicroBeads, human	Positive selection of B cells	Miltenyi Biotec, 130-050-301
SMARTer Human B-Cell Receptor	cDNA synthesis & amplification of IgH transcripts	Takara Bio, 634414
MiSeq Reagent Kit v3 (600-cycle)	High-throughput paired-end sequencing	Illumina, MS-102-3003
HILARy Clustering Software	V(D)J alignment, error correction, clonal family inference	HILARy-C GitHub
ELISpot Kit (Antigen-Specific)	Functional validation of identified clones	Mabtech, HUMAN IFN-γ/IL-21

Procedure:

Sample Collection & Processing: Collect peripheral blood at multiple time points (e.g., Day 0, 7, 14, 28, 180). Isolate PBMCs using Lymphoprep per manufacturer's protocol.
B-cell Enrichment: Isolate CD19+ B cells using magnetic-activated cell sorting (MACS) with CD19 MicroBeads.
Library Preparation: Extract total RNA. Generate B-cell receptor (BCR) amplicon libraries using the SMARTer Human B-Cell Receptor kit, targeting the IGH variable region.
Sequencing: Pool libraries and sequence on an Illumina MiSeq platform using a 2x300 bp paired-end kit to achieve >100,000 reads per sample.
Clonal Inference Analysis:
- Process raw FASTQ files through the HILARy pipeline:
  - Align reads to IMGT reference sequences.
  - Correct PCR and sequencing errors.
  - Clustering: Group sequences into clones using a hierarchical clustering algorithm based on V/J gene identity and CDR3 nucleotide similarity (default threshold: 0.85).
  - Lineage Tree Building: Construct maximum parsimony trees for each clone using SHM patterns.
Longitudinal Data Integration: Use custom R/Python scripts to track clone IDs across time points. Calculate Clone Frequency (% of total reads), SHM Burden (mutations per sequence), and Tree Complexity (number of nodes, branch length).
Functional Correlation: For clones of interest (e.g., high frequency, high SHM), synthesize recombinant antibodies for neutralization assays or correlate expansion with antigen-specific B-cell ELISpot data.
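
The longitudinal integration step can be sketched in a few lines of Python. This is a minimal illustration, not part of the HILARy pipeline: the record layout (time point, clone ID, read count) and the function names are assumptions chosen for the example.

```python
from collections import Counter, defaultdict

def clone_frequencies(records):
    """Return {clone_id: {timepoint: percent of reads at that timepoint}}.

    records: iterable of (timepoint, clone_id, read_count) tuples
    (a hypothetical layout used only for this sketch).
    """
    totals = Counter()                 # total reads per time point
    per_clone = defaultdict(Counter)   # reads per clone per time point
    for timepoint, clone_id, reads in records:
        totals[timepoint] += reads
        per_clone[clone_id][timepoint] += reads
    return {
        clone: {tp: 100.0 * n / totals[tp] for tp, n in counts.items()}
        for clone, counts in per_clone.items()
    }

def shm_burden(mutation_counts):
    """Mean number of mutations per sequence for one clone."""
    return sum(mutation_counts) / len(mutation_counts)

# Toy data: two clones observed at two time points.
records = [
    ("Day0", "clone_1", 10), ("Day0", "clone_2", 90),
    ("Day7", "clone_1", 60), ("Day7", "clone_2", 40),
]
freqs = clone_frequencies(records)
for clone, by_time in sorted(freqs.items()):
    print(clone, by_time)
print("SHM burden:", shm_burden([3, 5, 7]))
```

Frequencies are computed per time point so that samples sequenced at different depths remain comparable.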

Protocol: Identifying Tumor-Reactive T-cell Clones for Biomarker Discovery

Objective: To identify and characterize tumor-infiltrating lymphocyte (TIL) clones that expand upon immune checkpoint inhibitor (ICI) therapy and correlate with clinical response.

Pathway Diagram:

Title: Tumor-Reactive T-cell Clone Expansion and Response Pathway

Materials & Reagents: Table 3: Research Reagent Solutions for TIL Clonal Biomarker Discovery

Item	Function	Example Product/Cat. No.
Tumor Dissociation Kit, human	Gentle enzymatic dissociation of solid tumors	Miltenyi Biotec, 130-095-929
CD8+ T Cell Isolation Kit, human	Enrichment of CD8+ T cells from TILs or PBMCs	STEMCELL Technologies, 17953
TCRβ Kit for RNA-Seq	Template-switch based TCR repertoire profiling	Takara Bio, 634409
Cell Ranger V(D)J	Primary analysis pipeline for TCR sequencing	10x Genomics, Software Suite
Clonotype Tracking Software (e.g., LICORN)	Cross-sample clonotype matching & tracking	LICORN
IFN-γ Secretion Assay Detection Kit	Functional validation of reactive clones	Miltenyi Biotec, 130-054-202

Procedure:

Sample Procurement: Obtain matched tumor biopsy (fresh or frozen) and peripheral blood samples pre-therapy and at an on-treatment timepoint (e.g., 6-12 weeks).
Single-Cell Suspension: Process tumor tissue using a human Tumor Dissociation Kit. Isolate PBMCs from blood via Ficoll gradient.
T-cell Enrichment (Optional): Isolate CD8+ T cells from TILs and PBMCs using negative selection kits.
TCR Sequencing Library Prep: For bulk analysis, extract total RNA and prepare TCRβ libraries using the Takara Bio kit. For single-cell resolution, use the 10x Genomics 5' Immune Profiling solution.
Clonal Inference & Tracking:
- Process data: For bulk data, use MiXCR. For single-cell data, use the 10x Cell Ranger V(D)J pipeline, which is designed for 10x Genomics single-cell libraries rather than bulk amplicons.
- Define clonotypes based on identical CDR3β amino acid sequences.
- Use a clonal tracking tool (e.g., LICORN) to identify "Expanded Shared Clonotypes" present in both tumor and post-therapy blood, with a significant increase in frequency (>5-fold).
Biomarker Correlation: Calculate the Clonal Expansion Score (CES) = Σ(Frequency_post-blood of shared clones). Correlate CES with clinical metrics (RECIST response, progression-free survival) using statistical tests (e.g., Cox proportional hazards model).
Functional Validation: For top candidate clones, sort single T cells expressing the identified TCR, clone the TCRα/β genes, and express them in reporter cells. Test reactivity against autologous tumor organoids or peptide-MHC multimers.
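
The Clonal Expansion Score defined above can be computed directly from per-clonotype frequency tables. The sketch below is one possible reading of the definition: the dictionary inputs, the function name, and the small frequency floor used for clonotypes that were undetected before therapy are all assumptions made for illustration.

```python
def clonal_expansion_score(tumor, blood_pre, blood_post, min_fold=5.0, floor=1e-6):
    """Sum post-therapy blood frequencies of expanded shared clonotypes.

    Each argument maps CDR3-beta amino acid sequence -> frequency (0..1).
    A clonotype counts as an "expanded shared clonotype" if it is present in
    the tumor and its blood frequency rose by more than `min_fold`.
    `floor` (an assumption) stands in for clonotypes undetected pre-therapy.
    """
    expanded = []
    for cdr3, post in blood_post.items():
        if cdr3 not in tumor:
            continue                      # not shared with the tumor
        pre = max(blood_pre.get(cdr3, 0.0), floor)
        if post / pre > min_fold:
            expanded.append(cdr3)
    score = sum(blood_post[c] for c in expanded)
    return score, sorted(expanded)

# Toy frequencies for three clonotypes.
tumor = {"CASSLGQAYEQYF": 0.05, "CASSPDRGNTEAFF": 0.02}
blood_pre = {"CASSLGQAYEQYF": 0.001, "CASSPDRGNTEAFF": 0.004}
blood_post = {"CASSLGQAYEQYF": 0.012, "CASSPDRGNTEAFF": 0.008, "CASRTGESYEQYF": 0.03}

ces, expanded = clonal_expansion_score(tumor, blood_pre, blood_post)
print(ces, expanded)
```

In this toy example only the first clonotype is both tumor-shared and expanded more than 5-fold, so it alone contributes to the score.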

Data Presentation and Analysis

Table 4: Example Quantitative Output from a Melanoma Anti-PD-1 Therapy Study

Patient ID	Clinical Response	Pre-Treatment TCR Richness	Post-Treatment CES	# of Expanded Shared Clones	Max Clone Freq. in Blood (Post)
PT-01	Complete Response	45,623	0.087	12	2.41%
PT-02	Partial Response	38,451	0.041	5	1.22%
PT-03	Stable Disease	51,889	0.015	3	0.67%
PT-04	Progressive Disease	41,007	0.005	1	0.11%

Note: TCR Richness: Estimated number of distinct clonotypes; CES: Clonal Expansion Score.

Accurate clonal inference via methods like HILARy is not merely a computational task but a foundational tool that enables researchers to address profound questions in immunology and clinical oncology. The protocols outlined here provide a roadmap for translating repertoire sequencing data into insights about immune dynamics, with direct applications in developing prognostic biomarkers and monitoring therapeutic efficacy.

A Step-by-Step Workflow: Implementing HILARy for Clonal Family Inference in Practice

Within the broader thesis on HILARy clonal family inference from adaptive immune repertoire sequencing, accurate data preprocessing is the critical first step. This protocol details the preparation of raw FASTQ files from bulk or single-cell B/T cell receptor sequencing and their submission to IMGT/HighV-QUEST for comprehensive V(D)J gene annotation. Reliable annotation forms the foundation for the downstream clonotype definition, lineage reconstruction, and somatic hypermutation analysis that the HILARy framework depends on.

Research Reagent Solutions Toolkit

Table 1: Essential Reagents and Tools for Repertoire Sequencing Library Prep and Analysis

Item	Function/Description
UMI-containing RT Primers	Unique Molecular Identifiers (UMIs) enable PCR duplicate removal and accurate molecule counting, critical for quantitative clonal analysis.
Multiplex PCR Primers for V Gene Families	Primer sets designed to amplify the diverse V gene segments with minimal bias, often using multiplexed, semi-degenerate approaches.
High-Fidelity DNA Polymerase	Essential for amplification with low error rates to minimize sequencing artifacts mistaken for somatic hypermutation.
Dual-Indexed Sequencing Adapters	Allow for sample multiplexing and reduce index hopping artifacts in Illumina platforms.
Size Selection Beads (e.g., SPRI)	For post-amplification clean-up and selection of correct amplicon size, removing primer dimers and large contaminants.
IMGT/HighV-QUEST	The international reference tool for standardized V(D)J gene and allele assignment, junction analysis, and amino acid translation.
pRESTO / IgBLAST / MiXCR	Alternative or complementary tools for initial read quality control, assembly, and local annotation.

Protocol: FASTQ File Preparation for IMGT Submission

Initial Quality Control and Demultiplexing

Raw Data Assessment: Using FastQC (v0.12.1), generate quality reports for all raw FASTQ files. Key metrics: Per-base sequence quality (Phred score >30 for core V(D)J sequence), adapter contamination, and sequence length distribution.
Demultiplexing: Use bcl2fastq (Illumina) or guppy_barcoder (Oxford Nanopore) to generate per-sample FASTQ files based on dual index reads. Verify expected read counts per sample.

Read Processing and Error Correction

This workflow is optimized for Illumina paired-end data with UMIs.

Merge Paired-End Reads: Use pRESTO (v0.7.1) AssemblePairs.py or PEAR to overlap R1 and R2, creating full-length amplicon sequences.
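Identify and Annotate UMIs/Cell Barcodes: Extract UMI and cell barcode sequences adjacent to the primer regions using pRESTO MaskPrimers.py (with its barcode option), and use ParseHeaders.py to manage the resulting header annotations. For single-cell data, associate reads with cell IDs.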
Quality Filtering: Filter reads based on merged length (e.g., 250-550 bp for human IgG) and average quality score (Phred >30).

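Deduplication by UMI: Group reads by UMI, align within groups, and build a consensus sequence to correct for PCR and sequencing errors. pRESTO's BuildConsensus.py (optionally preceded by ClusterSets.py to separate distinct molecules that share a UMI by chance) or UMI-tools can be used.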
Primer/Constant Region Masking: Mask constant region and primer sequences prior to IMGT submission to avoid interference with V(D)J assignment.
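
The quality-filtering and UMI-consensus steps above can be illustrated with a small standard-library sketch. It is not a substitute for pRESTO or UMI-tools: it assumes Phred+33 quality encoding, an arithmetic mean of Phred scores, and reads within a UMI group that are already the same length (no realignment). All function names are hypothetical.

```python
from collections import Counter, defaultdict

def mean_phred(quality, offset=33):
    """Arithmetic mean Phred score of a FASTQ quality string (Phred+33 assumed)."""
    if not quality:
        return 0.0
    return sum(ord(ch) - offset for ch in quality) / len(quality)

def passes_filter(seq, quality, min_len=250, max_len=550, min_q=30):
    """Length window and mean-quality filter using the example cutoffs from the text."""
    return min_len <= len(seq) <= max_len and mean_phred(quality) > min_q

def umi_consensus(reads):
    """Majority-vote consensus per UMI.

    reads: iterable of (umi, sequence). Reads whose length differs from the
    most common length in their UMI group are ignored (a simplification).
    """
    groups = defaultdict(list)
    for umi, seq in reads:
        groups[umi].append(seq)
    consensus = {}
    for umi, seqs in groups.items():
        length = Counter(len(s) for s in seqs).most_common(1)[0][0]
        same_len = [s for s in seqs if len(s) == length]
        consensus[umi] = "".join(
            Counter(column).most_common(1)[0][0] for column in zip(*same_len)
        )
    return consensus

reads = [("AAC", "ACGT"), ("AAC", "ACGT"), ("AAC", "ACCT"), ("GGT", "TTTT")]
print(umi_consensus(reads))
print(passes_filter("A" * 300, "I" * 300))  # 'I' encodes Phred 40
```

Majority voting removes the single discordant base in the first UMI group, which is the error-correcting effect the protocol relies on.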

Data Formatting for IMGT/HighV-QUEST

File Format Conversion: Ensure final sequences are in FASTA format. The header line should contain a unique sequence identifier.
Sequence Requirements: IMGT/HighV-QUEST requires nucleotide sequences of the rearranged V(D)J region. Ensure primers for V and J genes are trimmed or masked. The optimal length is between 250 and 500 nt.
Batch Splitting: For large datasets (>500,000 sequences), split FASTA files into batches of ≤ 300,000 sequences each, as per IMGT submission limits.
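
Batch splitting is straightforward to script. The following sketch parses FASTA text and yields batches no larger than a chosen size; the default of 300,000 comes from the step above, while the function names are illustrative assumptions.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def batches(records, max_per_batch=300_000):
    """Yield lists of at most `max_per_batch` records, preserving input order."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == max_per_batch:
            yield batch
            batch = []
    if batch:
        yield batch

# Toy example: five sequences split into batches of two.
fasta_text = ">s1\nACGT\n>s2\nAC\nGT\n>s3\nTTTT\n>s4\nGGGG\n>s5\nCCCC\n"
records = list(read_fasta(fasta_text.splitlines()))
sizes = [len(b) for b in batches(records, max_per_batch=2)]
print(sizes)
```

Each batch can then be written to its own FASTA file and submitted as a separate job.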

Protocol: Annotating V(D)J Genes with IMGT/HighV-QUEST

Online Submission and Parameter Selection

Access: Navigate to the IMGT/HighV-QUEST submission portal (https://www.imgt.org/HighV-QUEST/).
Upload: Upload the prepared FASTA file.
Parameter Configuration (Critical for HILARy):
- Species and Receptor Type: Select the correct species (e.g., Homo sapiens) and molecule type (Immunoglobulin or TR).
- Input Type: Choose "Rearranged nucleotide sequences."
- Results Detail: Select "Detailed view (+AA junction, +V-REGIONs, ...)" to obtain full amino acid translations and V-region alignments required for somatic hypermutation analysis.
- Allele Alignment Parameters: Use default parameters. The "Include results on alleles from the whole species" box is recommended for comprehensive allele assignment.

Interpretation of Key Output Files for Clonal Inference

Download the compressed result folder upon job completion. Key files include:

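1_Summary.txt: Per-sequence summary, including V, D, and J gene and allele assignments, functionality, and V-REGION identity. The aggregate statistics in Table 2 are derived from this file.
2_IMGT-gapped-nt-sequences.txt: Sequences with IMGT gapping (numbering for alignment).
3_Nt-sequences.txt: Nucleotide sequences of the annotated regions (V-REGION, FR and CDR segments, junction) for each input sequence.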
6_Junction.txt: Details of the CDR3 nucleotide and amino acid sequence, including P/N nucleotide identification.

Table 2: Key Quantitative Metrics from IMGT/HighV-QUEST 1_Summary.txt

Metric	Description	Relevance to HILARy Analysis
Total submitted sequences	Count of input FASTA entries.	Baseline for preprocessing efficiency.
Identified V-D-J rearrangements	Number of sequences with a V, (D), J assignment, whether productive or not.	Defines the starting set of potentially functional clones.
Productive sequences (%)	Percentage of sequences in-frame with no stop codon.	Primary filter for defining clonotypes.
V, D, J gene usage statistics	Frequency of each gene segment.	Identifies repertoire biases and informs prior probabilities.
Average mutation level (V-REGION)	Mean number of nucleotide substitutions in the V gene.	Central input for somatic hypermutation models in lineage construction.

Post-IMGT Processing for HILARy Input

Filter for Productive Sequences: Retain only sequences marked as "productive" in the IMGT output.
Extract Clonotype Signatures: For each sequence, define a preliminary clonotype key typically as: V_GENE + J_GENE + CDR3_AA_LENGTH. The exact nucleotide CDR3 sequence is used for precise grouping.
Collate Mutation Data: Parse the "V-REGION mutation" and "V-REGION identity %" fields to build the mutation matrix for sequences within each clonotype family.
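
A minimal sketch of the productive-sequence filter and the preliminary clonotype key (V gene + J gene + CDR3 amino acid length) follows. The row field names are hypothetical and do not match IMGT column headers exactly; the regular expression assumes gene calls shaped like "Homsap IGHV3-23*01 F" and strips the species prefix and allele.

```python
import re
from collections import defaultdict

# Matches an IG/TR gene name up to (but excluding) the allele suffix "*01".
GENE_RE = re.compile(r"(?:IG|TR)[A-Z0-9/-]+")

def gene_name(call):
    """Extract the gene (without allele) from an IMGT-style call string."""
    match = GENE_RE.search(call)
    return match.group(0) if match else None

def clonotype_key(v_call, j_call, cdr3_aa):
    """Preliminary key: (V gene, J gene, CDR3 amino acid length)."""
    return (gene_name(v_call), gene_name(j_call), len(cdr3_aa))

def group_by_key(rows):
    """Group productive sequences by clonotype key."""
    groups = defaultdict(list)
    for row in rows:
        if not row["functionality"].startswith("productive"):
            continue
        key = clonotype_key(row["v_call"], row["j_call"], row["cdr3_aa"])
        groups[key].append(row["sequence_id"])
    return dict(groups)

rows = [
    {"sequence_id": "s1", "functionality": "productive",
     "v_call": "Homsap IGHV3-23*01 F", "j_call": "Homsap IGHJ4*02 F", "cdr3_aa": "ARDYWG"},
    {"sequence_id": "s2", "functionality": "productive",
     "v_call": "Homsap IGHV3-23*04 F", "j_call": "Homsap IGHJ4*02 F", "cdr3_aa": "ARDFWG"},
    {"sequence_id": "s3", "functionality": "unproductive",
     "v_call": "Homsap IGHV1-69*01 F", "j_call": "Homsap IGHJ6*02 F", "cdr3_aa": "AR*YWG"},
]
groups = group_by_key(rows)
print(groups)
```

Grouping at the gene level rather than the allele level keeps sequences together when allele calls differ only because of somatic mutations.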

Visualized Workflows

Title: FASTQ to Annotated Data Workflow

IMGT/HighV-QUEST Analysis and Data Extraction Logic

Title: IMGT Data Processing for HILARy Input

Within the broader thesis on advanced clonal family inference from Adaptive Immune Receptor Repertoire Sequencing (AIRR-seq) data, the HILARy algorithm, presented here as hierarchical clustering based on nucleotide distance and germline proximity, represents a pivotal methodological advancement. It addresses the critical challenge of accurately grouping B-cell or T-cell receptor sequences into clonally related families, a foundational step for analyzing immune repertoire dynamics, somatic hypermutation patterns, and antigen-specific responses in vaccine development, oncology, and autoimmune disease research.

HILARy operates on two primary distance metrics, integrated into a hierarchical clustering framework.

Table 1: Core Distance Metrics in HILARy

Metric	Description	Calculation	Purpose in Clustering
Nucleotide Distance	Edit distance between the nucleotide sequences of Complementarity-Determining Region 3 (CDR3).	Hamming or Levenshtein distance, often normalized by CDR3 length.	Groups sequences with minimal somatic mutation divergence, indicating recent common ancestry.
Germline Proximity	Distance between the inferred germline Variable (V) and Joining (J) gene segments.	Boolean or weighted score based on identity of V and J gene assignments made against the IMGT germline reference (IMGT/GENE-DB).	Groups sequences that originate from the same germline rearrangement event, a prerequisite for clonality.

The algorithm typically employs an agglomerative hierarchical clustering approach, where sequences are initially individual clusters and are iteratively merged based on a composite distance measure combining the above metrics, until a user-defined threshold is reached.

Table 2: Typical HILARy Parameter Thresholds (from Literature)

Parameter	Typical Range	Impact on Clustering
Maximum CDR3 Nucleotide Distance	0.10 - 0.15 (normalized)	Lower value creates more, smaller clusters (strict). Higher value creates fewer, larger clusters (permissive).
V/J Gene Match Requirement	Must share identical V and J genes	Strict enforcement ensures only sequences from the same rearrangement are clustered.
Linkage Method (Agglomerative)	Single, Complete, or Average	Single linkage may chain distant sequences; Complete linkage is more conservative.

Application Notes & Protocols

Protocol 3.1: Input Data Preparation for HILARy Analysis

Objective: To process raw AIRR-seq data into the structured input required for the HILARy algorithm.

Workflow:

Sequence Processing: Use pipelines like pRESTO, Immcantation, or MiXCR to:
- Demultiplex raw reads.
- Perform quality filtering and merging (for paired-end reads).
- Identify and correct PCR/sequencing errors.
V(D)J Assignment: Annotate each high-quality sequence with its germline V, D (if applicable), and J genes using a tool like IgBLAST against the IMGT reference database.
CDR3 Extraction: Precisely extract the nucleotide and amino acid sequence of the CDR3 region based on conserved motifs (e.g., cysteine at 104, tryptophan/phenylalanine at 118, IMGT numbering).
Formatting: Compile the following into a tab-separated value (.tsv) file:
- Sequence ID
- Nucleotide CDR3 sequence
- Assigned V gene
- Assigned J gene
- (Optional) Read count or UMI count.
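
As an illustration of the formatting step, the sketch below writes such a table with Python's csv module. The column names are assumptions made for this example (no specific header is prescribed above), and rows whose CDR3 contains characters other than A, C, G, or T are skipped.

```python
import csv
import io

# Hypothetical column names for the five fields listed in the protocol.
FIELDS = ["sequence_id", "cdr3_nt", "v_gene", "j_gene", "umi_count"]

def write_input_table(rows, handle):
    """Write rows as TSV; skip rows with empty or ambiguous CDR3 sequences."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    written = 0
    for row in rows:
        cdr3 = row["cdr3_nt"].upper()
        if not cdr3 or set(cdr3) - set("ACGT"):
            continue
        writer.writerow(row)
        written += 1
    return written

rows = [
    {"sequence_id": "s1", "cdr3_nt": "TGTGCGAGAGATTACTGG",
     "v_gene": "IGHV3-23", "j_gene": "IGHJ4", "umi_count": 12},
    {"sequence_id": "s2", "cdr3_nt": "TGTGCGAGAGATTATTGG",
     "v_gene": "IGHV3-23", "j_gene": "IGHJ4", "umi_count": 3},
    {"sequence_id": "s3", "cdr3_nt": "TGTNNGAGA",
     "v_gene": "IGHV1-69", "j_gene": "IGHJ6", "umi_count": 1},
]
buffer = io.StringIO()          # stands in for an open .tsv file
written = write_input_table(rows, buffer)
print(buffer.getvalue())
```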

Protocol 3.2: Executing HILARy Clustering

Objective: To cluster preprocessed sequences into clonal families.

Software: Implement HILARy via custom scripts (Python/R) or within platforms like scirpy (for single-cell TCR data).

Methodology:

Distance Matrix Computation:
- For all sequence pairs within the same sample that share identical V and J gene assignments, calculate the normalized nucleotide distance between their CDR3s.
- Store results in a pairwise distance matrix. Pairs with different V/J genes are assigned an infinite distance.
Hierarchical Clustering:
- Apply agglomerative hierarchical clustering (e.g., scipy.cluster.hierarchy.linkage) using the precomputed distance matrix and a specified linkage method (average recommended).
Cluster Formation:
- Cut the resulting dendrogram at the specified nucleotide distance threshold (e.g., scipy.cluster.hierarchy.fcluster with criterion="distance" in Python, or cutree on an hclust object in R).
- Assign each sequence to a clonal family (cluster) ID.
Output: A table mapping each sequence ID to its clonal family ID, along with cluster size and summary statistics.
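
The clustering steps above can be sketched with SciPy as follows. This is an illustrative implementation of the protocol as described, not the HILARy reference code. Two choices are assumptions: sequences are partitioned by V gene, J gene, and CDR3 length so that a normalized Hamming distance is defined (Levenshtein distance would be needed to compare unequal lengths), and each partition is clustered separately rather than assigning infinite distances, because SciPy's linkage rejects non-finite values.

```python
from collections import defaultdict

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def normalized_hamming(a, b):
    """Fraction of mismatching positions between equal-length sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def cluster_clones(seqs, threshold=0.15, method="average"):
    """Map sequence ID -> clonal family ID.

    seqs: list of (seq_id, v_gene, j_gene, cdr3_nt) tuples.
    """
    groups = defaultdict(list)
    for seq_id, v_gene, j_gene, cdr3 in seqs:
        groups[(v_gene, j_gene, len(cdr3))].append((seq_id, cdr3))

    assignments, next_id = {}, 0
    for key in sorted(groups):
        members = groups[key]
        if len(members) == 1:               # singleton partition
            assignments[members[0][0]] = next_id
            next_id += 1
            continue
        n = len(members)
        dist = np.zeros((n, n))
        for a in range(n):
            for b in range(a + 1, n):
                d = normalized_hamming(members[a][1], members[b][1])
                dist[a, b] = dist[b, a] = d
        tree = linkage(squareform(dist), method=method)
        labels = fcluster(tree, t=threshold, criterion="distance")
        for (seq_id, _), label in zip(members, labels):
            assignments[seq_id] = next_id + int(label) - 1
        next_id += int(labels.max())
    return assignments

seqs = [
    ("s1", "IGHV3-23", "IGHJ4", "TGTGCGAGAGATTACTGG"),
    ("s2", "IGHV3-23", "IGHJ4", "TGTGCGAGAGATTATTGG"),  # 1 mismatch vs s1
    ("s3", "IGHV3-23", "IGHJ4", "TGTGCGAAACCCGGGTGG"),  # 7 mismatches vs s1
    ("s4", "IGHV1-69", "IGHJ6", "TGTGCGAGAGATTACTGG"),  # different V/J
]
families = cluster_clones(seqs)
print(families)
```

With the 0.15 threshold, s1 and s2 fall into one family, while s3 (same genes, distant CDR3) and s4 (different genes) each form their own.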

Protocol 3.3: Validation and Downstream Analysis

Objective: To validate clustering results and perform biological interpretation.

Methodology:

Internal Validation: Calculate cluster statistics (size distribution, mean intra-cluster distance, silhouette score) to assess clustering quality.
Lineage Tree Reconstruction: For each large cluster, use tools like IgPhyML or dnaml to infer a maximum-likelihood phylogenetic tree from the aligned V(D)J nucleotide sequences (the CDR3 alone usually carries too few informative sites). This visualizes somatic hypermutation pathways.
Convergence Analysis: Compare CDR3 amino acid sequences across clusters from different samples/individuals to identify public clones or convergent responses.
Phenotype Integration: (For single-cell data) Overlay clonal assignment onto UMAP/t-SNE plots and correlate with transcriptional clusters or cell surface protein expression.

Visualizations

Title: HILARy Algorithm Workflow

Title: Hierarchical Clustering & Dendrogram Cutting

The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Tools & Reagents for HILARy-Based Research

Item	Function / Description	Example / Provider
AIRR-seq Library Prep Kit	Enables cDNA synthesis, multiplex PCR amplification of V(D)J regions, and addition of sequencing adapters/UMIs.	Illumina TCR/BCR Solutions, Takara Bio SMARTer Human V(D)J, 10x Genomics Single Cell Immune Profiling
High-Fidelity DNA Polymerase	Critical for accurate amplification with minimal PCR errors that could be mistaken for somatic mutations.	KAPA HiFi HotStart, Q5 High-Fidelity (NEB)
UMIs (Unique Molecular Identifiers)	Short random nucleotides added to each transcript during cDNA synthesis to correct for PCR amplification bias and sequencing errors.	Integrated into commercial library prep kits.
IMGT Database	The international reference for immunoglobulin and T-cell receptor germline gene sequences. Essential for V(D)J assignment.	https://www.imgt.org/
IgBLAST Software	The standard tool for aligning sequence reads to germline V, D, J genes and identifying the CDR3 region.	NCBI https://ncbi.github.io/igblast/
Immcantation Framework	A comprehensive suite of open-source software (pRESTO, Change-O, IgPhyML) for AIRR-seq data analysis from start to finish.	https://immcantation.readthedocs.io/
Scirpy Package	A scalable Python toolkit for analyzing single-cell TCR and BCR data, including clustering and integrative analysis.	https://scirpy.readthedocs.io/
High-Performance Computing (HPC) Cluster	Necessary for processing large-scale repertoire datasets (millions of sequences) and performing intensive phylogenetic calculations.	Local institutional HPC or cloud services (AWS, Google Cloud).

In the context of a thesis on HILARy clonal family inference from B-cell receptor repertoire sequencing (RepSeq) data, clustering is a foundational step. The accurate grouping of nucleotide or amino acid sequences into clonal families (descendants of a common progenitor B cell) is paramount for understanding adaptive immune responses, identifying disease correlates, and informing therapeutic antibody discovery. The fidelity of this clustering hinges critically on two algorithmic parameters: the distance threshold (the maximum dissimilarity for sequences to be grouped) and the linkage criterion (the rule defining the distance between clusters). This document provides application notes and protocols for empirically determining these parameters to achieve optimal, biologically relevant clustering.

Core Quantitative Metrics & Performance Benchmarks

The following table summarizes key performance metrics and common parameter ranges derived from current literature in BCR clonal clustering.

Table 1: Common Clustering Metrics, Parameters, and Their Interpretations

Metric/Parameter	Typical Range/Value	Description & Impact on Clustering
Hamming Distance Threshold	Nucleotide: 0.10 - 0.15; Amino Acid: 0.20 - 0.30	Maximum allowed normalized mismatch. Lower values increase specificity (reduce false mergers) but risk splitting true families. V(D)J mutation patterns guide selection.
Linkage Criteria	Single, Complete, Average, Ward	Single: Chain-sensitive, merges clusters based on nearest neighbors. Prone to chaining. Complete: Conservative, uses farthest neighbors. Produces compact clusters. Average: Balanced compromise. Often recommended for RepSeq. Ward: Minimizes within-cluster variance. Can be sensitive to outliers.
Calinski-Harabasz Index	Higher is better.	Ratio of between-cluster dispersion to within-cluster dispersion. Used to compare clustering quality across different parameter sets.
Average Silhouette Score	-1 to +1 (Closer to +1 is better)	Measures how similar an object is to its own cluster compared to other clusters. Useful for validating threshold choice.
Cluster Purity (vs. Ground Truth)	0.0 - 1.0	If a known ground truth (e.g., spike-in clones) exists, measures the fraction of correctly assigned sequences in each cluster.
Number of Inferred Clones	Varies by sample depth & diversity	The primary output count. Should be stable across reasonable parameter perturbations. Extreme sensitivity indicates overfitting.

Experimental Protocol: Determining Optimal Distance & Linkage

This protocol outlines a systematic approach for parameter tuning using a combination of internal validation metrics and, where available, biological validation.

Protocol Title: Empirical Optimization of Clustering Parameters for BCR Clonal Inference

Objective: To determine the optimal combination of sequence distance threshold and linkage criterion for hierarchical agglomerative clustering of BCR RepSeq data that yields biologically plausible clonal families.

Materials & Input Data:

Pre-processed BCR sequencing data (VDJ junctions or full-length sequences, aligned and corrected).
A computational environment with clustering libraries (e.g., scipy, sklearn in Python).
(Optional) Known spike-in control sequences or paired heavy-light chain data for validation.

Procedure:

Data Preparation:
- For the region of interest (e.g., CDR3+V-region), compute an all-vs-all pairwise distance matrix. Common metrics include Hamming distance (for nucleotides) or Levenshtein distance (accounting for indels).
- Normalize distances by sequence length.
Parameter Grid Definition:
- Define a grid of values to test.
  - Distance Threshold (dt): e.g., [0.05, 0.10, 0.15, 0.20, 0.25, 0.30] for nucleotides.
  - Linkage Criteria (lc): ['single', 'complete', 'average'].
Clustering & Internal Validation Loop:
- For each (dt, lc) combination in the grid:
  - Perform hierarchical agglomerative clustering on the distance matrix using linkage method lc.
  - Cut the resulting dendrogram at the distance threshold dt to form flat clusters.
  - Calculate internal validation metrics (e.g., Calinski-Harabasz Index, Average Silhouette Score) for the resulting clustering. Note the total number of clusters.
Analysis & Primary Selection:
- Plot the validation metrics against the distance threshold for each linkage method.
- Identify the "elbow" or plateau region in the Calinski-Harabasz curve and the peak in the Silhouette score. The threshold within this stable region is a candidate for optimality.
- Assess the sensitivity of cluster count to small changes in dt; prefer a stable region.
Biological Validation (If Possible):
- Using Paired Heavy-Light Chain Data: Under the HILARy thesis framework, the independent clustering of heavy and light chains followed by pairing provides a powerful validation. The optimal parameters should maximize the concordance where heavy and light chains from the same single cell fall into clusters that are frequently paired.
- Using Spike-in Controls: If control clones are known, calculate cluster purity and completeness for each parameter set.
- Select the (dt, lc) combination that maximizes biological validity metrics.
Final Application & Reporting:
- Apply the selected optimal parameters to the full dataset.
- Report the chosen parameters, the validation metrics that led to their selection, and the final cluster statistics.
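
The parameter grid loop can be sketched as below. This is a minimal illustration under stated assumptions: the distance matrix is a toy example, and only the silhouette score is computed (implemented directly from a precomputed distance matrix) because scikit-learn's Calinski-Harabasz implementation expects feature vectors rather than pairwise distances. Singleton clusters receive a silhouette of 0 by convention, and partitions with one cluster or all singletons are reported as NaN.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def mean_silhouette(dist, labels):
    """Average silhouette score from a square distance matrix."""
    labels = np.asarray(labels)
    unique = np.unique(labels)
    if len(unique) < 2 or len(unique) == len(labels):
        return float("nan")              # undefined for trivial partitions
    scores = []
    for i, own in enumerate(labels):
        same = labels == own
        if same.sum() == 1:
            scores.append(0.0)           # convention for singleton clusters
            continue
        a = dist[i, same].sum() / (same.sum() - 1)   # mean intra-cluster distance
        b = min(dist[i, labels == other].mean()      # nearest other cluster
                for other in unique if other != own)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return float(np.mean(scores))

def grid_search(dist, thresholds, methods):
    """Evaluate every (linkage, threshold) pair on one distance matrix."""
    condensed = squareform(dist)
    results = []
    for method in methods:
        tree = linkage(condensed, method=method)
        for dt in thresholds:
            labels = fcluster(tree, t=dt, criterion="distance")
            results.append({"linkage": method, "threshold": dt,
                            "n_clusters": int(labels.max()),
                            "silhouette": mean_silhouette(dist, labels)})
    return results

# Toy matrix: two tight families (0.05 apart within, 0.5 apart between).
dist = np.full((6, 6), 0.5)
dist[:3, :3] = 0.05
dist[3:, 3:] = 0.05
np.fill_diagonal(dist, 0.0)

results = grid_search(dist, [0.02, 0.10, 0.30, 0.60], ["single", "complete", "average"])
valid = [r for r in results if not np.isnan(r["silhouette"])]
best = max(valid, key=lambda r: r["silhouette"])
print(best)
```

On real data the distance matrix would come from the Data Preparation step, and the cluster count should be inspected alongside the silhouette score to check stability across neighboring thresholds.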

The Scientist's Toolkit: Research Reagent Solutions

Table 2: Essential Tools and Reagents for Clonal Clustering Research

Item	Function in Clustering Workflow
UMI (Unique Molecular Identifier)-based RepSeq Kit (e.g., 10x Genomics 5' VDJ, SMARTer)	Reduces PCR and sequencing errors, providing accurate consensus sequences which form the reliable input for distance calculation.
Alignment & Annotation Tool (e.g., IgBLAST, MiXCR)	Annotates V, D, J genes and CDR3 regions, enabling focused distance calculation on the relevant, hypervariable segments.
High-Performance Computing (HPC) Cluster or Cloud Instance	Essential for computing large, all-vs-all distance matrices (O(n²) complexity) for deep repertoires (>100,000 sequences).
Python/R Libraries (scipy.cluster.hierarchy, scikit-learn, R's stats::hclust)	Provide optimized implementations of hierarchical clustering, distance metrics, and validation indices.
Synthetic BCR Control Libraries (Spike-ins)	Provide ground truth clonal lineages for benchmarking and tuning clustering parameters in a specific experimental setup.
Single-Cell BCR Sequencing Data	Serves as the gold-standard validation tool. The pairing of heavy and light chains from the same cell validates clonal family inferences made from bulk data.

Visualization of the Parameter Optimization Workflow

Diagram Title: Parameter Tuning Workflow for Clonal Clustering

Visualization of Linkage Criteria Impact on Cluster Formation

Diagram Title: Linkage Criteria Impact on Cluster Shape

This document provides detailed application notes and protocols for the downstream analysis of B cell or T cell receptor (BCR/TCR) clonal families, specifically following their inference using the HILARy framework. The HILARy method enables the accurate grouping of repertoire sequencing reads into clonally related families originating from a common progenitor. The subsequent analysis of these families, through lineage tree reconstruction and mutation pattern dissection, is critical for understanding adaptive immune responses and the dynamics of somatic hypermutation (SHM), and for informing vaccine and therapeutic antibody development.

Key Analytical Objectives

The primary downstream objectives are:

Lineage Tree Reconstruction: To infer the phylogenetic relationship between sequences within a clonal family, depicting the hypothesized evolutionary history from a common unmutated ancestor.
Mutation Pattern Analysis: To quantify and characterize the nature of somatic mutations, including rates, spectra (transition/transversion biases), and targeting motifs (e.g., WRCH/DGYW hotspots).
Selection Pressure Analysis: To assess the evidence for antigen-driven selection within the clonal family by comparing observed replacement (R) and silent (S) mutation ratios in framework (FWR) and complementarity-determining regions (CDR).

Table 1: Core Quantitative Outputs from Downstream Clonal Analysis

Metric	Description	Typical Value/Range	Interpretation
Clonal Family Size	Number of unique sequences in the family	2 - 10⁴+	Indicates proliferation burst size.
Tree Depth	Maximum number of mutations from root to leaf	1 - 30+ SHMs	Reflects temporal extent or mutation intensity.
Tree Imbalance	Degree and symmetry of branching (e.g., star-like vs. linear)	Measured via Sackin index	Informs on synchronous vs. asynchronous expansion.
SHM Rate	Mutations per base pair in V region	~10⁻³ to 10⁻²	Overall mutation load.
Transition:Transversion (Ts:Tv) Ratio	Ratio of transitions (purine↔purine or pyrimidine↔pyrimidine) to transversions (purine↔pyrimidine)	~2.0 - 3.0 in mammals	Reflects biochemical bias of AID enzyme.
R/S Ratio (CDR)	Replacement to Silent mutation ratio in CDRs	Often >2.9	Suggests positive antigenic selection.
R/S Ratio (FWR)	Replacement to Silent mutation ratio in FWRs	Often <1.5	Suggests purifying/structural selection.
Focusing Factor	(R/S)CDR / (R/S)FWR	>1 indicates selection	Quantifies strength of antigen-driven selection.

Table 2: Essential Research Reagent Solutions & Tools

Item	Function	Example Product/Software
Multiple Sequence Alignment Tool	Aligns nucleotide sequences of clonal members.	Clustal Omega, MAFFT, IgBLAST.
Germline V/D/J Reference	Provides inferred unmutated ancestor sequence.	IMGT/GENE-DB, IgBLAST database.
Lineage Tree Building Algorithm	Reconstructs phylogenetic trees from aligned sequences.	dnaml (PHYLIP), IgPhyML, RAxML, neighbor-joining.
SHM Analysis Suite	Quantifies mutations, spectra, and hotspots.	Change-O, SHazaM, Immcantation framework.
Tree Visualization Software	Renders and annotates lineage trees.	ggtree (R), ETE Toolkit, FigTree.
High-Fidelity Polymerase	For accurate amplification during library prep.	KAPA HiFi, Q5.
UMI-labeled RT Primers	For consensus sequencing to reduce PCR errors.	Custom-designed primers.

Experimental Protocols

Protocol 4.1: Lineage Tree Reconstruction from a HILARy-Inferred Clonal Family

Objective: To generate a rooted phylogenetic tree depicting the somatic evolution of a B cell clone.

Materials:

Output from HILARy pipeline (a FASTA file of nucleotide sequences for a single clonal family).
IMGT/GENE-DB reference sequences.
Computing environment with IgBLAST, PHYLIP, and R installed.

Methodology:

Germline Reconstruction: For the clonal family FASTA, use IgBLAST with the -germline_db_V option against the IMGT database to identify the most likely germline V, D, and J genes. Use a tool like Change-O CreateGermlines.py to reconstruct the inferred, unmutated ancestral sequence.
Sequence Alignment: Add the inferred germline sequence to the FASTA file. Perform a multiple sequence alignment (MSA) using MAFFT (mafft --auto input.fasta > aligned.fasta). For BCRs, ensure alignment is codon-aware.
Tree Building: Use the aligned sequences (including germline as outgroup) to build a tree.
- Option A (Maximum Likelihood - Recommended): Use IgPhyML (specialized for Ig sequences) or RAxML with a nucleotide substitution model (e.g., GTR+G).
- Option B (Distance-based): Calculate a distance matrix (e.g., p-distance) and construct a tree via neighbor-joining using PHYLIP's dnadist and neighbor.
Rooting: Root the resulting tree using the inferred germline sequence as the explicit outgroup (the common ancestor).
Visualization & Annotation: Import the rooted tree file (Newick format) into R using the ggtree package. Annotate nodes by mutation count, and highlight sequences with shared mutations.
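
For the distance-based option, the p-distance is simply the proportion of differing sites between two aligned sequences. The sketch below computes it for every pair in a small alignment, skipping columns where either sequence has a gap; the gap handling and the function names are assumptions, and PHYLIP's dnadist offers additional substitution models.

```python
from itertools import combinations

def p_distance(a, b, gap="-"):
    """Proportion of differing sites, ignoring columns with a gap in either sequence."""
    pairs = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not pairs:
        return float("nan")
    return sum(x != y for x, y in pairs) / len(pairs)

def distance_table(alignment):
    """Return {(name_a, name_b): p-distance} for all pairs in an alignment dict."""
    return {
        (name_a, name_b): p_distance(alignment[name_a], alignment[name_b])
        for name_a, name_b in combinations(alignment, 2)
    }

# Toy alignment: the inferred germline plus two clonal members.
alignment = {
    "germline": "ATGCATGCAT",
    "seq1":     "ATGCATGAAT",
    "seq2":     "ATG-ATGAAT",
}
table = distance_table(alignment)
for pair, d in table.items():
    print(pair, round(d, 3))
```

The distance of each sequence to the germline also gives a quick per-tip mutation fraction that can be used when annotating the tree.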

Protocol 4.2: Analysis of Somatic Hypermutation Patterns

Objective: To characterize the type, distribution, and selection pressure of somatic mutations.

Materials:

A clonal family alignment and its rooted lineage tree.
Annotation of CDR/FWR boundaries (from IMGT numbering).
R with SHazaM and dplyr packages.

Methodology:

Mutation Calling: Using the reconstructed germline and the aligned sequences, create a mutation map. The shazam function observedMutations calculates the number of R and S mutations per sequence.
Mutation Spectrum: Tabulate the count of each type of substitution (A>G, A>C, A>T, etc.). Calculate the overall Transition (Ts: A<>G, C<>T) to Transversion (Tv: all others) ratio.
Targeting Motif Analysis: For each mutation, extract the 5-nucleotide context (e.g., the WRC motif on the positive strand where AID preferentially acts). Use shazam to calculate the observed vs. expected mutation frequency in known hotspot motifs.
Selection Pressure (BASELINe): This is the gold-standard method.
- Use the calcBaseline function in shazam to model the expected mutational probability for each CDR and FWR region based on the sequence's nucleotide content and mutability model (e.g., S5F).
- Compare the observed R/S distributions to these expected null distributions to generate a Bayesian posterior distribution for selection strength.
- A positive selection score (CDR) indicates antigen-driven selection; a negative score (FWR) indicates purifying selection.
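
The mutation spectrum and Ts:Tv calculation can be illustrated without any external package. This sketch compares one observed sequence against its germline position by position; it assumes the two sequences are already aligned and of equal length, and it does not replace SHazaM's replacement/silent classification or BASELINe.

```python
from collections import Counter

PURINES = {"A", "G"}

def mutation_spectrum(germline, observed):
    """Count substitutions as 'X>Y' relative to the germline sequence."""
    spectrum = Counter()
    for g, o in zip(germline.upper(), observed.upper()):
        if g != o and g in "ACGT" and o in "ACGT":
            spectrum[f"{g}>{o}"] += 1
    return spectrum

def ts_tv_ratio(spectrum):
    """Transitions (A<>G, C<>T) divided by transversions (all other changes)."""
    ts = sum(n for mut, n in spectrum.items()
             if (mut[0] in PURINES) == (mut[2] in PURINES))
    tv = sum(spectrum.values()) - ts
    if tv == 0:
        return float("inf") if ts else float("nan")
    return ts / tv

germline = "AGCTAGCTAC"
observed = "GGCTAACTTC"   # A>G, G>A (transitions) and A>T (transversion)
spectrum = mutation_spectrum(germline, observed)
print(dict(spectrum), ts_tv_ratio(spectrum))
```

Summing the spectra of all sequences in a clonal family before taking the ratio gives a more stable estimate than per-sequence ratios.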

Visualizations

Lineage Tree Reconstruction Workflow

Example Annotated B Cell Lineage Tree

This Application Note details protocols for analyzing B cell receptor (BCR) repertoire sequencing data to track antigen-specific lineages and identify pathogenic clones. These methods are framed within the broader thesis of HILARy clonal family inference. HILARy provides a statistical framework for accurately grouping BCR sequences into clonal families based on V(D)J gene usage and junctional homology, which is the critical first step for downstream applications in vaccinology and autoimmunity research.

Application Note: Tracking Vaccine-Specific B Cell Lineages

Following vaccination, B cells recognizing the vaccine antigen undergo clonal expansion and somatic hypermutation. Tracking these lineages over time allows researchers to quantify the breadth, depth, and maturation of the humoral immune response.

Key Quantitative Findings from Recent Studies (2023-2024)

Table 1: Vaccine-Specific B Cell Lineage Dynamics

Parameter	Influenza mRNA Vaccine (Study A)	SARS-CoV-2 Booster (Study B)	RSV Pre-F Vaccine (Study C)
Time to Peak Lineage Expansion	7-10 days post-vaccination	14 days post-booster	10-12 days post-vaccination
Avg. Clonal Family Size (Peak)	45 sequences	120 sequences	28 sequences
Avg. Lineage Mutation Rate (SHM)	8.2%	6.5%	5.1%
Persistence (>6 months)	12% of expanded lineages	25% of expanded lineages	Data pending
Cross-Reactive Lineages	35% showed binding to historical strains	15% neutralized XBB.1.5 variant	60% bound both A & B RSV strains

Detailed Protocol: Enrichment and Sequencing of Antigen-Specific B Cells

Protocol 2.3.1: Antigen-Specific B Cell Sorting and BCR-Seq

Objective: To isolate vaccine-antigen binding B cells and obtain paired heavy-light chain BCR sequences.

Materials:

PBMCs or lymphoid tissue from vaccinated subjects (pre-vax, day 7, day 14, day 28+).
Biotinylated vaccine antigen (e.g., spike protein, HA protein).
Fluorescent Streptavidin & B cell phenotyping antibodies (CD19, CD20, CD27, CD38, IgD).
Fluorescence-Activated Cell Sorter (FACS).
Single-cell RT-PCR kit for full-length V(D)J amplification (e.g., SMARTer Human BCR).
High-throughput sequencer (Illumina MiSeq/NextSeq).

Procedure:

Staining: Stain 10-20 million PBMCs with biotinylated antigen, followed by fluorescent streptavidin and phenotyping antibodies.
Sorting: Use FACS to sort single, live, antigen⁺ memory B cells (CD19⁺CD20⁺IgD⁻CD27⁺) and plasmablasts (CD19⁺CD20ˡᵒʷ/⁻CD27⁺⁺CD38⁺⁺).
Library Prep: Perform single-cell lysis and reverse transcription. Amplify full-length heavy chain (e.g., IgG) and paired light chain transcripts using V gene primer sets and template-switching.
Sequencing: Pool libraries and sequence on a 2x300bp MiSeq run to achieve high-quality, full-length coverage.

Detailed Protocol: HILARy-Based Clonal Lineage Inference

Protocol 2.4.1: Constructing Clonal Families from Sorted BCR-Seq Data

Objective: To apply the HILARy framework for accurate clonal grouping and lineage tree construction.

Procedure:

Pre-processing: Use tools like pRESTO and Change-O for demultiplexing, quality filtering, and V(D)J assignment (IgBLAST).
Clonal Grouping (HILARy Core): a. Define initial clusters by identical IGHV and IGHJ genes and CDR3 nucleotide length. b. Calculate pairwise distances between CDR3 regions within each cluster using a modified Hamming distance. c. Apply a hierarchical clustering algorithm with a dynamic threshold that accounts for sequencing error and SHM. d. Merge clusters if the median distance between them is below the empirically derived threshold (typical range: 0.10-0.15).
Lineage Tree Construction: For each clonal family, align sequences (MAFFT) and construct a maximum-likelihood phylogenetic tree (IgPhyML) to model SHM and selection.
Downstream Analysis: Annotate trees with time points, calculate SHM rates, and identify convergent antibody sequences across donors.
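The clonal grouping step of this procedure can be sketched in plain Python. This is a simplified illustration, not the HILARy implementation: it bins sequences by V gene, J gene, and junction length, then applies single-linkage clustering at a fixed normalized Hamming threshold (0.12 here, chosen from the 0.10-0.15 range quoted above), rather than the dynamic threshold and median-distance merge described in the protocol. Field names follow the AIRR convention (v_call, j_call, junction); the function names and toy records are hypothetical.

```python
from collections import defaultdict


def normalized_hamming(a, b):
    """Fraction of mismatching positions between two equal-length, non-empty strings."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def group_clones(records, threshold=0.12):
    """Assign a clone index to each record (dicts with v_call, j_call, junction)."""
    # Step a: bin by V gene, J gene and junction length.
    bins = defaultdict(list)
    for idx, rec in enumerate(records):
        bins[(rec["v_call"], rec["j_call"], len(rec["junction"]))].append(idx)

    # Union-find structure used to merge linked sequences (single linkage).
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    # Steps b-c: pairwise distances within each bin, linking pairs under the threshold.
    for members in bins.values():
        for pos, i in enumerate(members):
            for j in members[pos + 1:]:
                d = normalized_hamming(records[i]["junction"], records[j]["junction"])
                if d <= threshold:
                    parent[find(i)] = find(j)

    # Relabel union-find roots as consecutive clone indices.
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(len(records))]


records = [
    {"v_call": "IGHV1-69", "j_call": "IGHJ4", "junction": "TGTGCGAGAGATTACTGG"},
    {"v_call": "IGHV1-69", "j_call": "IGHJ4", "junction": "TGTGCGAGAGATTATTGG"},
    {"v_call": "IGHV1-69", "j_call": "IGHJ4", "junction": "TGTGCGAAACCCGGGTGG"},
    {"v_call": "IGHV3-23", "j_call": "IGHJ4", "junction": "TGTGCGAGAGATTACTGG"},
]
print(group_clones(records))
```

The first two records differ at one of 18 junction positions and are grouped together; the third is too divergent, and the fourth is kept apart because it uses a different V gene even though its junction is identical to the first.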

Title: Workflow for Tracking Vaccine-Specific B Cell Lineages

Application Note: Identifying Pathogenic Clones in Autoimmunity

In autoimmune conditions like lupus (SLE) and rheumatoid arthritis (RA), self-reactive B cell clones escape tolerance. Identifying these pathogenic clones from bulk repertoire data is crucial for understanding disease mechanisms and developing targeted therapies.

Key Quantitative Findings from Recent Studies (2023-2024)

Table 2: Pathogenic B Cell Clones in Autoimmunity

Characteristic	Systemic Lupus Erythematosus	Rheumatoid Arthritis (Anti-Citrullinated Protein)	Multiple Sclerosis
Typical Enrichment in Tissue	Kidney (Lupus Nephritis): 5-15x vs blood	Synovium: 20-50x vs paired blood	CSF: 10-30x vs paired blood
Avg. SHM in Pathogenic Clones	11.5%	9.8%	8.2%
Clonal Family Size	Large, often >100 sequences	Moderate, 20-80 sequences	Variable, often expanded in CSF
Recurrent V Gene Usage	IGHV4-34 (anti-dsDNA)	IGHV1-69/IGHV4-39 (anti-CCP)	IGHV4-34, IGHV3-15
Evidence of Antigen Drive	Strong (R/S ratio >3 in CDR)	Strong (R/S ratio >2.8 in CDR)	Moderate (R/S ratio ~2.5)

Detailed Protocol: Identifying Tissue-Restricted Pathogenic Clones

Protocol 3.3.1: Paired Tissue-Blood Repertoire Profiling and Analysis

Objective: To identify clones expanded in diseased tissue compared to autologous blood, suggesting local antigen drive.

Materials:

Paired samples: Diseased tissue (e.g., kidney biopsy, synovial fluid) and peripheral blood.
Single-cell RNA-seq (scRNA-seq) kit with V(D)J enrichment (e.g., 10x Genomics 5' V(D)J).
Bioinformatic pipelines: Cell Ranger V(D)J, Seurat.

Procedure:

Sample Processing: Generate single-cell suspensions from tissue and blood. Isolate live CD19⁺ B cells.
Library Construction: Use 10x Genomics 5' gene expression with V(D)J enrichment kit according to manufacturer protocol.
Sequencing: Sequence libraries to a depth of ~50,000 reads per cell for gene expression and full coverage for BCR.
Integrated Analysis: a. Process data with Cell Ranger V(D)J and integrate gene expression (GEX) and BCR data using Seurat. b. Apply HILARy framework separately to tissue and blood BCR data to define clonal families. c. Identify tissue-restricted clones: Calculate a tissue enrichment score: (Clone size in tissue / Total tissue B cells) / (Clone size in blood / Total blood B cells). Clones with a score >10 and absolute presence >5 cells in tissue are flagged. d. Correlate clone phenotype via GEX: e.g., expression of pathogenic markers (e.g., TNF, IL6, ITGAX for age-associated B cells).
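The tissue enrichment score in step c can be computed as in the following sketch. The formula and the cut-offs (score >10, more than 5 cells in tissue) are taken from the protocol; the pseudocount used when a clone is absent from blood is an assumption added here to avoid division by zero, and the function names and toy counts are hypothetical.

```python
def tissue_enrichment(tissue_size, tissue_total, blood_size, blood_total, pseudocount=0.5):
    """(clone size in tissue / total tissue B cells) / (clone size in blood / total blood B cells)."""
    tissue_freq = tissue_size / tissue_total
    # Assumption: clones unseen in blood are given a pseudocount instead of zero.
    blood_freq = max(blood_size, pseudocount) / blood_total
    return tissue_freq / blood_freq


def flag_tissue_restricted(clone_sizes, tissue_total, blood_total, min_score=10, min_cells=5):
    """clone_sizes maps clone_id -> (cells in tissue, cells in blood)."""
    flagged = []
    for clone_id, (n_tissue, n_blood) in clone_sizes.items():
        score = tissue_enrichment(n_tissue, tissue_total, n_blood, blood_total)
        if score > min_score and n_tissue > min_cells:
            flagged.append((clone_id, score))
    return flagged


# Toy counts: 1,000 tissue B cells and 5,000 blood B cells in total.
clone_sizes = {"clone_A": (40, 2), "clone_B": (6, 30), "clone_C": (3, 0)}
print(flag_tissue_restricted(clone_sizes, tissue_total=1000, blood_total=5000))
```

Here clone_A is flagged (score of about 100), clone_B is equally frequent in both compartments, and clone_C has a high score only because of the pseudocount and is rejected by the minimum cell requirement, which is the reason the protocol pairs the score with an absolute presence filter.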

Detailed Protocol: Functional Validation of Pathogenicity

Protocol 3.4.2: Recombinant Antibody Expression and Autoreactivity Testing

Objective: To confirm the autoreactivity of BCR sequences identified from pathogenic clonal families.

Procedure:

Clone Selection: Select 2-3 dominant sequences from each candidate pathogenic clonal family.
Recombinant Expression: Synthesize genes for heavy and light chain variable regions. Clone into human IgG1/kappa expression vectors. Co-transfect Expi293F cells using ExpiFectamine.
Purification: Harvest supernatant after 5 days. Purify IgG using Protein A affinity chromatography.
Binding Assays:
- ELISA: Test binding to candidate autoantigens (e.g., dsDNA, citrullinated peptides, myelin basic protein).
- Immunofluorescence: On HEp-2 cells or primary tissue sections.
Functional Assays: Test for complement deposition (C1q binding assay) or stimulation of reporter cells (e.g., NF-κB activation in HEK-Blue TLR9 cells by DNA-immune complexes).

Title: Workflow for Identifying Pathogenic Clones in Autoimmunity

The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Reagents and Tools for B Cell Lineage & Pathogenic Clone Studies

Item	Function/Application	Example Product/Catalog
Biotinylated Antigens	Label antigen-specific B cells for FACS sorting. Critical for vaccine studies.	SARS-CoV-2 S-2P Trimer (Acro Biosystems); HA Proteins (Sino Biological)
Single-Cell BCR Amplification Kit	Amplify paired heavy/light chains from single sorted B cells.	SMARTer Human BCR Profiling Kit (Takara Bio)
10x Genomics 5' V(D)J + GEX Kit	Integrated single-cell gene expression and V(D)J sequencing from tissue.	10x Genomics Chromium Next GEM Single Cell 5'
Human IgG Expression Vector	For recombinant expression of candidate pathogenic or vaccine-derived antibodies.	pFUSEss-CHIg-hG1, pFUSE2ss-CLIg-hk (InvivoGen)
Expi293 Expression System	High-yield transient expression of recombinant antibodies for validation.	Expi293F Cells & ExpiFectamine (Thermo Fisher)
HILARy-Compatible Software	Bioinformatic pipeline for robust clonal family inference.	Custom R/Python scripts implementing HILARy algorithm (available on GitHub)
IgPhyML	Phylogenetic software designed for modeling B cell lineage trees with SHM.	IgPhyML (open source)

Overcoming Common Pitfalls: Best Practices for Optimizing HILARy Analysis and Data Quality

Within the broader thesis on High-Integrity Lymphocyte Antigen Receptor (HILARy) clonal family inference from adaptive immune receptor repertoire sequencing (AIRR-Seq), accurate delineation of clonally related B or T cell sequences is paramount. This process fundamentally relies on clustering nucleotide sequences derived from common progenitor lymphocytes. Two major technical artifacts, sequencing errors and PCR duplicates, severely distort the biological signal: errors cause over-fragmentation (false clusters that split true families), while duplicates inflate the apparent size and frequency of clonal families. This application note details their impacts and provides implementable protocols to ensure high-fidelity HILARy clonal inference for research and therapeutic discovery.

Quantitative Impact on Clustering Accuracy

Table 1: Impact of Artifacts on Clonal Clustering Metrics

Artifact	Primary Effect on Clustering	Typical Error Rate/Effect Size	Impact on Inferred Clonal Frequency
PCR Duplicates	Inflated clone sizes; read counts overstate the number of unique molecules.	Can constitute 20-80% of raw reads, depending on protocol.	Can inflate frequency of dominant clones by >10-fold, skewing diversity indices.
Sequencing Errors (Substitutions)	Over-fragmentation; creates artificial diversity.	~0.1-1% per base (NGS platforms).	Creates low-frequency "phantom" clones, artificially increases richness.
Indel Errors (especially in CDR3)	Severe over-fragmentation; disrupts reading frame.	~0.01-0.1% per base, but impact is catastrophic.	Splits true clones into multiple, erroneous small families.
Chimeric PCR Products	Creates false, hybrid sequences.	Typically 0.5-2% of reads in multiplex PCR.	Generates biologically implausible clusters, confounding lineage analysis.

Table 2: Comparative Performance of Correction Strategies

Strategy/Method	Key Principle	Duplex Consensus Required?	Estimated Clustering Accuracy Recovery	Computational Demand
Unique Molecular Identifiers (UMI) with network-based correction	Deduplication via UMI sequence tags.	Yes (optimal)	>95% (for duplicates)	High
UMI with simple clustering	Basic UMI group consensus.	No	~85-90%	Medium
Read-based deduplication	Identical nucleotide sequence merging.	No	Handles duplicates only; 0% error correction	Low
Statistical error correction (e.g., Martin's Algorithm)	Expectation-maximization on aligned reads.	No	~80-90% (for errors)	Medium-High
Hybrid: UMI + Statistical Correction	Combines both approaches.	Yes	>95% (for both artifacts)	Very High

Detailed Experimental Protocols

Protocol 2.1: UMI-Based Duplicate Removal and Error Correction for HILARy Inference

Objective: To generate high-fidelity, error-corrected consensus sequences for each original cDNA molecule prior to clonal clustering.

Materials: See "Research Reagent Solutions" table.

Workflow:

Library Preparation: Use a 5' RACE-based AIRR-Seq kit that incorporates double-stranded UMIs (e.g., 12bp randomers) during cDNA synthesis.
Sequencing: Perform paired-end sequencing (2x300bp MiSeq/2x150bp NextSeq) to ensure full coverage of the UMI, V-region primers, and the full CDR3.
Pre-processing:
- Truncate reads at first base with Q<30. Trim primer/constant region sequences.
- Extract and record UMI sequences from the read header or initial bases.
UMI Clustering & Consensus Building (Critical Step):
- Align all reads to a reference IG or TR locus using a lightweight aligner (e.g., minimap2).
- Group reads by (a) sample barcode, (b) gene primer ID, and (c) UMI sequence (allowing for a Hamming distance of 1-2 to account for UMI synthesis/PCR errors).
- For each UMI-group, perform a multiple sequence alignment of the variable region.
- Generate a duplex consensus: Require that mutations (relative to the majority) be present on both forward and reverse strands from different PCR products (evidenced by different paired-end UMIs) to be retained as a true variant. Otherwise, revert to the majority base.
- Output one consensus sequence per original UMI group. The count of unique UMI groups represents the corrected molecular count.
Output: A FASTA/FASTQ file of consensus sequences for downstream V(D)J assignment and clonal clustering.
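The UMI grouping and consensus step in this workflow can be sketched as follows. This is a deliberately reduced illustration: it merges UMIs within a Hamming distance of 1 into the most abundant neighbouring UMI and calls a simple per-column majority consensus. It does not implement the duplex (both-strand) requirement described above, and it assumes reads within a group are already aligned to equal length. Function names and toy reads are hypothetical.

```python
from collections import Counter, defaultdict


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def group_umis(reads, max_dist=1):
    """Group (umi, sequence) reads, folding rare UMIs into abundant ones within max_dist."""
    umi_counts = Counter(umi for umi, _ in reads)
    centers, assignment = [], {}
    # Visit UMIs from most to least abundant so that errors collapse onto true UMIs.
    for umi, _ in umi_counts.most_common():
        for center in centers:
            if len(center) == len(umi) and hamming(center, umi) <= max_dist:
                assignment[umi] = center
                break
        else:
            centers.append(umi)
            assignment[umi] = umi
    groups = defaultdict(list)
    for umi, seq in reads:
        groups[assignment[umi]].append(seq)
    return groups


def majority_consensus(seqs):
    """Per-column majority base over equal-length reads."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))


reads = (
    [("AACCGGTT", "ACGTACGT")] * 3
    + [("AACCGGTA", "ACGTACGT")]   # UMI carrying one synthesis/PCR error
    + [("AACCGGTT", "ACGAACGT")]   # read carrying one sequencing error
    + [("TTGGCCAA", "ACGTTCGT")] * 2
)
groups = group_umis(reads)
consensus = {umi: majority_consensus(seqs) for umi, seqs in groups.items()}
print(len(groups), consensus)
```

The number of groups (two in this toy example) is the corrected molecular count, and each group contributes exactly one consensus sequence to downstream V(D)J assignment.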

Protocol 2.2: Post-Sequencing Statistical Error Correction for Legacy Data

Objective: To correct sequencing errors in datasets lacking UMIs, enabling more accurate clustering.

Materials: pRESTO or USEARCH suite, high-performance computing node.

Workflow:

Initial Clustering: Cluster pre-processed reads at a permissive identity threshold (e.g., 96-98% nucleotide identity) using a greedy clustering algorithm (e.g., USEARCH -cluster_fast).
Build Multiple Sequence Alignment (MSA): For each cluster, perform a MSA (Muscle or MAFFT).
Error Model Application: Apply a probabilistic error model (e.g., a quality-weighted consensus such as the one built by pRESTO's BuildConsensus):
- For each column in the MSA, the base with the highest quality-score-weighted frequency is identified as the "true" base.
- Bases with low quality scores and low frequency in the column are deemed errors.
Generate Corrected Sequence: For each original sequence in the cluster, correct erroneous bases to the consensus of the cluster. Sequences that are predominantly error-filled may be discarded.
Re-cluster for Analysis: Use the corrected sequences for definitive clonal clustering at the operational threshold (e.g., 99% for IgH).
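The error model and correction steps above can be sketched as a quality-weighted column consensus. This is a minimal illustration under stated assumptions: reads in a cluster are already aligned to equal length, each base is weighted by its Phred score, and only minority bases with quality below a cut-off (20 here, an arbitrary choice) are reverted. It is not the pRESTO implementation, and the function names are hypothetical.

```python
from collections import defaultdict


def quality_weighted_consensus(reads):
    """reads: list of (sequence, phred_scores); returns the quality-weighted consensus."""
    consensus = []
    for col in range(len(reads[0][0])):
        weight = defaultdict(float)
        for seq, quals in reads:
            weight[seq[col]] += quals[col]  # sum Phred scores per base in this column
        consensus.append(max(weight, key=weight.get))
    return "".join(consensus)


def correct_reads(reads, min_qual=20):
    """Revert low-quality bases that disagree with the consensus; keep high-quality variants."""
    cons = quality_weighted_consensus(reads)
    corrected = []
    for seq, quals in reads:
        fixed = "".join(
            c if (b != c and q < min_qual) else b
            for b, c, q in zip(seq, cons, quals)
        )
        corrected.append(fixed)
    return cons, corrected


reads = [
    ("ACGT", [35, 35, 35, 35]),
    ("ACGT", [30, 30, 30, 30]),
    ("ACTT", [30, 30, 8, 30]),   # low-quality mismatch: treated as an error
    ("AGGT", [30, 32, 30, 30]),  # high-quality mismatch: kept as a possible true variant
]
print(correct_reads(reads))
```

Keeping high-quality minority bases is a conservative design choice: in a B cell cluster, a confident mismatch may be a real somatic hypermutation rather than a sequencing error, and reverting it would erase biological signal.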

Visualization of Workflows

Diagram 1: Two Pathways for Error/Duplicate Correction

Diagram 2: Artifact Origin & Impact on Clustering

The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Reagents and Tools for High-Fidelity HILARy Prep

Item	Function/Principle	Example Product/Kit
UMI-Integrated cDNA Synthesis Kit	Incorporates unique molecular identifiers at the earliest step to tag each original mRNA molecule.	Takara Bio SMARTer Human BCR/Ig Profiling Kit; 10x Genomics 5' Immune Profiling.
High-Fidelity PCR Enzyme Mix	Minimizes polymerase-induced errors during library amplification, preserving sequence integrity.	KAPA HiFi HotStart ReadyMix; Q5 High-Fidelity DNA Polymerase.
Dual-Indexed UMI-Compatible Adapters	Enables multiplexing and accurate pairing of reads back to sample and original molecule.	Illumina TruSeq UD Indexes; IDT for Illumina UMI Adapters.
Specialized Analysis Suites	Software toolkits designed for UMI processing, error correction, and AIRR-Seq analysis.	`pRESTO`, `Immcantation` framework, `MiXCR`.
Spike-in Control Libraries	Artificial sequences of known diversity and frequency to quantify duplication and error rates.	ERCC RNA Spike-In Mix; Sequins synthetic genomes.

Within the thesis on HILARy (High-resolution Inference of Lymphocyte Antibody Repertoires) clonal family inference, a primary challenge is ensuring robust analysis from suboptimal input data. Low-quality samples, characterized by low read counts, high PCR error rates, or poor template integrity, and sparse repertoires, with limited clonal diversity or depth, can significantly skew clonal clustering, lineage tree construction, and somatic hypermutation analysis. This document provides application notes and protocols for diagnosing these issues and systematically adjusting analytical parameters in repertoire sequencing (RepSeq) pipelines to maintain biological fidelity.

Diagnostic Criteria and Decision Framework

Before parameter adjustment, accurate diagnosis of data issues is crucial. The following thresholds, derived from current literature and benchmark studies (2023-2024), guide initial assessment.

Table 1: Diagnostic Criteria for Low-Quality and Sparse Repertoire Data

Metric	Optimal Range	Warning Zone	Action Required Zone	Primary Impact on HILARy Inference
Total Sequencing Reads	> 100,000	50,000 - 100,000	< 50,000	Reduced power for rare clone detection; unstable diversity metrics.
Reads per Unique Barcode	> 10	5 - 10	< 5	Inability to confidently correct PCR/sequencing errors via consensus.
Reads Assigned to an Inferred Template	> 80% of reads	60% - 80%	< 60%	High noise-to-signal ratio; false unique variants inflate diversity.
Mean Phred Quality Score	≥ 30	25 - 29	< 25	Increased base-calling errors, misassignment of SHM.
Clonal Richness (Chao1 Estimator)	Study-dependent	50% below control	70% below control	Sparse repertoire; clonal families may be artificially merged.
Minimum Spanning Tree (MST) Connectivity	Well-connected, single component	Multiple fragments	Highly fragmented	Lineage inference fails; SHM pathways are interrupted.
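As an illustration of the clonal richness criterion in the table, the sketch below computes the bias-corrected Chao1 estimator, S_obs + F1(F1 - 1) / (2(F2 + 1)), where F1 and F2 are the numbers of clones observed exactly once and twice, and then classifies a sample against a control using the 50% and 70% deficits listed above. The function names and zone labels are hypothetical.

```python
def chao1(clone_sizes):
    """Bias-corrected Chao1 richness estimate from a list of clone sizes."""
    s_obs = len(clone_sizes)
    f1 = sum(1 for n in clone_sizes if n == 1)  # singletons
    f2 = sum(1 for n in clone_sizes if n == 2)  # doubletons
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))


def richness_zone(sample_richness, control_richness):
    """Classify richness relative to a control using the table's 50% / 70% deficits."""
    deficit = 1 - sample_richness / control_richness
    if deficit >= 0.70:
        return "action required"
    if deficit >= 0.50:
        return "warning"
    return "within range"


sizes = [10, 5, 2, 2, 1, 1, 1, 1]  # 8 observed clones, 4 singletons, 2 doubletons
estimate = chao1(sizes)
print(estimate, richness_zone(estimate, control_richness=40))
```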

The decision to adjust parameters should follow a logical workflow.

Diagram Title: Decision Workflow for Parameter Adjustment

Protocol: Parameter Adjustment for Clustering and Error Correction

This protocol details steps for the immunoClust and Change-O pipelines, commonly used in HILARy frameworks.

Materials & Reagents

Table 2: Research Reagent Solutions & Computational Tools

Item	Function/Description
UMI (Unique Molecular Identifier)-linked RepSeq Library	Enables consensus-based error correction. Critical for low-quality inputs.
PhiX Control V3 (Illumina)	Spiked-in during sequencing for quality monitoring and error rate calibration.
Synthetic Immune Repertoire Spike-ins (e.g., AIRRscape Control Set)	External multiplex PCR controls for quantifying sensitivity and specificity of recovery.
High-Fidelity DNA Polymerase (e.g., Q5, KAPA HiFi)	Used in library amplification steps to minimize PCR errors pre-sequencing.
`immunoClust` (v2.0+)	Adaptive clustering algorithm; key for adjusting distance thresholds.
`Change-O`/`Alakazam` (v1.3.0+)	Suite for calculating SHM, building lineages, and assigning clonal groups.
`scRepertoire` (v1.10.0)	Useful for comparative visualization of sparse vs. dense repertoires.

Step-by-Step Protocol

A. Pre-processing and Quality Control Enhancement

Demultiplexing & UMI Assignment: Use pRESTO (v0.7.0+). For low-quality FASTQs, locate primers with the MaskPrimers.py align subcommand, which tolerates positional shifts and is less stringent than fixed-position scoring.
Consensus Building: For samples with <5 reads/UMI, lower the minimum base quality used by BuildConsensus.py (-q) from 20 to 15. Increase the minimum number of reads required to form a consensus (-n) from 2 to 3 to reduce spurious UMI groups.
Gene Assignment: In IgBLAST, for warning-zone samples, consider using the -num_alignments_V flag to report more germline V gene candidates (e.g., from 3 to 5) for ambiguous reads.

B. Adjusting Clonal Grouping (Clonal Family Inference)

This is the core step for handling sparsity. Default nucleotide distance thresholds may be too stringent.

Reconstruct germlines: Run CreateGermlines.py (Change-O) to reconstruct germline sequences.
Run DefineClones.py with modified parameters:
- For sparse repertoires, relax the distance threshold. If the default is a normalized Hamming distance of 0.15, incrementally increase to 0.18 or 0.20.
- Use --model ham (Hamming) instead of hh_s5f (the human 5-mer SHM targeting model) for shorter, noisier sequences.
- Use --act set, which groups sequences on the full set of ambiguous V and J gene calls rather than only the first call, if V gene assignment is poor.
- Critical: Always run a synthetic spike-in control with the same parameters to measure False Discovery Rate (FDR).

Table 3: Adjusted Clonal Grouping Parameters for Sparse Data

Parameter	Default Value	Adjusted Value (Sparse)	Rationale
Distance Threshold	0.15 (Normalized)	0.18 - 0.22	Prevents over-fragmentation of related sequences with higher error load.
Linkage Method	`single`	`average`	Reduces chaining effects in low-diversity samples.
Minimum Cluster Size	2	1	In very sparse data, singletons may be true, rare clones. Flag for later review.

C. Somatic Hypermutation (SHM) and Lineage Inference

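SHM Calculation: In shazam, use observedMutations with sequenceColumn=sequence_alignment and germlineColumn=germline_alignment_d_mask. Setting frequency=TRUE reports mutation frequency (mutations per informative position) instead of raw counts, which makes sequences of different lengths comparable. For low-quality data, compute mutations only on consensus sequences supported by more than one read, so that single-read errors are not counted as SHM.
Lineage Tree Building: Use Dowser with trees rooted at the inferred germline and a minimum of 1 sequence per node (instead of 2) for fragmented families. Use an igraph layout when visualizing the resulting trees.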

Validation and Reporting

After adjustment, validation is mandatory.

Internal Validation: Compare clonal cluster size distribution (log-log plot) before and after adjustment. A persistent "bulge" at size=1 indicates unresolved sparsity.
External Validation: Use recovery metrics from synthetic spike-ins. Report both sensitivity (true positive rate) and precision (1 - FDR).
Reporting: Document all parameter changes in metadata. Use the AIRR (Adaptive Immune Receptor Repertoire) Community MiAIRR standard for reporting.
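The external validation step can be made concrete with a pairwise scoring sketch. It assumes that each spike-in sequence has a known true family label and an inferred clone label, and it scores every pair of sequences: a pair grouped together in both is a true positive, a pair grouped together only in the inference is a false positive, and a pair grouped together only in the truth is a false negative. Pairwise scoring is one common convention, not necessarily the one used by any specific benchmark; names are hypothetical.

```python
from itertools import combinations


def pairwise_metrics(true_labels, inferred_labels):
    """Return (sensitivity, precision) over all sequence pairs; precision = 1 - FDR."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_inferred = inferred_labels[i] == inferred_labels[j]
        if same_true and same_inferred:
            tp += 1
        elif same_inferred:
            fp += 1   # merged in the inference but unrelated in truth
        elif same_true:
            fn += 1   # related in truth but split by the inference
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    precision = tp / (tp + fp) if tp + fp else float("nan")
    return sensitivity, precision


true_families = ["a", "a", "a", "b", "b", "c"]
inferred_clones = [1, 1, 2, 2, 2, 3]
sensitivity, precision = pairwise_metrics(true_families, inferred_clones)
print(f"sensitivity={sensitivity:.2f} precision={precision:.2f} FDR={1 - precision:.2f}")
```

Running the same function before and after relaxing the distance threshold shows the trade-off directly: a looser threshold should raise sensitivity, and the adjustment is only acceptable if precision does not fall substantially.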

Diagram Title: Post-Adjustment Validation Pathway

Handling low-quality and sparse repertoires requires a disciplined, diagnostic approach. Adjusting analytical parameters—specifically relaxing clonal distance thresholds, modifying UMI consensus rules, and validating with external controls—allows for biologically plausible HILARy inference from suboptimal data. These protocols ensure that conclusions drawn about clonal dynamics, vaccine response, or biomarker discovery remain robust despite technical data limitations. All adjustments must be transparently reported to maintain reproducibility.

Thesis Context: This document, part of a broader thesis on High-throughput Lymphocyte Receptor Analysis (HILARy) for clonal family inference from repertoire sequencing (RepSeq) data, addresses the critical challenge of ambiguous cluster assignments. These ambiguities arise when sequences, particularly those from converging or diverging lineages, exhibit distances that place them at the boundary of defined clonal clusters, complicating accurate lineage reconstruction and clonal tracking.

Quantification of Ambiguity in Clustering

Ambiguity typically manifests when the normalized Hamming or Levenshtein distance between a candidate sequence and two or more pre-defined clonal clusters falls within a poorly discriminant range. The following table summarizes key metrics and thresholds identified from recent literature for defining this "boundary region."

Table 1: Quantitative Boundaries for Ambiguous Cluster Assignment in B/T Cell Receptor Sequencing

Metric	Typical Clonal Threshold	Ambiguous Zone (Boundary)	Common Cause & Implication
Nucleotide Hamming Distance	≤ 0.10 (10% divergence)	0.10 – 0.15	Convergent evolution or shared V-gene motifs; may indicate separate lineages with common ancestors.
Amino Acid Levenshtein Distance	≤ 0.20	0.20 – 0.25	Selection pressure leading to phenotypic convergence; risks merging functionally distinct clones.
SHM (Somatic Hypermutation) Load	Clonal members: Similar SHM patterns	Mismatch in SHM "hotspots" > 30%	Sequences may be from temporally distinct responses (early vs. late germinal center); phylogenetic placement uncertain.
V/J Gene Identity	Must be identical for same clone	Same V gene, different J gene (or vice versa)	Possible lineage relationship vs. independent recombination event.
Cluster Size (No. of Unique Sequences)	Well-defined: > 5 members	Singleton or doubleton sequences	Could be technical artifact, highly expanded low-diversity clone, or true boundary case.
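The nucleotide distance row of the table can be turned into a simple triage rule, sketched below. It assumes each candidate cluster is summarized by one representative junction of the same length as the query, and it uses the 0.10 clonal threshold and the 0.10-0.15 boundary zone from the table. The function names, labels, and toy sequences are hypothetical.

```python
def normalized_hamming(a, b):
    """Fraction of mismatching positions between two equal-length, non-empty strings."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def classify_assignment(seq, clusters, clonal=0.10, boundary=0.15):
    """clusters maps cluster name -> representative junction; returns a triage label."""
    dists = {
        name: normalized_hamming(seq, rep)
        for name, rep in clusters.items()
        if len(rep) == len(seq)  # Hamming distance needs equal lengths
    }
    within = [name for name, d in dists.items() if d <= clonal]
    near = [name for name, d in dists.items() if clonal < d <= boundary]
    if len(within) == 1 and not near:
        return "assigned:" + within[0]
    if within or near:
        return "ambiguous"  # needs phylogenetic or R/S evidence
    return "unassigned"


query = "ACGTACGTACGTACGTACGT"
cluster_a = "ACGTACGTACGTACGTACGA"  # 1/20 mismatches = 0.05
cluster_b = "ACGTACGTACGTACGTATTA"  # 3/20 mismatches = 0.15
print(classify_assignment(query, {"A": cluster_a, "B": cluster_b}))
```

Sequences labelled ambiguous are the ones that would be passed to the phylogenetic and mutation-profile checks in the protocols that follow.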

Strategic Framework for Resolution

A multi-algorithmic, evidence-weighted approach is required to resolve boundary cases. The logical workflow for this decision process is outlined below.

Diagram Title: Decision Workflow for Boundary Sequence Assignment

Detailed Experimental Protocols

Protocol 3.1: Phylogenetic Context Validation for Boundary Sequences

Objective: To determine if a boundary sequence nests monophyletically within an existing clonal cluster or sits basally between clusters.

Materials: See Scientist's Toolkit (Section 5).

Procedure:

Sequence Alignment: For each candidate cluster (A, B, etc.) and the boundary sequence, perform multiple sequence alignment (MSA) of the CDR3 nucleotide region plus 50 flanking bases using MAFFT (--auto preset).
Tree Building: Generate a maximum-likelihood tree from the MSA using IgPhyML, which applies SHM-aware codon substitution models (or FastTree with a nucleotide model such as GTR for rapid assessment).
Clade Assessment: Visualize the tree (e.g., with FigTree). Assess if the boundary sequence:
- Forms a distinct branch sister to a well-defined clade (cluster) with high bootstrap support (>70%).
- Falls within a clade with strong support.
- Sits on a long branch between clusters with low support.
Monophyly Test: Use the ape package in R to test if the boundary sequence and a candidate cluster form a monophyletic group to the exclusion of other clusters.

Protocol 3.2: Silent vs. Replacement Mutation Profiling

Objective: To compare the selection pressure profile of the boundary sequence with candidate clusters, as convergent selection can mimic relatedness.

Procedure:

Translate & Align: Translate nucleotide sequences to amino acids. Back-map the nucleotide sequences onto the amino acid alignment.
Identify Mutations: For the boundary sequence and the consensus of each candidate cluster, compare each to the inferred germline sequence (from partis or IgBLAST).
Categorize: For the FR and CDR regions separately, count:
- Replacement (R) mutations: Nucleotide change alters the amino acid.
- Silent (S) mutations: Nucleotide change does not alter the amino acid.
Calculate & Compare: Compute the R/S ratio for the boundary sequence. Perform a binomial test (using BASELINe or custom script) to determine if the observed R/S deviation from the expected neutral baseline (roughly 3 under random mutation, and somewhat higher in CDRs because of their codon composition) is consistent with the R/S profile of each candidate cluster.
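The categorization step of this protocol can be sketched with a codon-by-codon comparison against the germline. The sketch assumes both sequences are in frame, gap-free, and of equal length, and it makes one simplification that should be stated plainly: when a codon carries several changes, all of them are classified together by the net amino acid outcome, whereas dedicated tools evaluate each mutation in its own context. The function name and toy sequences are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def count_r_s(germline, observed):
    """Return (replacement, silent) mutation counts relative to the germline."""
    r = s = 0
    usable = len(germline) - len(germline) % 3  # ignore a trailing partial codon
    for i in range(0, usable, 3):
        g_codon, o_codon = germline[i:i + 3], observed[i:i + 3]
        if g_codon == o_codon:
            continue
        if g_codon not in CODON_TABLE or o_codon not in CODON_TABLE:
            continue  # skip codons containing gaps or ambiguous bases
        n_changes = sum(a != b for a, b in zip(g_codon, o_codon))
        if CODON_TABLE[g_codon] == CODON_TABLE[o_codon]:
            s += n_changes
        else:
            r += n_changes
    return r, s


germline = "GCTGAAAAATGTCTG"  # A  E  K  C  L
observed = "GCCGATAGATGTTTG"  # A  D  R  C  L
r, s = count_r_s(germline, observed)
print(r, s, r / s if s else None)
```

To follow the protocol, the function would be applied separately to the FR and CDR segments of each sequence, so that region-specific R/S ratios can be compared between the boundary sequence and each candidate cluster.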

Integrated Analysis Pipeline Diagram

Diagram Title: HILARy Boundary Resolution Pipeline

The Scientist's Toolkit: Research Reagent Solutions

Table 2: Essential Tools for Ambiguity Resolution in Clonal Inference

Item / Solution	Function in Protocol	Key Consideration
IgBLAST (NCBI)	Initial sequence annotation (V/D/J genes, SHM).	Provides the foundational annotation for all downstream analysis; requires curated germline databases.
partis (https://github.com/psathyrella/partis)	Probabilistic clustering, germline inference, and lineage modeling.	Gold-standard for model-based assignment; computationally intensive but highly accurate.
Change-O Suite / IgPhyML	Phylogenetic tree construction & selection pressure analysis.	Specialized for immune repertoire data with models of SHM.
SCOPer (Spectral Clustering for clOne Partitioning)	Spectral (graph-based) clustering of sequences into clones.	Effective for identifying rare intermediates that bridge clusters.
immunoSEQ Analyzer (Adaptive Biotechnologies) or VDJtools	Commercial/open-source suite for clustering & diversity analysis.	Provides standardized, reproducible pipelines for initial clustering and ambiguity flagging.
R/Bioconductor (`alakazam`, `shazam`)	Calculation of distances, R/S ratios, and statistical testing.	Essential for custom evidence weighting and visualization.
Synthetic Spiked-in Control Libraries (e.g., from iRepertoire)	Distinguishing technical PCR/sequencing error from true biological variation.	Critical for calibrating distance thresholds in the specific wet-lab protocol used.
Long-Read Sequencing (PacBio HiFi, Oxford Nanopore)	Resolving complex haplotypes and phasing mutations.	Ultimate empirical check for suspected boundary cases by providing full-length, phased sequences.

Application Notes

Efficient computational resource management is critical for HILARy (Hierarchical Inference of Lymphocyte Antigen Receptor families) clonal family inference from large-scale repertoire sequencing (Rep-Seq) datasets. The exponential growth in sequencing depth, often exceeding 1-10 million sequences per sample, presents significant challenges in runtime and memory footprint, directly impacting the scalability and feasibility of large cohort studies in vaccine and therapeutic antibody development.

Core Computational Challenges in HILARy Workflows

HILARy inference involves multiple computationally intensive steps: sequence quality filtering, V(D)J gene annotation, duplicate/error-aware clustering, lineage tree construction, and selection pressure analysis. Each stage has distinct resource profiles.

Table 1: Typical Computational Resource Requirements for Key HILARy Workflow Steps (Per 1 Million Sequences)

Workflow Step	Approx. Runtime (CPU hrs)	Peak Memory (GB)	Primary Bottleneck
Preprocessing & QC	0.5 - 2	4 - 8	I/O, Compression
V(D)J Alignment	5 - 20	16 - 32	Heuristic Search
Clustering (Naive)	10 - 40	30 - 100+	All-vs-All Comparison
Lineage Tree Building	2 - 10	8 - 64	Graph Traversal
Selection Analysis	1 - 5	4 - 16	Statistical Computation

Optimization Strategies and Their Impact

Recent advances in algorithms and data structures offer substantial improvements.

Table 2: Impact of Optimization Strategies on Runtime and Memory

Optimization Strategy	Implementation Example	Typical Runtime Reduction	Typical Memory Reduction
K-mer based pre-clustering	Use of CDR3 k-mer sketches	40-70%	50-80%
Parallelized Alignment	Multi-threaded IgBLAST/MMseqs2	60-85% (on 16 cores)	None; memory rises 10-20% (per-thread overhead)
Probabilistic Data Structures	Bloom filters for unique sequence tracking	~30%	60-90%
Streaming Algorithms	Single-pass clustering (e.g., Alignment-Free)	50-80%	70-95%
Sparse Matrix Operations	For distance calculations in clustering	20-40%	70-85%

Detailed Experimental Protocols

Protocol A: Memory-Efficient Clonal Grouping for HILARy Inference

This protocol details a two-stage clustering approach designed to minimize memory use while maintaining accuracy for clonal family inference.

Materials & Reagents: See "The Scientist's Toolkit" below.

Software: Python 3.9+, SciPy, NumPy, parasail library, khmer toolkit.

Procedure:

Input Preparation:
- Start with error-corrected, V(D)J-aligned nucleotide sequences in AIRR-compliant TSV format.
- Extract required fields: sequence_id, sequence_alignment, v_call, j_call, junction.

Stage 1 - K-mer Sketching & Partitioning (Memory Reduction):
- For each unique junction sequence, generate a minimal perfect hash (e.g., using BBhash).
- Encode each junction into a 4-bit-per-base integer array.
- Apply a streaming k-mer counting algorithm (k=5) using a Count-Min Sketch data structure (depth=5, width=100000).
- Partition sequences into "super-groups" based on shared V gene, J gene, junction length, and presence of ≥2 shared high-frequency (count>3) k-mers. This step reduces the problem space by 80-90%.
Stage 2 - Exact Distance Clustering (Within Partitions):
- For each super-group, load sequences into memory sequentially.
- Calculate pairwise Levenshtein distances only within the super-group using a banded dynamic programming algorithm (bandwidth=7), optimized via SIMD instructions (parasail.nw_banded).
- Perform hierarchical clustering using a single-linkage criterion with a normalized Levenshtein distance threshold (typically 0.10-0.15 of junction length).
- Output clonal cluster assignments to a new AIRR-formatted file.
Validation & Merge (Optional):
- Perform a lightweight, all-vs-all comparison of cluster centroids across super-groups to merge rare cross-partition families.
- This step uses a pre-computed index of centroids and is typically <5% of total runtime.

Expected Outcomes: This protocol processes 10 million sequences in under 6 hours using <32 GB RAM on a standard 16-core server, compared to >72 hours and >100 GB RAM for a naive all-vs-all approach.
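The Stage 1 k-mer counting idea can be illustrated with a small, standard-library Count-Min Sketch. The depth, width, k, and count cut-off follow the values given in the protocol; the choice of salted BLAKE2b as the hash family, the class and function names, and the toy junctions are assumptions made for this sketch. A Count-Min Sketch can overestimate counts because of hash collisions but never underestimates them, so the filter may admit a few extra k-mers and will not miss frequent ones.

```python
import hashlib


class CountMinSketch:
    """Fixed-memory approximate counter: estimates are never below the true count."""

    def __init__(self, depth=5, width=100000):
        self.depth, self.width = depth, width
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        # One salted hash per row; the salt makes the rows behave as independent hashes.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=row.to_bytes(8, "little")
            ).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, item):
        for row, col in self._cells(item):
            self.table[row][col] += 1

    def estimate(self, item):
        return min(self.table[row][col] for row, col in self._cells(item))


def kmers(seq, k=5):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def shared_frequent_kmers(a, b, sketch, k=5, min_count=3):
    """Number of k-mers present in both junctions whose estimated count exceeds min_count."""
    def frequent(seq):
        return {km for km in set(kmers(seq, k)) if sketch.estimate(km) > min_count}
    return len(frequent(a) & frequent(b))


junctions = (
    ["TGTGCGAGAGATTACTGG"] * 3
    + ["TGTGCGAGAGATTATTGG"] * 2
    + ["TGTACCCCCGGGTTTTGG"]
)
sketch = CountMinSketch()
for junction in junctions:          # single streaming pass over the data
    for km in kmers(junction):
        sketch.add(km)

# Two junctions join the same super-group if they share >= 2 frequent k-mers.
print(shared_frequent_kmers(junctions[0], junctions[3], sketch))
print(shared_frequent_kmers(junctions[0], junctions[5], sketch))
```

The memory used by the sketch is fixed by depth and width regardless of how many distinct k-mers are streamed through it, which is the property that keeps Stage 1 within a small memory budget.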

Protocol B: Runtime-Optimized Parallel Alignment for Large Batches

This protocol leverages distributed computing for the V(D)J alignment step, often the initial bottleneck.

Procedure:

Data Chunking:
- Split the raw FASTQ/FASTA file into smaller chunks of 100,000 sequences each using fastp or a custom Python script with gzip compression.
- Record chunk boundaries and sequence IDs in a manifest file.

Distributed Alignment Job Submission:
- Using a workload manager (e.g., SLURM, SGE), submit one array job per chunk.
- Each job runs a locally installed alignment tool (e.g., IgBLAST or MiXCR) with a pre-built, localized reference database.
- Critical: Ensure all jobs write temporary files to local node SSD storage, not network drives.
Result Aggregation & Deduplication:
- As jobs complete, a master process aggregates results into a single AIRR TSV.
- Use a persistent, disk-backed key-value store or embedded database (e.g., sqlite3, or Redis with persistence enabled) to identify and merge duplicate alignments for identical sequences across chunks during aggregation, preventing a final O(n²) step.
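The aggregation step can be sketched with Python's built-in sqlite3 module. The sketch keys each alignment on its nucleotide sequence, so a sequence that reappears in a later chunk only increments a counter instead of adding a row. The table layout, the function name, and the toy rows are assumptions for illustration; duplicate_count is used as the counter column name following the AIRR field of that name. An in-memory database is the default here so the example is self-contained; a file path would be passed in practice to keep the table on disk.

```python
import sqlite3


def aggregate_chunks(chunks, db_path=":memory:"):
    """Merge per-chunk alignment rows (sequence, v_call, j_call, junction) by sequence."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS alignments ("
        "sequence TEXT PRIMARY KEY, v_call TEXT, j_call TEXT, "
        "junction TEXT, duplicate_count INTEGER NOT NULL)"
    )
    for chunk in chunks:
        for row in chunk:
            # Insert the alignment once; repeats are ignored thanks to the primary key.
            conn.execute("INSERT OR IGNORE INTO alignments VALUES (?, ?, ?, ?, 0)", row)
            # Count every occurrence, including the first.
            conn.execute(
                "UPDATE alignments SET duplicate_count = duplicate_count + 1 "
                "WHERE sequence = ?",
                (row[0],),
            )
        conn.commit()  # one transaction per chunk
    return conn


chunk_1 = [("ACGT", "IGHV1-69", "IGHJ4", "TGT"), ("GGCC", "IGHV3-23", "IGHJ6", "TGC")]
chunk_2 = [("ACGT", "IGHV1-69", "IGHJ4", "TGT"), ("TTAA", "IGHV4-34", "IGHJ5", "TGA")]
conn = aggregate_chunks([chunk_1, chunk_2])
for record in conn.execute("SELECT sequence, duplicate_count FROM alignments ORDER BY sequence"):
    print(record)
```

Because the primary key is indexed, each lookup costs roughly logarithmic time in the number of stored sequences, so deduplication happens incrementally as chunks arrive rather than in a separate all-against-all pass at the end.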

Diagrams

Figure: HILARy Optimization Pipeline (workflow diagram).

Figure: Strategy Trade-off Space (memory vs. runtime trade-off across strategies).

The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Computational Tools & Resources for Rep-Seq Optimization

Item Name	Primary Function	Key Application in HILARy Optimization
MMseqs2	Ultra-fast protein/nt search & clustering	Enables fast, sensitive pre-clustering of sequences before detailed alignment, reducing load on IgBLAST.
IgBLAST (w/ MPI)	V(D)J sequence alignment	The gold-standard aligner, parallelized with MPI to distribute queries across cores/nodes.
Bloom-filter Libraries (pybloom)	Probabilistic membership testing	Tracks seen sequences/patterns in constant memory, eliminating redundant comparisons.
UCX & OpenMPI	High-performance inter-process communication	Critical for low-latency data transfer in distributed alignment workflows on HPC clusters.
Zarr / HDF5 Formats	Chunked, compressed array storage	Stores massive sequence distance matrices on disk with efficient partial I/O, avoiding RAM limits.
Snakemake / Nextflow	Workflow management	Orchestrates complex, multi-step pipelines with automatic resource request and checkpointing.
Intel ISA-L / SSE/AVX	Hardware-accelerated string kernels	Optimizes core edit-distance and hashing calculations via CPU SIMD instructions.
NumPy / SciPy Sparse	Sparse matrix operations	Efficiently represents and computes on sparse sequence similarity graphs.

Within the broader thesis on HILARy for B-cell clonal family inference from Rep-Seq data, downstream analyses depend on reliable multi-omics integration. These application notes provide detailed protocols for converting, validating, and analyzing HILARy's clonal lineage outputs within established immunological and genomic pipelines, enabling systems-level investigation of adaptive immune responses.

HILARy Output Formats and Conversion Protocols

HILARy generates primary outputs detailing inferred clonal families, phylogenetic trees, and mutation annotations. Direct compatibility with downstream tools requires structured conversion.

Primary Output Structure

Table 1: Core HILARy Output Files and Descriptions

File Name	Format	Key Content	Primary Use
`clonal_families.tsv`	TSV	Clone ID, Sequence ID, Isotype, V/J gene, CDR3	Core clonal grouping
`lineage_trees.nwk`	Newick	Phylogenetic tree per clone	Lineage visualization & evolution
`mutations.json`	JSON	Nucleotide/AA substitutions per branch	Somatic hypermutation analysis
`convergence_groups.txt`	Text	Groups of clones with similar CDR3s	Repertoire convergence detection

Protocol: Conversion to AIRR-Compliant Format

The Adaptive Immune Receptor Repertoire (AIRR) Community standards ensure cross-tool compatibility.

Materials:

Input: HILARy clonal_families.tsv, original sequence alignment files.
Software: Python 3.9+, pandas library, airr standards library.
Reference: AIRR Rearrangement schema (v1.4).

Procedure:

Load the HILARy clonal assignments and the original sequencing data (e.g., in changeo or immunarch compatible format).
Map the sequence_id column from HILARy to the sequence_id in the AIRR-formatted TSV.
Create a new column clone_id in the AIRR TSV, populating it with HILARy's assignments. Use -1 for singletons not assigned to a clone.
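For clones with lineage trees, create a separate airr-trees.json file. Store each Newick string in a record that follows the AIRR Tree schema (which carries the tree in its newick field), and use the Biopython Bio.Phylo module to parse each tree and confirm it is well-formed before export.
Validate the output files using the airr-tools validate command-line utility (from the Python airr library) or the validation functions in the airr R package.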
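
The mapping step of this procedure can be sketched with the standard csv module. The sketch assumes that both tables are tab-separated and that the HILARy file exposes columns named sequence_id and clone_id; those column names are assumptions, so adjust them to the actual output. Unmatched sequences receive -1, following the convention stated above.

```python
import csv
import io


def add_clone_ids(airr_handle, hilary_handle, out_handle, missing="-1"):
    """Copy AIRR rows to out_handle with a clone_id column; return the unmatched count."""
    clone_of = {row["sequence_id"]: row["clone_id"]
                for row in csv.DictReader(hilary_handle, delimiter="\t")}
    reader = csv.DictReader(airr_handle, delimiter="\t")
    fields = list(reader.fieldnames)
    if "clone_id" not in fields:
        fields.append("clone_id")
    writer = csv.DictWriter(out_handle, fieldnames=fields, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    unmatched = 0
    for row in reader:
        row["clone_id"] = clone_of.get(row["sequence_id"], missing)
        unmatched += row["clone_id"] == missing
        writer.writerow(row)
    return unmatched


# Tiny in-memory demonstration with made-up records.
airr = io.StringIO("sequence_id\tv_call\tjunction\n"
                   "seq1\tIGHV1-2*01\tTGTGCG\n"
                   "seq2\tIGHV3-23*04\tTGTGCA\n")
hilary = io.StringIO("sequence_id\tclone_id\nseq1\tCF_01\n")
out = io.StringIO()
n_unmatched = add_clone_ids(airr, hilary, out)
```

Reporting the unmatched count is a cheap sanity check: a large value usually points to mismatched sequence identifiers between the two files rather than to genuine singletons.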

HILARy to AIRR Standards Conversion Workflow

Protocol for Multi-Omic Integration with Transcriptomic Data

Correlating clonal expansion with gene expression profiles from single-cell RNA sequencing (scRNA-seq) reveals functional states of expanded B-cell clones.

Experimental Design & Reagent Solutions

Table 2: Key Reagents for Linked BCR-seq & scRNA-seq

Reagent / Solution	Vendor (Example)	Function in Multi-Omics Workflow
10x Genomics Chromium Next GEM Single Cell 5' v2	10x Genomics	Partitions single cells for co-encapsulation of mRNA and V(D)J transcripts.
Feature Barcoding technology (CellPlex or Antibody)	10x Genomics	Allows sample multiplexing, critical for pooling patients/conditions pre-assay.
BD Rhapsody BCR Single-Cell Analysis System	BD Biosciences	Alternative platform for coupled whole transcriptome and targeted BCR amplification.
SMARTer V(D)J Reagents for T and B Cells	Takara Bio	Provides template-switching for full-length V(D)J enrichment in plate-based protocols.
Cell Hashing Antibodies (TotalSeq-B)	BioLegend	Antibodies conjugated to oligonucleotide barcodes for sample multiplexing prior to 10x runs.

Protocol: Clonal Tracing in scRNA-seq Data

This protocol assumes scRNA-seq data with paired BCR amplification (e.g., from 10x Genomics Cell Ranger) has been generated.

Materials:

Input: filtered_contig_annotations.csv (from Cell Ranger VDJ), gene_expression_matrix (from Cell Ranger Count).
Software: R (v4.2+), Seurat (v5.0), scRepertoire (v1.10), immunarch.
Reference: HILARy-derived clone list for the same donor.

Procedure:

Load Data: Create a Seurat object from the gene expression matrix. Separately, load the VDJ contig data using scRepertoire::combineBCR(), the B-cell counterpart of combineTCR().
Merge Assays: Add the clonal information as a new assay or metadata to the Seurat object using scRepertoire::combineExpression().
Cross-Reference with HILARy:
- Extract the CDR3 nucleotide sequences and associated V/J genes for clones of interest from HILARy output.
- Query the single-cell VDJ data for cells containing matching CDR3 sequences and V/J genes. Use a fuzzy matching algorithm (allowing 1-2 nucleotide mismatches) to account for sequencing errors.
- Create a new metadata column (e.g., hilary_clone) in the Seurat object, labeling cells belonging to HILARy-inferred clones.
Downstream Analysis: Subset the Seurat object to focus on cells from expanded clones. Perform differential gene expression (DGE) using Seurat::FindMarkers() between cells of an expanded clone versus all other B cells. Conduct pathway enrichment analysis on DGE results using clusterProfiler.
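
The fuzzy cross-referencing step is independent of the R tooling and can be prototyped on its own. The sketch below is written in Python for illustration and assumes substitution-only mismatches between CDR3 sequences of equal length; the tuple layouts and function names are hypothetical. The resulting barcode-to-clone mapping could then be exported and attached as a metadata column.

```python
def mismatches(a, b, limit):
    """Count mismatches between equal-length strings, stopping once past `limit`."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            count += 1
            if count > limit:
                break
    return count


def label_cells(cells, clones, max_mismatch=2):
    """Return {barcode: clone_id} for cells whose V gene, J gene and CDR3 match a clone.

    cells:  iterable of (barcode, v_gene, j_gene, cdr3_nt)
    clones: iterable of (clone_id, v_gene, j_gene, cdr3_nt)
    """
    index = {}
    for clone_id, v, j, cdr3 in clones:
        index.setdefault((v, j, len(cdr3)), []).append((clone_id, cdr3))

    labels = {}
    for barcode, v, j, cdr3 in cells:
        best = None
        for clone_id, ref in index.get((v, j, len(cdr3)), []):
            d = mismatches(cdr3, ref, max_mismatch)
            if d <= max_mismatch and (best is None or d < best[0]):
                best = (d, clone_id)
        if best is not None:
            labels[barcode] = best[1]
    return labels
```

Indexing reference clones by V gene, J gene, and CDR3 length keeps the comparison count small, since each cell is only compared against candidates that could plausibly match.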

Clonal Tracing in scRNA-seq Data Workflow

Protocol for Integrating Clonal Dynamics with Clinical Metadata

Longitudinal analysis links clone expansion/contraction to patient treatment and outcome.

Protocol: Longitudinal Clone Tracking Dashboard

Materials:

Input: HILARy outputs across multiple timepoints, Patient clinical data (CSV).
Software: R tidyverse, shiny, ggiraph, survival.
Database: (Optional) SQLite database for large cohort data.

Procedure:

Data Harmonization: For each patient/timepoint, load the HILARy clonal_families.tsv. Calculate clonal metrics: Shannon diversity, clone size distribution, largest clone fraction.
Merge with Clinical Data: Create a master data frame linking each sample's clonal metrics to clinical variables (e.g., therapy received, disease activity score, response status).
Build Shiny Application:
- UI: Create input selectors for Patient ID, Clone ID, Time Range. Define outputs: plot for clone size over time, heatmap of top clones across timepoints, Kaplan-Meier plot panel.
- Server Logic:
  - For a selected clone, retrieve its relative frequency at all timepoints and plot against clinical lab values (e.g., serum IgG titer).
  - Generate a survival plot where patients are stratified by the presence/absence or size of a specific immunodominant clone at baseline, using the survival::survfit() function.
Deploy: Deploy the dashboard locally or on a secure server for collaborative clinical review.
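
The clonal metrics named in the Data Harmonization step are straightforward to compute from per-sequence clone labels. The following sketch is written in Python for illustration (the dashboard itself is described in R); the function name and the returned keys are assumptions, and Shannon diversity is reported in nats (natural logarithm).

```python
import math
from collections import Counter


def clonal_metrics(clone_ids):
    """Summarize one sample from its per-sequence clone labels."""
    sizes = Counter(clone_ids)
    total = sum(sizes.values())
    if total == 0:
        raise ValueError("at least one sequence is required")
    freqs = [n / total for n in sizes.values()]
    return {
        "n_sequences": total,
        "n_clones": len(sizes),
        "shannon": -sum(p * math.log(p) for p in freqs),
        "largest_clone_fraction": max(freqs),
        # Maps clone size -> number of clones of that size.
        "size_distribution": dict(Counter(sizes.values())),
    }
```

Computing these values once per patient and timepoint yields one row per sample, which is the shape needed for the merge with clinical variables.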

Longitudinal Clinical Integration Dashboard

Validation Protocol: Cross-Tool Clonal Concordance

Ensuring HILARy's inferences are robust requires benchmarking against other clonal grouping tools.

Experimental Protocol

Materials:

Input: Public benchmark dataset (e.g., from Observed Antibody Space) or in-house Rep-Seq data.
Software: HILARy, changeo-clone (DefineClones.py), partis, scoper (R).
Compute: High-performance computing cluster recommended.

Procedure:

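Data Processing: Starting from the same filtered AIRR Rearrangement file, generate clonal groupings with each tool. Keep each tool's other settings at their defaults, but hold the V/J gene matching rule and the CDR3 nucleotide distance threshold identical across tools wherever these are configurable.
Comparison Metric Calculation: Use custom scripts (for example, mclust::adjustedRandIndex() in R or sklearn.metrics.adjusted_rand_score in Python) to compute: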
- Pairwise Adjusted Rand Index (ARI) between tools.
- Jaccard index for specific large clones.
- Run-time and memory usage for each tool on identical hardware.
Ground Truth Comparison: If using simulated data with known clonal origins, calculate precision, recall, and F1-score for each tool's ability to recover true clones.
Visualization: Create a clustered heatmap of ARI scores and a bar chart of performance metrics.
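
For readers who prefer to see the concordance metrics spelled out, the sketch below computes the Adjusted Rand Index from the pair counts of the contingency table, along with a Jaccard index for comparing the membership of a single clone between tools. It uses only the Python standard library; the function names are illustrative.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """ARI between two partitions given as parallel label sequences."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must have the same length")
    n = len(labels_a)
    if n < 2:
        return 1.0
    pairs_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    pairs_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = pairs_a * pairs_b / comb(n, 2)
    maximum = (pairs_a + pairs_b) / 2
    if maximum == expected:  # degenerate case, e.g. both partitions all singletons
        return 1.0
    return (pairs_both - expected) / (maximum - expected)


def jaccard(members_a, members_b):
    """Jaccard index between the sequence-ID sets of two clones."""
    a, b = set(members_a), set(members_b)
    return len(a & b) / len(a | b) if (a | b) else 1.0
```

ARI is invariant to how each tool names its clones, so the labels from different tools can be compared directly without any renumbering.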

Table 3: Example Cross-Tool Concordance Results (Simulated Data)

Tool Comparison (A vs B)	Adjusted Rand Index (ARI)	Mean Jaccard of Top 10 Clones	Relative Runtime
HILARy vs. changeo-clone	0.92	0.88	1.5x
HILARy vs. partis	0.85	0.79	0.3x
HILARy vs. scoper	0.95	0.91	2.1x

Advanced Pathway: Structural Modeling of Convergent Antibodies

Integrating HILARy-identified convergent responses with structural prediction tools.

Protocol: From Clone to Structure

Materials:

Input: HILARy convergence_groups.txt, annotated AIRR file.
Software: ANARCI for domain annotation, IgFold or ABodyBuilder for structure prediction, PyMOL for visualization.
Web Service: NCBI IgBLAST for germline gene identification.

Procedure:

Select a convergence group from HILARy output containing clones from multiple subjects with similar CDR3s.
For each clone representative, use ANARCI to assign IMGT numbering and identify framework and CDR regions.
Submit the heavy and light chain Fv sequences to IgFold (via local install or API) to generate a predicted 3D model in PDB format.
Superimpose the predicted structures of convergent antibodies in PyMOL to analyze structural commonalities in paratope geometry.
Dock the consensus structural model to a target antigen (if known) using ZDOCK or HADDOCK to hypothesize epitope specificity.

These application notes provide an actionable framework for integrating HILARy's clonal inferences into multi-omics workflows, extending their value within a thesis on B-cell repertoire dynamics and supporting translational discoveries in immunology and drug development.

Benchmarking HILARy: Validation Strategies and Comparative Analysis with Alternative Tools

Within the broader thesis on HILARy clonal family inference from repertoire sequencing, establishing robust ground truth is essential. These application notes and protocols detail methodologies for generating and validating synthetic immune receptor repertoires, alongside functional validation using engineered in vitro cell line systems. These approaches provide controlled datasets to benchmark clonal clustering, lineage reconstruction, and diversity estimation algorithms, directly addressing key challenges in therapeutic antibody discovery and immune monitoring.

High-throughput sequencing of B-cell and T-cell receptor repertoires enables insights into adaptive immune responses. However, computational inference of clonal families—groups of lymphocytes descended from a common ancestor—suffers from ambiguous validation due to the lack of known truth sets in biological samples. Synthetic repertoires with pre-defined clonal structures and in vitro cell lines with known antigen specificities provide essential validation frameworks to assess the accuracy, sensitivity, and specificity of tools like HILARy.

Research Reagent Solutions Toolkit

The following table lists essential reagents and resources for conducting ground truth validation experiments.

Item Name	Supplier/Catalog Example	Function in Validation
Synthetic V(D)J Reference Standards	e.g., AIRR-seq Control Library (LegoChem)	Provides DNA/RNA mixes with known clonal families, V/D/J usage, and mutation profiles for sequencing platform and pipeline calibration.
gBlock Gene Fragments	Integrated DNA Technologies (IDT)	Custom double-stranded DNA fragments used to construct synthetic immune receptor genes with specified mutations for clonal lineage simulation.
HEK 293T Cell Line	ATCC CRL-3216	Highly transfectable cell line used for in vitro expression of synthetic antibody or TCR libraries for functional screening.
pFUSE Vectors	Invivogen	Modular antibody expression plasmids (IgG, Fab) for cloning synthetic variable regions into constant domain backbones.
Fluorescent Antigen Probes	e.g., MHC Dextramers (Immudex)	Multimeric peptide-MHC complexes conjugated to fluorophores for staining and sorting T-cells with known antigen specificity.
Cell Sorting Buffers	BD Pharmingen Stain Buffer	PBS-based buffers with fetal bovine serum to maintain cell viability during fluorescence-activated cell sorting (FACS) based on antigen binding.
Next-Gen Sequencing Kit	Illumina MiSeq v3 (600-cycle)	Provides sufficient read length for full-length variable region sequencing of paired heavy and light chains.
UMI Adapter Kit	NEBNext Multiplex Oligos for Illumina	Adds unique molecular identifiers (UMIs) to cDNA during library prep to correct for PCR amplification bias and sequencing errors.

Protocols

Protocol A: Generation and Sequencing of a Synthetic Repertoire with Defined Clonality

Objective: To create a DNA library mimicking a B-cell receptor repertoire with pre-defined clonal families, somatic hypermutations, and abundances for benchmarking HILARy’s clustering performance.

Materials:

gBlock gene fragments (IDT)
Q5 High-Fidelity DNA Polymerase (NEB)
MiSeq Reagent Kit v3 (Illumina)
UMI Adapter Kit

Procedure:

Clonal Family Design: Design 10-50 distinct "founder" heavy-chain V(D)J sequences. For each founder, generate 5-20 "descendant" sequences by introducing random point mutations (simulating SHM) at a defined rate (e.g., 2-10%).
Sequence Synthesis: Order all designed variable region sequences as gBlock fragments, each flanked by amplification primers and restriction sites.
Library Assembly: Amplify each gBlock via PCR. Gel-purify and pool fragments in a stratified manner to simulate realistic clonal frequency distributions (e.g., a few dominant clones, many rare clones). Ligate into a linearized vector backbone.
Next-Generation Sequencing: Prepare sequencing libraries from the plasmid pool using a kit that incorporates UMIs. Sequence on an Illumina MiSeq platform with 2x300bp paired-end reads to ensure full coverage of the variable region.
Ground Truth Table Generation: Create a comprehensive table mapping every synthesized DNA sequence to its assigned clonal family founder.
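
The design and ground-truth steps can be prototyped in silico before ordering any fragments. The sketch below makes two simplifying assumptions that should be stated plainly: mutations are uniform random substitutions (real SHM is biased toward hotspot motifs), and every descendant is mutated directly from its founder rather than along a branching tree. All names and the CF_nn identifier scheme are illustrative.

```python
import random

BASES = "ACGT"


def mutate(seq, rate, rng):
    """Return a copy of seq with each base substituted with probability `rate`."""
    out = []
    for base in seq:
        if rng.random() < rate:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)


def design_families(founders, n_descendants, rate, seed=0):
    """Build a ground-truth table of (sequence_id, family_id, sequence) rows."""
    rng = random.Random(seed)  # fixed seed keeps the design reproducible
    table = []
    for f_idx, founder in enumerate(founders, start=1):
        family = f"CF_{f_idx:02d}"
        table.append((f"{family}_founder", family, founder))
        for d_idx in range(1, n_descendants + 1):
            table.append((f"{family}_d{d_idx:02d}", family,
                          mutate(founder, rate, rng)))
    return table
```

Because the family label is assigned at design time, the returned table is itself the ground truth against which inferred clusters are later scored.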

Table 1: Synthetic Repertoire Ground Truth Summary

Clonal Family ID	Founder V/J Genes	Number of Unique Sequences	Avg. Mutation Rate (%)	Designed Frequency in Pool (%)
CF_01	IGHV1-2*01, IGHJ4*01	15	5.2	12.5
CF_02	IGHV3-23*04, IGHJ6*02	8	3.7	8.1
CF_03	IGHV4-34*01, IGHJ5*01	22	8.9	5.4
...	...	...	...	...
CF_48	IGHV5-51*03, IGHJ3*02	5	2.1	0.1

Protocol B: In Vitro Validation Using an Engineered Antigen-Specific B-Cell Line

Objective: To functionally validate clonal families inferred by HILARy by expressing paired heavy and light chains from a putative family and testing for shared antigen specificity.

Materials:

HEK 293T cells
pFUSE IgG expression vectors
Recombinant target antigen (e.g., HIV gp120)
FACS buffer, anti-human IgG Fc detection antibody

Procedure:

Candidate Selection: From a biological repertoire sequenced and analyzed by HILARy, select 3-5 representative sequences from a computationally inferred clonal family.
Antibody Expression: Clone the heavy and light chain variable regions for each selected sequence into pFUSE-IgG1 vectors. Co-transfect HEK 293T cells in separate wells with heavy/light chain plasmid pairs.
Supernatant Harvest: Culture transfected cells for 5-7 days. Harvest cell culture supernatant containing secreted IgG.
Antigen Binding Assay (ELISA): Coat ELISA plates with target antigen. Add supernatants. Detect bound IgG using an enzyme-conjugated anti-human Fc antibody. Include positive (known binder) and negative (untransfected supernatant) controls.
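Data Interpretation: Consistent antigen binding across antibodies derived from the same HILARy-inferred family supports the clonal grouping functionally. A discrepancy may indicate an inference error, but it can also arise when somatic hypermutation abolishes binding in a genuine family member, so non-binders warrant follow-up rather than automatic exclusion.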

Table 2: In Vitro Binding Results for HILARy-Inferred Clonal Family #7

Test Antibody (Sequence ID)	Clonal Family Assignment (HILARy)	ELISA OD450 (Mean ± SD)	Antigen Binding Positive?
BioSeq145	CF_07	2.34 ± 0.21	Yes
BioSeq149	CF_07	1.89 ± 0.15	Yes
BioSeq152	CF_07	0.08 ± 0.02	No
BioSeq160	CF_07	2.01 ± 0.18	Yes
Positive Control	N/A	2.50 ± 0.10	Yes
Negative Control	N/A	0.05 ± 0.01	No

Visualizations

Figure: Ground Truth Validation Workflow for HILARy

Figure: HILARy Inference Pipeline with Validation Points

Within the HILARy clonal family inference framework, accurate evaluation of algorithm performance is essential for advancing repertoire sequencing research and its applications in immunology and therapeutic discovery. This protocol details the standardized application of the core metrics (Precision, Recall, and Computational Efficiency) to assess and compare clonal inference tools.

Core Performance Metrics: Definitions & Quantitative Benchmarks

Table 1: Core Metric Definitions and Formulas

Metric	Definition	Formula
Precision	The fraction of inferred clonal relationships that are correct (True Positives) out of all inferred relationships. Measures correctness.	Precision = TP / (TP + FP)
Recall (Sensitivity)	The fraction of all true clonal relationships that are correctly identified by the inference algorithm. Measures completeness.	Recall = TP / (TP + FN)
F1-Score	The harmonic mean of Precision and Recall, providing a single balanced metric.	F1 = 2 * (Precision * Recall) / (Precision + Recall)
Computational Efficiency	The computational resources required for analysis, typically measured as wall-clock time and peak memory (RAM) usage.	Time (seconds), Memory (GB)

Table 2: Example Benchmark Results for Select Inference Tools

Data sourced from recent benchmarking studies (e.g., Immcantation framework, DANGER comparisons).

Tool / Algorithm	Precision	Recall	F1-Score	Time (min)	Memory (GB)
Partis	0.95	0.85	0.90	120	8.2
SCOPer	0.92	0.88	0.90	95	6.5
Hierarchical Clustering	0.80	0.95	0.87	45	4.0
IGH-DATA	0.98	0.75	0.85	180	12.0

Experimental Protocols

Protocol 1: Generating a Ground Truth Dataset for Metric Calculation

Objective: To create a validated set of clonal families from synthetic or spike-in control data to serve as the benchmark for calculating Precision and Recall.

Materials: See "The Scientist's Toolkit" below.

Procedure:

Synthetic Repertoire Generation: Use a tool like IGH-SIM or SONAR to generate a synthetic adaptive immune receptor repertoire (AIRR) dataset.
- Input parameters must include a known, predefined number of distinct clonal families (germline sequences).
- Introduce realistic levels of somatic hypermutation (SHM) and sequencing errors per experimental design.
Data Processing: Process the raw synthetic reads through a standardized pipeline (e.g., pRESTO, IgBLAST) for quality control, V(D)J alignment, and generation of Change-O formatted tables.
Ground Truth Annotation: Using the simulation metadata, annotate each sequence in the processed dataset with its known clone of origin. This annotated file is the ground truth.
Clonal Inference: Run the clonal inference algorithm(s) under test (e.g., using Change-O's DefineClones.py) on the processed, but un-annotated, synthetic data to generate group assignments for each sequence.
Metric Calculation: Use a script (e.g., in R with shazam and dplyr) to compare algorithm assignments against the ground truth.
- A True Positive (TP) is a pair of sequences placed in the same inferred clone that share the same ground truth label.
- A False Positive (FP) is a pair placed in the same inferred clone but with different ground truth labels.
- A False Negative (FN) is a pair placed in different inferred clones but sharing the same ground truth label.
- Calculate Precision, Recall, and F1-Score using the formulas in Table 1.
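
The pairwise definitions above do not require enumerating every pair explicitly. A minimal Python sketch, assuming two parallel label lists (ground truth and inferred) in the same sequence order, counts co-clustered pairs with binomial coefficients instead; the function name and return keys are illustrative.

```python
from collections import Counter
from math import comb


def pairwise_scores(truth, inferred):
    """Pairwise precision, recall and F1 from two parallel label sequences."""
    if len(truth) != len(inferred):
        raise ValueError("label lists must have the same length")
    # Pairs sharing both the true label and the inferred label are TPs.
    tp = sum(comb(c, 2) for c in Counter(zip(truth, inferred)).values())
    same_inferred = sum(comb(c, 2) for c in Counter(inferred).values())
    same_truth = sum(comb(c, 2) for c in Counter(truth).values())
    fp = same_inferred - tp  # co-clustered by the tool, different true clone
    fn = same_truth - tp     # same true clone, split by the tool
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}
```

Counting by group size runs in time linear in the number of sequences, whereas looping over all pairs would be quadratic and impractical at repertoire scale.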

Protocol 2: Benchmarking Computational Efficiency

Objective: To reproducibly measure the runtime and memory consumption of a clonal inference tool.

Materials: Computing infrastructure (HPC, cloud, or local server), containerization software (Docker/Singularity), system monitoring tool (/usr/bin/time, psrecord).

Procedure:

Environment Standardization: Containerize the analysis pipeline using Docker to ensure consistent dependency versions across runs.
Dataset Curation: Prepare a series of input files (in AIRR .tsv format) of increasing size (e.g., 10^3, 10^4, 10^5, 10^6 sequences).
Execution & Profiling: For each input size:
- Use the time command (e.g., /usr/bin/time -v) to execute the core clonal inference command.
- Record the "Elapsed (wall clock) time" and "Maximum resident set size (kbytes)".
- Optionally, use a profiler like psrecord to graph CPU and memory usage over time.
Data Aggregation: Plot runtime and peak memory usage against input size on log-log axes. The fitted slope estimates the scaling exponent (about 1 for linear scaling, about 2 for quadratic), which is a critical factor for large-scale repertoire studies.
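
To turn the scalability plot into a single number, one option is to fit a straight line to the logarithm of the measurement against the logarithm of the input size. The sketch below does this with an ordinary least-squares slope in plain Python; the function name is illustrative, and the measurements would come from the profiling runs described above.

```python
import math


def scaling_exponent(sizes, values):
    """Least-squares slope of log(values) against log(sizes)."""
    if len(sizes) != len(values) or len(sizes) < 2:
        raise ValueError("need at least two paired measurements")
    xs = [math.log(s) for s in sizes]
    ys = [math.log(v) for v in values]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx
```

An exponent close to 2 suggests that an all-vs-all comparison dominates the run, while a value near 1 indicates that partitioning or indexing is keeping the workload roughly linear.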

Visualizations

Figure: Performance Evaluation Workflow

Figure: Relationship Between Core Metrics

The Scientist's Toolkit

Table 3: Essential Research Reagents & Solutions for Performance Benchmarking

Item	Function/Description
Synthetic Repertoire Simulators (e.g., IGH-SIM, SONAR)	Generates ground truth AIRR-seq data with known clonal relationships for controlled benchmarking.
AIRR-Compliant Data Files (.tsv)	Standardized input/output format (via AIRR Community) ensuring interoperability between tools.
Container Images (Docker/Singularity)	Provides reproducible, version-controlled computational environments (e.g., Immcantation, VDJServer images).
Benchmarking Suites (e.g., DANGER, ImmBench)	Curated scripts and datasets for standardized performance comparison across multiple algorithms.
High-Performance Computing (HPC) Resources	Essential for running efficiency benchmarks on large datasets, measuring scalability.
AIRR Tools (pRESTO, IgBLAST, Change-O)	Core software suite for processing raw reads, performing V(D)J alignment, and basic clonal grouping.
R/Python Packages (shazam, dplyr, scipy, pandas)	Libraries for calculating metrics, statistical analysis, and visualizing benchmarking results.

Application Notes

Clonal family inference from B-cell receptor (BCR) repertoire sequencing is a foundational step in immunoinformatics, enabling the study of adaptive immune responses, antibody discovery, and lymphoid cancer phylogenetics. This analysis compares four prominent tools—HILARy, partis, Change-O, and SCOPer—within the context of a thesis focused on HILARy's methodology and performance.

HILARy (Hierarchical clustering for Lineage Analysis of Repertoires): A method leveraging hierarchical clustering based on sequence similarity thresholds and V/J gene annotations. It is designed for high-throughput analysis of large-scale repertoire data, balancing computational efficiency with accurate lineage grouping.
partis: A probabilistic framework that uses hidden Markov models (HMMs) for BCR annotation and clustering. It estimates parameters from data for germline inference and somatic hypermutation (SHM) modeling, offering high accuracy at increased computational cost.
Change-O: A suite of tools for advanced BCR repertoire analysis. Its DefineClones.py script performs single-linkage clustering based on nucleotide or amino acid distance thresholds, often requiring prior annotation from tools like IMGT/HighV-QUEST.
SCOPer (Spectral Clustering for clOne Partitioning): An R package offering spectral and hierarchical clustering approaches to clonal partitioning. In this comparison it is applied to single-cell paired heavy and light chain data, where it clusters sequences while preserving natural pairings.

Table 1: Core Algorithm & Quantitative Performance Comparison

Tool	Core Algorithm	Primary Input	Key Strengths	Reported Accuracy* (F1-score/Precision)	Computational Demand
HILARy	Hierarchical clustering with adaptive thresholds	Annotated sequences (V/J, CDR3)	Speed, scalability for bulk data, intuitive thresholds	~0.92-0.95 (on simulated bulk data)	Low-Medium
partis	HMM-based probabilistic clustering	Raw FASTQ reads	High accuracy, integrated annotation/germline inference, SHM modeling	~0.96-0.98 (on simulated data)	High
Change-O	Single-linkage clustering	Annotated sequences (e.g., from IMGT)	Flexibility, integrates with extensive downstream analysis pipeline	~0.90-0.94 (depends on annotation source)	Low
SCOPer	Spectral clustering	Paired heavy-light chain sequences	Preserves natural pairings, effective for complex single-cell data	~0.94-0.97 (on paired-cell simulations)	Medium-High

*Accuracy metrics are approximate and dataset-dependent. Benchmarks typically use simulated repertoires with known ground truth.

Table 2: Contextual Application Suitability

Feature	HILARy	partis	Change-O	SCOPer
Optimal Data Type	Bulk Ig-seq (e.g., RNA)	Bulk Ig-seq from raw reads	Pre-annotated bulk sequences	Single-cell BCR-seq (paired)
Germline Inference	Requires external tool	Integrated, sophisticated	Requires external tool (e.g., IgBLAST)	Limited, often uses external
SHM Modeling	No	Yes, detailed	Post-hoc analysis (e.g., BASELINe)	Within inferred clones
Output Integration	Clonal tables	Clonal tables, annotated FASTA	Comprehensive Change-O/Immcantation formats	Clonal networks, pairings

Experimental Protocols

Protocol 1: Benchmarking Clonal Inference Accuracy Using Simulated Data

Objective: To quantitatively compare the clonal grouping performance of HILARy, partis, Change-O, and SCOPer.

Data Simulation: Use IGoR or AbSim to generate a synthetic BCR repertoire dataset with known, true clonal families. Include parameters for SHM frequency (~5-15%), diverse V/J gene usage, and varying clone sizes.
Tool Execution:
- HILARy: Run HILARy on the simulated sequences (pre-annotated with V/J genes and CDR3 regions using IgBLAST). Use default distance thresholds (e.g., nucleotide Hamming distance = 0.1).
- partis: Run partis partition directly on the raw simulated FASTQ reads.
- Change-O: Annotate simulated sequences with IgBLAST. Run DefineClones.py with distance threshold = 0.1 (nucleotide).
- SCOPer: For paired data simulation, run SCOPer spectral clustering on the concatenated heavy-light chain sequences.
Validation: Compare the inferred clusters from each tool to the ground truth simulation labels. Calculate precision, recall, and F1-score for clonal assignment.

Protocol 2: Processing Human PBMC Bulk BCR-seq Data

Objective: To apply each tool to real-world human peripheral blood mononuclear cell (PBMC) repertoire data.

Sample Prep & Sequencing: Isolate RNA from PBMCs, perform reverse transcription with gene-specific primers for IGH genes, and sequence on an Illumina platform (2x300 bp).
Preprocessing: Use pRESTO for read quality control, masking of primers/adapters, merging paired-end reads, and filtering out non-functional sequences.
Clonal Grouping Paths:
- Path A (HILARy/Change-O): Annotate merged FASTA with IgBLAST against IMGT reference. Run HILARy or Change-O's DefineClones.py on the output.
- Path B (partis): Provide the preprocessed (but unannotated) FASTA directly to partis partition.
Analysis: Compare the number of clones, clone size distributions, and SHM levels per clone across tools.

Protocol 3: Single-Cell BCR-seq Analysis with SCOPer and HILARy Adaptation

Objective: To evaluate clustering on paired heavy-light chain data.

Single-Cell Library: Generate data using 10x Genomics Chromium or similar platform yielding linked V(D)J sequences.
Cell Ranger: Use cellranger vdj for initial cell calling, assembly, and annotation of paired contigs.
Clonal Grouping: Feed the paired heavy-light chain FASTA files to SCOPer for spectral clustering that respects natural pairings.
Comparative Analysis: As a contrast, separate heavy chains from the pairs and run them through HILARy using standard bulk settings. Compare the resulting clone assignments and the preservation of light chain information.

Visualizations

Clonal Inference Tool Selection Workflow

HILARy Hierarchical Clustering Process

The Scientist's Toolkit: Key Research Reagent Solutions

Table 3: Essential Materials for BCR Clonal Inference Workflows

Item	Function & Application
IMGT/GENE-DB Reference	Gold-standard database of immunoglobulin gene alleles. Essential for accurate V(D)J gene annotation.
IgBLAST	Command-line tool from NCBI for aligning BCR sequences to germline references. Provides detailed annotation.
pRESTO Toolkit	Suite of Python scripts for preprocessing raw sequencing reads: quality filtering, merging, deduplication.
Synthetic BCR Libraries (e.g., from IGoR)	Generate ground-truth simulated repertoire data for benchmarking algorithm accuracy.
10x Genomics Chromium Single Cell V(D)J Kit	Commercial solution for generating linked heavy-light chain BCR sequences from single cells.
MiXCR	Alternative integrated software for end-to-end analysis (alignment, assembly, clustering). Useful for cross-validation.
Immcantation Database	Online resource and database schema for storing, sharing, and analyzing annotated immune repertoire data.

Within repertoire sequencing research, inferring B-cell or T-cell clonal families from high-throughput sequencing data is a foundational step. A variety of clustering methods exist, each with distinct algorithmic approaches. This Application Note provides a detailed analysis of the HILARy method, contrasting it with other prevalent techniques to guide researchers and drug development professionals in selecting the optimal tool for their experimental goals.

Comparison of Clonal Family Inference Methods

The following table summarizes the core characteristics, strengths, and limitations of HILARy against other common methods.

Table 1: Quantitative and Qualitative Comparison of Clustering Methods

Method	Core Algorithm	Primary Strength	Key Limitation	Optimal Use Case
HILARy	Expectation-Maximization on V(D)J junctions + Phylogeny	Integrates lineage tree likelihood; models hypermutation.	Computationally intensive.	Best for somatic hypermutation (SHM)-rich repertoires (e.g., antigen-experienced B cells).
Change-O (DEFINE) / GLIPH2	Hierarchical clustering on Hamming distance / TCR motif	Fast, highly sensitive to small clones.	Ignores SHM; may split clones with high mutation.	Initial broad screening; TCR specificity groups.
Partis	Hidden Markov Model (HMM) on full V(D)J	High accuracy annotating V/D/J and inferring naive ancestor.	High resource demand for large datasets.	Detailed annotation and naive sequence reconstruction.
Decombinator / mixcr	Rule-based CDR3 identification + clustering	Extremely fast, standardized pipeline.	Less accurate for highly mutated sequences.	High-volume initial processing and annotation.

Table 2: Performance Metrics on Benchmark Datasets*

Method	Precision (Mean)	Recall (Mean)	F1-Score (Mean)	Avg. Runtime (10^5 seqs)
HILARy	0.95	0.88	0.91	~8 hours
Change-O (DEFINE)	0.91	0.85	0.88	~15 minutes
Partis	0.97	0.90	0.93	~6 hours
mixcr	0.89	0.92	0.90	~10 minutes

*Synthetic benchmark data simulating human B-cell repertoires with varying SHM levels (0-15%). Runtime is approximate and system-dependent.

Detailed Protocol for HILARy Clonal Inference

This protocol is designed for B-cell receptor (BCR) heavy chain repertoire sequencing data.

I. Preprocessing and Input Preparation

Sequence Annotation: Use IgBLAST or mixcr to align sequences to IMGT reference genes. Output must include: (a) V, D, J gene calls, (b) nucleotide CDR3 sequence, (c) alignment details.
Data Formatting: Convert annotations to HILARy's required format (FASTA with IMGT-gapped sequences). The CreateGermlines.py tool (from Change-O suite) can infer the germline V segment sequence for each read.
Quality Filtering: Remove sequences with stop codons in CDR3, low alignment scores, or non-productive rearrangements.

II. Running HILARy Clustering

Critical Parameters:

--dist: Initial Hamming distance threshold for pre-clustering (CDR3 nucleotide). Adjust based on error rate.
--iter: Maximum number of EM iterations. Increase for complex datasets.
--collapse: Collapse unique sequences while preserving duplication counts.
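To make the collapsing and distance-threshold ideas concrete, here is a minimal sketch of both steps in plain Python. It is not HILARy's implementation. It assumes that the threshold is a length-normalized Hamming distance and that pre-clustering uses single linkage among equal-length CDR3s, and all function names are hypothetical.

```python
from collections import Counter


def collapse(sequences):
    """Collapse identical sequences, keeping duplication counts."""
    return Counter(sequences)


def normalized_hamming(a, b):
    """Fraction of mismatched positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def precluster(cdr3s, threshold):
    """Single-linkage grouping of unique CDR3s with distance <= threshold."""
    unique = sorted(set(cdr3s))
    parent = {s: s for s in unique}

    def find(s):
        # Union-find lookup with path halving
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    for i, a in enumerate(unique):
        for b in unique[i + 1:]:
            if len(a) == len(b) and normalized_hamming(a, b) <= threshold:
                parent[find(a)] = find(b)

    groups = {}
    for s in unique:
        groups.setdefault(find(s), []).append(s)
    return sorted(groups.values())


# Toy CDR3 nucleotide reads (hypothetical), 12 nt each.
reads = [
    "TGTGCGAGAGAT",
    "TGTGCGAGAGAT",  # exact duplicate, collapsed with a count of 2
    "TGTGCGAGAGAC",  # 1 mismatch from the first read
    "TGTGCGAAAGAC",  # 1 mismatch from the previous read, 2 from the first
    "TGTACCTTTGGG",  # unrelated
]
counts = collapse(reads)
clusters = precluster(reads, threshold=0.10)
print(counts["TGTGCGAGAGAT"], clusters)
```

With single linkage, the first and fourth reads end up together even though they differ at 2 of 12 positions, because each is within the threshold of the third. This chaining is the reason the threshold should track the expected error and mutation rate.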

III. Post-processing and Output Interpretation

Output Files: Key outputs include *_clones.txt (clone assignments) and *_trees.json (lineage trees per clone).
Validation: Assess clone size distribution. Unexpectedly many singletons may indicate overly stringent clustering.
Downstream Analysis: Use Dowser (compatible toolkit) to analyze and visualize the inferred phylogenetic trees for SHM patterns and selection pressure.
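The clone size check described under Validation can be sketched as follows. The layout of the clone assignment file is not specified here, so the example starts from an in-memory mapping of sequence ID to clone ID; that mapping, the function name, and the toy values are all assumptions for illustration.

```python
from collections import Counter


def clone_size_summary(assignments):
    """Summarise clone sizes from a {sequence_id: clone_id} mapping."""
    sizes = Counter(assignments.values())      # clone_id -> number of sequences
    size_histogram = Counter(sizes.values())   # clone size -> number of clones
    n_clones = len(sizes)
    singletons = size_histogram.get(1, 0)
    return {
        "n_clones": n_clones,
        "largest_clone": max(sizes.values()) if sizes else 0,
        "singleton_fraction": singletons / n_clones if n_clones else 0.0,
        "size_histogram": dict(size_histogram),
    }


# Toy assignments (hypothetical IDs): one expanded clone and three singletons.
assignments = {
    "s1": "c1", "s2": "c1", "s3": "c1",
    "s4": "c2", "s5": "c3", "s6": "c4",
}
summary = clone_size_summary(assignments)
print(summary)
```

What counts as too many singletons depends on the sample. Comparing the singleton fraction across a few threshold settings is usually more informative than judging a single run in isolation.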

Visualization of Method Selection and Workflow

Selection Logic for Clustering Methods

HILARy Experimental Workflow

The Scientist's Toolkit: Essential Research Reagents & Software

Table 3: Key Resources for HILARy-based Repertoire Analysis

Item	Function & Relevance
IMGT/GENE-DB Reference Database	Gold-standard reference for V, D, J gene alleles. Essential for accurate initial sequence alignment.
IgBLAST or mixcr	Software for performing the initial V(D)J alignment and annotation. Creates the necessary input for HILARy.
Change-O Toolkit Suite	Provides essential utilities (`DefineClones.py`, `CreateGermlines.py`) for data reformatting and pre-clustering.
HILARy Software Package	Core software implementing the expectation-maximization and phylogenetic inference algorithm.
Dowser Package	Specialized R package for analyzing and visualizing phylogenetic trees output by HILARy.
Synthetic Benchmark Datasets	Known-truth datasets (e.g., from `AbSynth`) for validating pipeline performance and tuning parameters.
High-Memory Compute Node	HILARy's EM algorithm is memory and CPU intensive; >32GB RAM and multiple cores are recommended.

HILARy is the method of choice when the research question centers on the phylogenetic history and somatic hypermutation patterns of B-cell clonal families, such as in studies of affinity maturation, vaccine response, or chronic infection. Its primary strength is integrating clonal partitioning with lineage tree inference, offering a biologically nuanced model at the cost of computational speed. For rapid, large-scale screening or analysis of minimally mutated repertoires (e.g., naive B cells or most TCR studies), faster methods like Change-O or mixcr are more appropriate. The selection framework and protocols provided here enable informed methodological decisions in repertoire sequencing research.

1. Introduction and Thesis Context

This Application Note details a methodology for evaluating the consistency of clonal family inference tools, a central challenge in B-cell repertoire sequencing (Rep-Seq) analysis. The study is framed within a broader thesis on HILARy clonal family inference, which posits that methodological discrepancies in clonotyping significantly impact downstream biological interpretations, such as tracking vaccine-induced B-cell lineages or identifying therapeutic antibody candidates. We apply multiple publicly available clonotyping tools to a public COVID-19 Rep-Seq dataset to assess concordance.

2. Data Source and Pre-processing

Dataset: ARchive of B-cell Immunoglobulin Sequences (AbSeq) from COVID-19 convalescent patients (e.g., Study PRJNA629089). Raw FASTQ files for IgG+ memory B-cell repertoires were downloaded.
Pre-processing Protocol:
- Quality Control & Merging: Use fastp (v0.23.2) with parameters --detect_adapter_for_pe --merge --merged_out to trim adapters, remove low-quality bases (Q<20), and merge paired-end reads.
- Alignment & Assembly: Align merged reads to IMGT reference V, D, J genes using IgBLAST (v1.19.0) with the -organism human flag. Generate AIRR-compliant Rearrangement tables (.tsv).
- Data Curation: Keep productive sequences only (productive is true, junction_aa begins with the conserved 'C', no stop codons). Remove sequences with low-confidence V gene assignment (v_identity < 0.95).
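The curation step can be sketched as a small filter over an AIRR Rearrangement table. The field names (productive, stop_codon, junction_aa, v_identity) follow the AIRR schema. The sketch assumes v_identity is stored as a fraction between 0 and 1; some annotators report a percentage, in which case the threshold needs rescaling. The helper names and the inline example table are hypothetical.

```python
import csv
import io


def _is_true(value):
    """AIRR TSV booleans are usually written as T/F; accept common spellings."""
    return (value or "").strip().lower() in {"t", "true", "1"}


def keep_record(row, min_v_identity=0.95):
    """Return True if an AIRR rearrangement row passes the curation filters."""
    if not _is_true(row.get("productive")):
        return False
    if _is_true(row.get("stop_codon")):
        return False
    junction_aa = row.get("junction_aa") or ""
    # Junction should start with the conserved cysteine and contain no stop (*)
    if not junction_aa.startswith("C") or "*" in junction_aa:
        return False
    try:
        v_identity = float(row.get("v_identity") or "")
    except ValueError:
        return False  # missing or malformed identity value
    return v_identity >= min_v_identity


def filter_airr(handle, min_v_identity=0.95):
    """Read a tab-separated AIRR table and return the rows that pass."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row for row in reader if keep_record(row, min_v_identity)]


# Inline toy table (hypothetical records) standing in for the IgBLAST output.
example_tsv = (
    "sequence_id\tproductive\tstop_codon\tjunction_aa\tv_identity\n"
    "seq1\tT\tF\tCARDYW\t0.97\n"   # passes every filter
    "seq2\tF\tF\tCARDYW\t0.99\n"   # non-productive
    "seq3\tT\tT\tCAR*YW\t0.98\n"   # stop codon
    "seq4\tT\tF\tCARGGW\t0.90\n"   # V identity below threshold
)
kept = filter_airr(io.StringIO(example_tsv))
print([row["sequence_id"] for row in kept])
```

A strict V identity cutoff also discards heavily mutated sequences. For IgG+ memory repertoires it is worth checking how many reads the last filter removes before committing to it.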

3. Clonal Inference Tool Application Protocol

Three clonotyping tools representing different algorithmic approaches were applied to the same pre-processed AIRR .tsv file. A repertoire simulator was used alongside them to provide synthetic ground truth for comparison.

Tool 1: Change-O (DefineClones.py) - Single-linkage hierarchical clustering.
- Command: DefineClones.py -d <input.tsv> --act set --model ham --norm len --dist 0.10
- Key Parameter: Length-normalized Hamming distance threshold (0.10, nucleotide).
Tool 2: scoper (spectralClones) - Spectral clustering (with a k-means step) on junction sequence similarity.
- R Script:
Tool 3: partis (v0.17.0) - HMM-based annotation and clustering.
- Command: partis partition --infname input.fasta --outfname partis_output.yaml
Synthetic control: immuneSIM (for synthetic ground truth comparison) - In silico repertoire generation.
- R Script:

4. Results and Quantitative Comparison

Key metrics were extracted from each tool's output for the top 10 most expanded clones (by read count) in a sample.

Table 1: Clonal Assignment Concordance Across Tools

Clone Rank (by Tool1 Count)	Tool1 (Change-O) Clone ID	Tool2 (scoper) Clone ID	Tool3 (partis) Clone ID	Sequences in Intersection	% Agreement (Pairwise, Tool1 vs.)
1	Clone_1	Cluster_A	Group_alpha	1250	89% (vs. Tool2), 78% (vs. Tool3)
2	Clone_2	Cluster_B	Group_beta	980	95% (vs. Tool2), 82% (vs. Tool3)
3	Clone_3	Cluster_C	Group_gamma	450	75% (vs. Tool2), 65% (vs. Tool3)
...	...	...	...	...	...
Aggregate (Top 10)	10 distinct	10 distinct	14 distinct	-	Avg: 86% (T1vT2), 75% (T1vT3)
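The table does not define how the pairwise agreement percentage was computed. One plausible reading is the share of a reference clone's sequences that land in its single best-matching cluster in the other tool. The sketch below implements that reading only; the function name, the sequence IDs, and the toy labels are assumptions for illustration.

```python
from collections import Counter


def best_match_agreement(reference, other, clone_id):
    """Share of a reference clone's sequences found in its best-matching
    cluster in `other`. Both inputs map sequence_id -> cluster label."""
    # Only sequences assigned by both tools can be compared
    members = [s for s, c in reference.items() if c == clone_id and s in other]
    if not members:
        return None, 0.0
    label, hits = Counter(other[s] for s in members).most_common(1)[0]
    return label, hits / len(members)


# Toy assignments (hypothetical): the second tool splits off one sequence.
tool1 = {"s1": "Clone_1", "s2": "Clone_1", "s3": "Clone_1",
         "s4": "Clone_1", "s5": "Clone_1", "s6": "Clone_2"}
tool2 = {"s1": "Cluster_A", "s2": "Cluster_A", "s3": "Cluster_A",
         "s4": "Cluster_A", "s5": "Cluster_B", "s6": "Cluster_B"}

label, agreement = best_match_agreement(tool1, tool2, "Clone_1")
print(label, agreement)
```

This measure is asymmetric. It penalizes the other tool for splitting the reference clone but not for merging it with unrelated sequences, so it is best reported in both directions or alongside a symmetric score.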

Table 2: Tool Performance and Runtime Metrics

Tool	Algorithm Type	Key Distance Metric	Computational Time (per 10k seq)	Memory Peak (GB)	Outputs AIRR Format?
Change-O	Hierarchical Clustering	Hamming (nucleotide)	~2 min	1.2	Yes
scoper	Spectral Clustering	Hamming (AA)	~5 min	2.5	Yes
partis	HMM Gluing	Phylogenetic	~45 min	8.0	No (custom YAML)

5. Visualization of Analysis Workflow

Title: Workflow for Multi-Tool Clonotype Consistency Study

6. The Scientist's Toolkit: Research Reagent Solutions

Item/Category	Example Product/Software	Function in HILARy Clonal Inference
Rep-Seq Wet-Lab Kit	10x Genomics Chromium Next GEM Single Cell 5' v2	Enables linked V(D)J and gene expression profiling from single B cells.
Sequence Annotation	IMGT/HighV-QUEST, IgBLAST	Provides standardized germline V/D/J gene assignment and sequence annotation.
Clonal Grouping Tool	Change-O, scoper, partis, DADA2 (for denoising)	Algorithms to cluster sequences originating from the same progenitor B cell.
Analysis Suite	Immcantation Portal (pRESTO, Change-O, alakazam)	A standardized pipeline suite for Rep-Seq data from raw reads to statistical analysis.
Synthetic Control	immuneSIM, OLGA	Generates in silico repertoires with known clonal relationships to benchmark tools.
Visualization & Reporting	Dowser (for lineage trees), ggplot2 (R), AIRR Community Python libs	Enables visualization of clonal lineages, diversity metrics, and publication-quality figures.
Data Standard	AIRR Data Representation Standard	Critical schema for data sharing and ensuring interoperability between different tools.

Conclusion

HILARy provides a robust and conceptually clear framework for inferring B cell clonal families from repertoire sequencing data, which is essential for deciphering adaptive immune responses. This guide has moved from foundational biology through practical implementation, optimization, and validation. The key takeaway is that successful clonal inference requires three things working together: a well-understood algorithm like HILARy, rigorous data preprocessing with parameters tuned to the biological question, and validation against benchmarks. For biomedical research, accurate clonal tracing is no longer a niche bioinformatics task but a critical component of discovering broadly neutralizing antibodies, understanding dysregulation in cancer and autoimmunity, and evaluating vaccine efficacy at the clonal level. Future directions include integrating single-cell multi-omics data, applying machine learning to refine lineage relationships, and developing standardized benchmarking platforms that move the field towards more reproducible and clinically actionable insights.
