Decoding Sepsis Complexity: A Guide to MOFA+ for Immune Cell Heterogeneity Analysis in Drug Development

Henry Price, Jan 12, 2026

Abstract

This article provides a comprehensive guide for biomedical researchers and drug development professionals on applying Multi-Omics Factor Analysis plus (MOFA+) to dissect immune cell heterogeneity in sepsis. We cover the foundational rationale for using this advanced statistical tool in sepsis research, detail a step-by-step methodological workflow from data integration to interpretation, address common troubleshooting and optimization challenges specific to immunological datasets, and validate findings through comparison with alternative methods. The guide synthesizes current best practices to enable robust identification of latent cellular states and molecular drivers, ultimately aiming to accelerate the discovery of novel therapeutic targets and biomarkers for this complex syndrome.

Why MOFA+? Unraveling Sepsis Immune Dysregulation Through Multi-Omics Integration

Application Note: MOFA+ for Unifying Multi-Omics Views of Sepsis Immune Dysfunction

Thesis Context: Sepsis-induced immunosuppression and heterogeneous patient outcomes stem from complex, multi-layered dysregulation across cell types. Single-cell RNA-seq (scRNA-seq) has revealed transcriptomic heterogeneity but provides an incomplete picture. A multi-omics approach, integrating scRNA-seq with surface proteomics, chromatin accessibility, and methylation data, is critical to delineate the regulatory axes driving immune cell dysfunction. This application note details the use of MOFA+ (Multi-Omics Factor Analysis v2) as a robust statistical framework for integrative analysis of such matched multi-omics data from septic patient samples, moving beyond the limitations of single-modality studies.

Core Challenge: In a recent cohort of 15 septic patients (8 survivors, 7 non-survivors) and 5 healthy controls, PBMCs were profiled using CITE-seq (scRNA-seq + 25 surface protein markers) and a subset (n=10 patients) with single-cell ATAC-seq. Univariate analyses failed to explain outcome variance.

MOFA+ Application: Data matrices (cells x features) for each modality (RNA, ADT, ATAC) were integrated using MOFA+. The model identified 5 latent factors (LFs) explaining cross-omics variance.

Table 1: Key Latent Factors Identified by MOFA+ in Sepsis Cohort

| Latent Factor | Variance Explained | Key Associated Features | Clinical & Biological Interpretation |
|---|---|---|---|
| LF1 | 34% (RNA), 41% (ADT), 22% (ATAC) | RNA: HLA-DRA↓, CD74↓. ADT: HLA-DR↓, CD86↓. ATAC: Open chromatin near CIITA gene↓. | Global monocyte dysfunction / MHC-II shutdown. Strongly correlated with mortality (p=0.002). |
| LF2 | 18% (RNA), 15% (ADT), 8% (ATAC) | RNA: GZMB↑, GNLY↑. ADT: CD56↑, CD16+. ATAC: Accessibility in NK cell effector loci↑. | NK cell activation continuum. High scores linked to secondary infection risk. |
| LF3 | 12% (RNA), 5% (ADT), 30% (ATAC) | RNA: IL7R↑, CCR7↑. ADT: CD45RA+, CD95-. ATAC: Open chromatin in TCF7 locus↑. | Naïve T cell preservation. Associated with survival and recovery of immune competence. |
| LF4 | 8% (RNA), 22% (ADT), 10% (ATAC) | RNA: S100A8/9↑, CXCR2↑. ADT: CD11b↑, CD66b+. ATAC: Myeloid enhancer accessibility↑. | Immature neutrophil inflammation signature. Correlated with early organ failure score. |
| LF5 | 5% (RNA), 10% (ADT), 5% (ATAC) | RNA: PDCD1↑, LAG3↑. ADT: PD-1↑, TIM-3+. ATAC: Accessibility in exhaustion loci. | T cell exhaustion program. Not directly outcome-linked, but modified by LF1. |

Conclusion: MOFA+ integration revealed that mortality-linked immunosuppression (LF1) is a multi-omics program involving coordinated transcriptional, surface-protein, and epigenetic changes that scRNA-seq alone cannot resolve. This identifies HLA-DR expression as a multi-omics node and provides a stratified map for targeted therapy.


Protocol 1: Generation of Matched Single-Cell Multi-Omics Data from Septic Patient PBMCs

Objective: To generate high-quality, matched single-cell RNA-seq, protein expression (ADT), and chromatin accessibility (ATAC-seq) data from fresh PBMCs of septic patients for MOFA+ integration.

Materials:

  • Fresh whole blood (<2hrs from draw) from septic patients (meeting Sepsis-3 criteria) and healthy controls.
  • 10X Genomics Chromium Next GEM Single Cell Multiome ATAC + Gene Expression kit.
  • BD Rhapsody system with AbSeq protein labeling panels (custom panel of 25 immune surface markers).
  • Ficoll-Paque PLUS for PBMC isolation.
  • Nuclei Isolation Kit (for ATAC partitioning).
  • Buffer EB (Elution Buffer) for library elution.
  • Bioanalyzer/TapeStation and Qubit for QC.

Procedure:

  • PBMC Isolation & Viability: Isolate PBMCs via density gradient centrifugation. Assess viability (>95% via trypan blue). Count cells.
  • Cell Partitioning & Barcoding:
    • For Multiome (ATAC + GEX): Follow 10X Multiome kit protocol. Briefly, tagment nuclei with transposase, partition into Gel Beads-in-emulsion (GEMs), and perform shared barcoding for linked ATAC and cDNA libraries from the same nucleus.
    • For CITE-seq/AbSeq (GEX + ADT): Use remaining aliquot. Stain live cells with conjugated antibody-derived tags (ADTs) from BD AbSeq panel. Wash thoroughly. Partition into wells/beads per BD Rhapsody protocol for joint mRNA and ADT library prep.
  • Library Preparation & Sequencing: Generate libraries per manufacturer instructions. Pool libraries at appropriate molar ratios. Sequence on Illumina NovaSeq: Multiome (ATAC: 50bp paired-end; GEX: 50bp single-end), CITE-seq (GEX: 150bp paired-end; ADT: 50bp single-end).

Protocol 2: MOFA+ Integration Workflow for Sepsis Multi-Omics Data

Objective: To integrate scRNA-seq, ADT, and scATAC-seq data matrices from matched samples using MOFA+.

Software & Packages: R (v4.2+), MOFA2 package, Seurat, Signac, Matrix.

Procedure:

  • Preprocessing & Feature Selection:
    • scRNA-seq (GEX): Process with Seurat. Filter cells (500 < nFeature_RNA < 5000; percent.mt < 20%). Normalize (SCTransform). Retain the top 3000 highly variable genes.
    • ADT Data: Center log-ratio (CLR) normalize protein counts per cell. Use all 25 markers.
    • scATAC-seq: Process with Signac. Call peaks using MACS2 on aggregated data. Create a cell x peak binary matrix. Filter for peaks present in >10 cells. Retain top 50,000 most accessible peaks.
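  • Creating the MOFA+ Object: Convert each modality (RNA, ADT, ATAC) to a matrix with features in rows and cells in columns, the orientation create_mofa() expects for a list of matrices. Use identical cell IDs as column names across modalities; MOFA+ can treat cells that lack a modality as missing for that view, but the IDs of cells measured in several modalities must agree. Use create_mofa() to build the object.
  • Model Training & Factor Inference: Retrieve and adjust the defaults with get_default_data_options(), get_default_model_options() (set num_factors; suggested start: 10-15), and get_default_training_options() (e.g., maxiter, seed, and drop_factor_threshold to prune factors that explain little variance), then pass them to prepare_mofa(). MOFA2 does not offer a built-in held-out test split; to guard against overfitting and unstable solutions, train several models with different seeds and compare them (compare_elbo, compare_factors). Train the model with run_mofa().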
  • Downstream Analysis:
    • Factor Characterization: Correlate factor values with clinical metadata (e.g., outcome, SOFA score).
    • Feature Weights: Extract top-weighted features for each factor and modality using get_weights.
    • Annotation: Annotate factors by correlating with known cell type (from RNA) or pathway markers.
    • Visualization: Plot factor values per sample/group (plot_factor for single factors, plot_factors for pairs), heatmaps of top features (plot_data_heatmap), and correlations between factors (plot_factor_cor) to check that factors are not redundant.
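
The cell filtering and ID-matching steps of this workflow can be sketched in a few lines. This is a minimal, library-free illustration rather than the Seurat/MOFA2 code itself: the thresholds are the ones quoted in the preprocessing step, while the barcodes, the QC values, and the helper names (passes_rna_qc, shared_cells) are hypothetical.

```python
def passes_rna_qc(n_features, percent_mt,
                  min_features=500, max_features=5000, max_percent_mt=20.0):
    """Return True if a cell passes the RNA QC thresholds quoted in the protocol."""
    return min_features < n_features < max_features and percent_mt < max_percent_mt


def shared_cells(*barcode_lists):
    """Return barcodes present in every modality, in the order of the first list."""
    common = set(barcode_lists[0]).intersection(*map(set, barcode_lists[1:]))
    return [bc for bc in barcode_lists[0] if bc in common]


# Hypothetical per-cell QC metrics: barcode -> (nFeature_RNA, percent.mt)
rna_qc = {
    "AAAC-1": (1200, 4.5),
    "AAAG-1": (350, 3.0),    # too few detected genes
    "AACT-1": (2500, 27.0),  # too much mitochondrial signal
    "AAGT-1": (3100, 8.2),
}
# Hypothetical barcodes that also have ADT measurements
adt_barcodes = ["AAAC-1", "AAGT-1", "AATT-1"]

rna_kept = [bc for bc, (nf, mt) in rna_qc.items() if passes_rna_qc(nf, mt)]
matched = shared_cells(rna_kept, adt_barcodes)
print(matched)
```

Restricting to shared barcodes is the strictest option; whether to keep cells that are missing a modality instead is a design choice that depends on how much of the cohort would otherwise be lost.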

Visualizations

Diagram 1: MOFA+ Integration Workflow for Sepsis Multi-Omics

[Diagram summary: a matched patient PBMC sample yields three assays: scRNA-seq (gene expression), surface protein (AbSeq/CITE-seq), and scATAC-seq (chromatin accessibility). After preprocessing and feature selection, these become cells x genes, cells x proteins, and cells x peaks matrices, which are combined into a multi-omics MOFA+ object. Model training identifies latent factors (integrated drivers), which are interpreted through factor annotation and correlation with clinical metadata (outcome, SOFA).]

Diagram 2: Multi-Omics Characterization of a Sepsis Latent Factor

[Diagram summary: latent factor 1 (mortality linked) carries high weights in three views: transcriptome (HLA-DRA ↓, CD74 ↓), surface proteome (HLA-DR protein ↓, CD86 ↓), and epigenome (CIITA locus accessibility ↓). Together these define a cellular phenotype of monocyte dysfunction and anergy, which is linked to increased mortality.]


The Scientist's Toolkit: Key Research Reagents & Solutions

Table 2: Essential Materials for Sepsis Multi-Omics Research

| Item | Function & Application | Example/Provider |
|---|---|---|
| 10X Chromium Next GEM Single Cell Multiome ATAC + Gene Expression | Simultaneous profiling of chromatin accessibility (ATAC) and gene expression (RNA) from the same single nucleus/cell. Enables direct cis-regulatory linkage. | 10x Genomics (Cat# 1000285) |
| BD Rhapsody System with AbSeq Panels | High-parameter single-cell analysis platform allowing combined mRNA and targeted surface protein (ADT) quantification. Custom panels for immune monitoring. | BD Biosciences |
| Cell Hashtag Oligonucleotides (HTOs) | For sample multiplexing. Allows pooling of samples from multiple patients/conditions pre-processing, reducing batch effects and costs. | BioLegend (TotalSeq-A/C) |
| Nuclei Isolation Kit | Gentle, optimized lysis of cytoplasm for nuclei isolation, critical for high-quality snRNA-seq or Multiome ATAC+Exp workflows. | 10x Genomics (Cat# 1000494) or Miltenyi |
| Ficoll-Paque PLUS | Density gradient medium for reliable isolation of viable PBMCs from whole blood of septic patients and controls. | Cytiva |
| DNA Clean & Concentrator Magnetic Beads | For efficient size selection and clean-up of ATAC-seq and sequencing libraries. Essential for removing adapter dimers. | Zymo Research |
| Next-Generation Sequencing Kits | High-output, paired-end sequencing reagents for generating sufficient depth across multi-omics libraries. | Illumina NovaSeq 6000 S4 Reagent Kit |

In the context of a broader thesis on applying MOFA+ to immune cell heterogeneity in sepsis research, this document outlines the core principles of the Multi-Omics Factor Analysis+ (MOFA+) framework. Sepsis is characterized by a dysregulated host response to infection, involving profound immune cell heterogeneity. MOFA+ is a statistical model designed to disentangle this complexity by integrating multiple omics data types (e.g., transcriptomics, proteomics, epigenetics) measured on the same samples, revealing coordinated sources of variation (latent factors) driving biological and clinical phenotypes.

Core Principles of MOFA+

MOFA+ is a Bayesian group factor analysis model. Its core principles are:

  • Multi-View Integration: It learns a set of latent factors that are shared across multiple "views" (omics data sets) while allowing for view-specific variation.
  • Sparsity: It employs automatic relevance determination (ARD) priors to infer the number of relevant factors and to promote sparsity, meaning each factor influences only a subset of features (e.g., genes, proteins) and/or views.
  • Interpretability: The learned factors are interpretable as biological or technical sources of variation (e.g., immune cell activation, batch effects, clinical covariates).
  • Handling Heterogeneity: It robustly handles heterogeneous data types (continuous, count, binary) through appropriate likelihoods (Gaussian, Poisson, Bernoulli).

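The model assumes that the observed data matrix for view $m$, $\mathbf{Y}^{m}$, is a linear function of a low-dimensional latent matrix $\mathbf{Z}$ (factors) and view-specific weight matrices $\mathbf{W}^{m}$, plus noise $\mathbf{E}^{m}$:

$$\mathbf{Y}^{m} = \mathbf{Z}\,(\mathbf{W}^{m})^{\top} + \mathbf{E}^{m}$$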

MOFA+ infers:

  • Z (Latent Factors): Low-dimensional representation of samples.
  • W (Weights): Importance of each feature for each factor in each view.
  • Θ (Precision parameters): Model noise and sparsity.
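
To make the roles of Z and W concrete, the factor model described above can be simulated for a single view and the variance explained by each factor computed directly. This is an illustrative sketch under stated assumptions, not MOFA+ inference: the true Z and W are known here rather than estimated, the dimensions and noise level are arbitrary, and R² is taken as one minus the residual sum of squares over the total sum of squares for approximately centered data.

```python
import random


def matmul_zwt(Z, W):
    """Return Z @ W.T for Z (samples x factors) and W (features x factors)."""
    return [[sum(z * w for z, w in zip(z_row, w_row)) for w_row in W]
            for z_row in Z]


def variance_explained(Y, Z, W, factor):
    """R^2 of one factor: 1 - SS(Y - z_k w_k^T) / SS(Y)."""
    ss_total = sum(y * y for row in Y for y in row)
    ss_resid = sum(
        (Y[n][d] - Z[n][factor] * W[d][factor]) ** 2
        for n in range(len(Y))
        for d in range(len(Y[0]))
    )
    return 1.0 - ss_resid / ss_total


rng = random.Random(0)
n_samples, n_features, n_factors = 50, 20, 2

# Latent factors: one value per sample and factor.
Z = [[rng.gauss(0, 1) for _ in range(n_factors)] for _ in range(n_samples)]
# Sparse weights: factor 0 loads on the first 10 features, factor 1 on the last 5.
W = [[2.0 if d < 10 else 0.0, 1.0 if d >= 15 else 0.0] for d in range(n_features)]

# Observed view = low-rank signal plus Gaussian noise.
signal = matmul_zwt(Z, W)
Y = [[s + rng.gauss(0, 0.5) for s in row] for row in signal]

r2 = [variance_explained(Y, Z, W, k) for k in range(n_factors)]
print([round(v, 2) for v in r2])
```

Because factor 0 has larger weights on more features, it accounts for most of the variance in this toy view, which mirrors how a dominant factor shows up in a variance-explained plot.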

Application Notes for Sepsis Immune Heterogeneity

Data simulated based on typical sepsis omics integration studies.

| Latent Factor (LF) | Variance Explained (R²) - Transcriptomics | Variance Explained (R²) - Proteomics | Top Associated Features (Gene/Protein) | Correlation with Clinical Trait (e.g., SOFA Score) |
|---|---|---|---|---|
| LF1 (Inflammatory Response) | 22% | 18% | IL1B, TNF, S100A8 | r = 0.75 (p<0.001) |
| LF2 (Immune Suppression) | 15% | 12% | PDCD1, CTLA4, ARG1 | r = -0.60 (p<0.001) |
| LF3 (Granulocyte Signature) | 10% | 5% | MPO, ELANE, CXCR2 | r = 0.30 (p=0.02) |
| LF4 (Batch Effect) | 25% | 22% | - | r = 0.05 (p=0.65) |

Table 2: Key Research Reagent Solutions for Sepsis Multi-Omic Profiling

| Reagent / Material | Function / Explanation |
|---|---|
| PBMC Isolation Kit (e.g., Ficoll-Paque) | Density gradient medium for isolating peripheral blood mononuclear cells from whole blood of sepsis patients and controls. |
| Single-Cell RNA-Seq Kit (e.g., 10x Genomics Chromium) | Enables high-throughput transcriptomic profiling of individual immune cells to assess heterogeneity. |
| Olink Target 96/384 Inflammation Panel | Multiplex immunoassay for precise, high-sensitivity quantification of inflammatory proteins in plasma. |
| CITE-seq Antibody Panel (TotalSeq) | Allows simultaneous measurement of surface protein abundance and transcriptome in single cells. |
| ATAC-Seq Kit (Assay for Transposase-Accessible Chromatin) | Profiles genome-wide chromatin accessibility to infer regulatory state of immune cells. |
| MOFA+ R/Python Package | The core computational tool for integrating the above omics data sets and performing factor analysis. |

Experimental Protocols

Protocol 5.1: Generating Multi-Omic Data from Sepsis Patient Samples

Objective: To generate transcriptomic and proteomic data from matched PBMC and plasma samples for MOFA+ integration.

Materials: See Table 2.

Procedure:

  • Sample Collection: Collect whole blood from sepsis patients (day 1, 3, 7) and healthy donors in EDTA tubes.
  • PBMC Isolation: Layer blood onto Ficoll-Paque. Centrifuge at 400 × g for 30-40 min at room temp (brake off). Harvest PBMC layer, wash twice with PBS.
  • Plasma Collection: Centrifuge whole blood at 2000 × g for 10 min at 4°C. Aliquot supernatant (plasma) and store at -80°C.
  • RNA Sequencing (Bulk): Extract total RNA from 1x10^6 PBMCs using a column-based kit. Assess quality (RIN > 7). Prepare libraries using a poly-A selection protocol. Sequence on an Illumina platform to a depth of 30 million paired-end reads per sample.
  • Proteomics (Plasma): Deplete top 14 abundant plasma proteins using an immunoaffinity column. Digest proteins with trypsin. Analyze peptides by liquid chromatography-tandem mass spectrometry (LC-MS/MS) using a data-independent acquisition (DIA) method.
  • Data Preprocessing: Map RNA-seq reads to the human genome (GRCh38) and quantify gene-level counts. Process DIA-MS data using a spectral library to obtain protein intensities. Log-transform and normalize all data matrices.

Protocol 5.2: Running MOFA+ on Sepsis Multi-Omic Data

Objective: To integrate processed transcriptomics and proteomics data and infer latent factors.

Software: MOFA+ (MOFA2 R package, version 1.8.0 or later).

Procedure:

  • Data Input: Create a MultiAssayExperiment object in R containing two assays: "RNA" (normalized log-counts matrix) and "Proteomics" (log-intensity matrix). Rows are features, columns are matched samples.
  • MOFA Object Creation: Run create_mofa(data) and inspect the object structure.
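  • Model Options: Retrieve the defaults with get_default_training_options() and get_default_model_options(). In the training options, set maxiter to 1000 and choose a convergence_mode ("fast", "medium", or "slow"), which sets the ELBO convergence threshold. In the model options, set "gaussian" likelihoods for both views. Pass both option lists to prepare_mofa() before training.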
  • Model Training: Run run_mofa(mofa_object, outfile = "results.hdf5").
  • Factor Analysis: Examine the proportion of variance explained (R²) per view. Use plot_variance_explained(model).
  • Factor Interpretation: Correlate latent factors (get_factors(model)) with clinical metadata (e.g., SOFA score, survival) using Spearman correlation. Identify top feature weights (get_weights(model)) for each factor and perform gene set enrichment analysis (GSEA) on top-weighted genes.
  • Downstream Analysis: Use factors as continuous covariates in survival models or for patient stratification via clustering.
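
The factor-interpretation step relies on Spearman correlation between factor values and a clinical variable. The sketch below shows the calculation itself (Pearson correlation of average ranks) in plain Python; the factor values and SOFA scores are made-up numbers for illustration, and in practice a statistics library would also supply a p-value.

```python
def rank(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on ranks."""
    return pearson(rank(x), rank(y))


# Hypothetical per-patient values of one latent factor and matching SOFA scores.
factor1 = [-1.2, -0.4, 0.1, 0.8, 1.5, 2.1]
sofa = [2, 4, 3, 8, 9, 12]

rho = spearman(factor1, sofa)
print(round(rho, 3))
```

Rank-based correlation is a reasonable default here because factor values and clinical scores are rarely on comparable or normally distributed scales.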

Visualizations

[Diagram: MOFA+ Model Schematic for Sepsis Data. Two omics views, transcriptomics (log counts) and proteomics (log intensities), are linked through view-specific weights to the inferred latent factors LF1 (inflammatory response), LF2 (immune suppression), LF3 (granulocyte signature), and so on up to LFk. LF1 and LF2 are correlated with clinical traits (e.g., SOFA score, outcome).]

[Diagram: MOFA+ Sepsis Analysis Workflow. Sample collection (sepsis and control blood) → multi-omic assays (RNA-seq, MS proteomics) → data preprocessing (normalization, QC) → MOFA+ model training (factor inference) → factor interpretation (correlation, GSEA) → biological insight (immune trajectories, biomarkers).]

[Diagram: LF1 (Inflammatory) Pathway Associations. Latent factor 1 (high inflammatory signal) is linked to enriched pathways from GSEA (TNF-α signaling via NF-κB, inflammasome activation, JAK-STAT (IL-6) signaling), to its top-weighted genes (IL1B, TNF, S100A8), and to a high SOFA score and poor prognosis (r = 0.75).]

Within the broader thesis on applying Multi-Omics Factor Analysis (MOFA+) to deconvolute immune cell heterogeneity in sepsis, understanding core cellular concepts is paramount. Sepsis induces a profound dysregulation of the host immune response, characterized by concurrent hyperinflammation and immunosuppression. This application note details the key immune cell concepts—states, polarization, and exhaustion—that form the biological framework for constructing interpretable MOFA+ models. By integrating high-dimensional single-cell RNA sequencing (scRNA-seq), cytometry by time of flight (CyTOF), and proteomic data, MOFA+ can identify latent factors driving these pathological cell states, offering targets for stratified therapy.

Immune Cell States and Frequency in Sepsis

Immune cell states are transient, functional configurations driven by environmental signals. Sepsis causes a significant shift from homeostatic to disease-associated states.

Table 1: Alterations in Major Immune Cell Populations in Septic Patients vs. Healthy Controls

| Cell Type | Subset / State | Change in Sepsis | Reported Frequency in Sepsis (Mean ± SD or Range) | Associated Outcome |
|---|---|---|---|---|
| Monocytes | Classical (CD14++CD16-) | ↓ Early, ↑ Late | Varies widely | Early: Hyperinflammation |
| Monocytes | Intermediate (CD14++CD16+) | ↑↑ | 5-15% of monocytes (vs. 2-5% in HC) | Cytokine storm |
| Monocytes | Non-classical (CD14+CD16++) | ↓↓ | <1-2% of monocytes (vs. 5-10% in HC) | Immunosuppression |
| T Cells | CD4+ Naive | ↓↓↓ | 10-25% of CD4+ (vs. 40-60% in HC) | Lymphopenia |
| T Cells | CD4+ Effector Memory | | Increased proportion | Variable |
| T Cells | CD8+ Effector | ↑ then ↓ | Dynamic | Initial response then exhaustion |
| T Cells | Regulatory T cells (Tregs) | | 5-12% of CD4+ (vs. 2-5% in HC) | Immunosuppression |
| Myeloid-Derived Suppressor Cells (MDSC) | PMN-MDSC (CD15+) | ↑↑↑ | 20-50% of PBMCs in severe sepsis | Strong immunosuppression |
| Myeloid-Derived Suppressor Cells (MDSC) | M-MDSC (CD14+) | ↑↑ | 10-30% of monocytes | T cell inhibition |

Polarization and Functional Output

Polarization refers to the differentiation of immune cells into distinct, functionally specialized effector phenotypes, often driven by cytokine milieus.

Table 2: Key Polarization Programs in Sepsis

| Cell Type | Phenotype | Inducing Signals | Key Transcriptional Regulators | Functional Secretome |
|---|---|---|---|---|
| Macrophages | M1-like (Pro-inflammatory) | LPS, IFN-γ, GM-CSF | STAT1, NF-κB, IRF5 | TNF-α, IL-1β, IL-6, IL-12, iNOS |
| Macrophages | M2-like (Immunoregulatory) | IL-4, IL-10, IL-13, Glucocorticoids | STAT3, STAT6, IRF4, PPARγ | IL-10, TGF-β, ARG1, VEGF |
| T Helper Cells | Th1 | IL-12, IFN-γ | T-bet, STAT1, STAT4 | IFN-γ, TNF-α, IL-2 |
| T Helper Cells | Th2 | IL-4 | GATA3, STAT6 | IL-4, IL-5, IL-13 |
| T Helper Cells | Th17 | IL-6, TGF-β, IL-21, IL-23 | RORγt, STAT3 | IL-17A/F, IL-22 |
| T Helper Cells | Treg | TGF-β, IL-2 | FOXP3, STAT5 | IL-10, TGF-β, IL-35 |

Exhaustion and Dysfunction

T cell exhaustion is a state of progressive dysfunction and impaired effector function, defined by sustained expression of inhibitory receptors and transcriptional rewiring.

Table 3: Markers of T Cell Exhaustion in Sepsis

| Marker Category | Specific Markers | Change in Sepsis Exhaustion | Functional Consequence |
|---|---|---|---|
| Inhibitory Receptors | PD-1, CTLA-4, TIM-3, LAG-3, TIGIT | ↑↑ (Co-expression defines severity) | Attenuated TCR signaling, cell cycle arrest |
| Transcriptional Regulators | TOX, NR4A, BATF | | Drives exhaustion epigenetic program |
| Metabolic Shift | ↓ Mitochondrial biogenesis (PGC1α), ↑ Glycolysis | Altered | Reduced energetic capacity for proliferation |
| Effector Function | Proliferation (Ki67), Cytokine Production (IFN-γ, TNF-α, IL-2) | ↓↓↓ | Inability to clear secondary infections |

Experimental Protocols for Profiling Heterogeneity

Protocol: High-Parameter Cytometry (CyTOF) for Immune Cell State Profiling

Objective: To simultaneously quantify surface and intracellular markers defining cell identity, activation, polarization, and exhaustion in septic patient PBMCs.

Reagents: See "The Scientist's Toolkit" below.

Workflow:

  • Sample Preparation: Isolate PBMCs from heparinized blood of septic patients and healthy controls using density gradient centrifugation (Ficoll-Paque). Cryopreserve in FBS with 10% DMSO.
  • Antibody Staining:
    • Thaw and rest PBMCs overnight in RPMI-1640 + 10% FBS.
    • Stain with cisplatin (Cell-ID Cisplatin) for live/dead discrimination.
    • Block Fc receptors with human TruStain FcX.
    • Surface stain with metal-tagged antibody cocktail for 30 min at RT.
    • Fix cells with 1.6% formaldehyde (freshly prepared from paraformaldehyde).
    • Permeabilize with ice-cold 100% methanol and store at -80°C overnight.
    • Intracellular stain with metal-tagged antibody cocktail for 30 min at RT.
  • Data Acquisition & Analysis:
    • Resuspend cells in deionized water containing 1:10 dilution of EQ Four Element Calibration Beads.
    • Acquire data on a Helios or CyTOF 2 mass cytometer at ~300-500 events/sec.
    • Normalize data using bead-based normalization.
    • Perform downstream analysis: debarcoding (for pooled samples), concatenation, viSNE/t-SNE, UMAP, and FlowSOM or PhenoGraph clustering.

[Diagram: CyTOF Workflow for Sepsis Immune Profiling. PBMC isolation (Ficoll gradient) → live/dead stain (cisplatin) → surface antibody staining → fixation (formaldehyde) → permeabilization (100% methanol) → intracellular antibody staining → mass cytometer acquisition → data normalization (EQ beads) → high-dimensional analysis (UMAP, FlowSOM).]

Protocol: Single-Cell RNA Sequencing (scRNA-seq) for State & Exhaustion Analysis

Objective: To profile the transcriptional landscape of immune cells, identifying novel states, polarization trajectories, and exhaustion signatures.

Workflow (10x Genomics Platform):

  • Cell Preparation: Generate a high-viability (>90%) single-cell suspension of PBMCs or sorted immune cell populations. Adjust to 700-1200 cells/μL.
  • Library Generation: Use the Chromium Next GEM Single Cell 5' v2 or 3' v3.1 kit. Key steps:
    • Gel Bead-In-Emulsions (GEMs): Combine cells, Master Mix, and Gel Beads with barcoded oligonucleotides on a Chromium Chip.
    • Reverse Transcription: Inside each GEM, RNA is reverse-transcribed, adding a cell barcode and Unique Molecular Identifier (UMI).
    • cDNA Amplification: Break emulsions, purify cDNA, and amplify by PCR.
    • Library Construction: Fragment, size-select, and index cDNA to add sample index and sequencing adapters.
  • Sequencing & Analysis: Sequence on Illumina NovaSeq (aim for 50,000 reads/cell). Process using Cell Ranger pipeline (alignment, barcode counting, UMI aggregation). Downstream analysis in R (Seurat package): QC, normalization, integration (if multiple samples), PCA, clustering (FindNeighbors, FindClusters), UMAP visualization, and differential gene expression (FindMarkers).

[Diagram: scRNA-seq Workflow from Cells to Clusters. Single-cell suspension (>90% viability) → GEM generation and reverse transcription → cDNA amplification and purification → library preparation (fragmentation, indexing) → Illumina sequencing → Cell Ranger processing (alignment, counting) → Seurat analysis (QC, PCA, clustering, UMAP).]

The Scientist's Toolkit: Research Reagent Solutions

Table 4: Essential Reagents for Profiling Immune Heterogeneity in Sepsis

| Reagent / Kit | Vendor Examples | Function in Experiment |
|---|---|---|
| Ficoll-Paque PLUS | Cytiva, MilliporeSigma | Density gradient medium for isolating PBMCs from whole blood. |
| Cell-ID Intercalator-Ir (CyTOF) | Standard BioTools | DNA intercalator for cell event discrimination and normalization in mass cytometry. |
| Cell-ID 20-Plex Pd Barcoding Kit | Standard BioTools | Allows multiplexing of up to 20 samples in a single CyTOF run, reducing batch effects. |
| Maxpar X8 Antibody Labeling Kits | Standard BioTools | For custom conjugation of purified antibodies to rare-earth metals for CyTOF. |
| TruStain FcX (Fc Receptor Blocking Solution) | BioLegend | Blocks non-specific antibody binding via Fc receptors, reducing background. |
| Chromium Next GEM Single Cell 5' Kit v2 | 10x Genomics | End-to-end solution for generating scRNA-seq libraries, capturing 5' ends for immune profiling. |
| Feature Barcode technology (CITE-seq) | 10x Genomics | Allows simultaneous measurement of surface protein (antibody-derived tags) and transcriptome in single cells. |
| Foxp3 / Transcription Factor Staining Buffer Set | Thermo Fisher, BioLegend | Permeabilization buffers optimized for intracellular staining of transcription factors (e.g., FOXP3, T-bet). |
| LegendPlex Human Inflammation Panel 1 | BioLegend | Bead-based multiplex immunoassay for quantifying 13 inflammatory cytokines from serum or supernatant. |
| CellTrace Violet / CFSE Cell Proliferation Kits | Thermo Fisher | Fluorescent dyes to track lymphocyte proliferation in vitro following sepsis plasma stimulation. |

MOFA+ Integration Workflow Diagram

[Diagram: MOFA+ Integrates Multi-Omics to Define Sepsis States. scRNA-seq (transcriptome), CyTOF (protein/surface), and bulk/spatial proteomics feed into the MOFA+ model (joint dimensionality reduction), which yields latent factors (drivers of heterogeneity), for example Factor 1: myeloid inflammation signature, Factor 2: T cell exhaustion program, and Factor 3: M2-like polarization axis.]

Sepsis, a dysregulated host response to infection, is characterized by profound immune system heterogeneity, driving divergent patient trajectories. Traditional single-omics immunophenotyping (e.g., flow cytometry, bulk transcriptomics) fails to capture this complexity, presenting critical gaps:

  • Limited Dimensionality: Flow cytometry is constrained by fluorescent spectral overlap (~20-30 parameters), offering a narrow view of cellular states.
  • Lack of Multi-omics Integration: Bulk transcriptomics and cytometry measure different biological layers in isolation, preventing a unified view of genotype-to-phenotype relationships.
  • Inability to Deconvolve Co-variation: Traditional methods cannot systematically identify shared and unique sources of variation across patients, time points, and data modalities.
  • Oversimplification of States: Manual gating and clustering in cytometry impose discrete categorizations on continuous immune cell states.

Multi-Omics Factor Analysis+ (MOFA+) is a Bayesian statistical framework that addresses these gaps by integrating multiple omics datasets (e.g., single-cell RNA-seq, CITE-seq, proteomics) measured on the same samples to discover the principal axes of variation (factors) that drive heterogeneity across all data types.

Key Advantages of MOFA+ in Sepsis Research

MOFA+ provides a data-driven, unified model of sepsis immunology.

| Traditional Method Gap | MOFA+ Solution | Impact on Sepsis Research |
|---|---|---|
| Limited Dimensionality | Model latent factors explaining variance across hundreds to thousands of features (genes, proteins). | Identifies continuous, multi-feature immune dysregulation axes (e.g., "myeloid dysfunction," "T cell exhaustion"). |
| Isolated Omics Views | Joint decomposition of multi-omics data (scRNA-seq + surface protein + chromatin). | Reveals coordinated transcriptional and proteomic changes defining novel cell states. |
| Inability to Model Shared Variation | Distinguishes variation shared across omics layers from that unique to a specific layer. | Separates technical noise from biological signal; identifies core drivers of sepsis shared by all data types. |
| Discrete, Manual Gating | Data-driven, continuous factors capture gradients of cell states. | Discovers intermediate, transitional cell states predictive of outcome. |
| Poor Handling of Missing Data | Robust handling of missing values (e.g., missing antibody (ADT) measurements in some cells). | Enables integration of sparse CITE-seq data and unbalanced patient cohorts. |
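
As a small illustration of the missing-data point, the sketch below aligns two views to a common list of cells and fills unmeasured cells with NaN, which is the general shape of input a model that tolerates missing values can work with. The cell names, values, and helper names (align_view, observed_fraction) are hypothetical, and this shows only the data layout, not how MOFA+ handles the missing entries internally.

```python
import math


def align_view(all_cells, view):
    """Return one row per cell in all_cells; unmeasured cells become rows of NaN."""
    n_features = len(next(iter(view.values())))
    return [list(view[c]) if c in view else [math.nan] * n_features
            for c in all_cells]


def observed_fraction(matrix):
    """Fraction of entries in a matrix (list of rows) that are not NaN."""
    values = [v for row in matrix for v in row]
    return sum(not math.isnan(v) for v in values) / len(values)


# Hypothetical normalized measurements keyed by cell ID.
rna = {"cell1": [1.2, 0.0, 3.1], "cell2": [0.4, 2.2, 0.0], "cell3": [0.0, 0.9, 1.7]}
adt = {"cell1": [0.8, -0.3], "cell3": [-0.5, 1.1]}  # cell2 has no ADT data

all_cells = sorted(set(rna) | set(adt))
rna_aligned = align_view(all_cells, rna)
adt_aligned = align_view(all_cells, adt)

print(all_cells)
print(observed_fraction(adt_aligned))
```

Tracking the observed fraction per view is a quick way to see how unbalanced a cohort is before deciding whether a sparsely measured modality should be included at all.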

Application Notes: MOFA+ for Sepsis Patient Stratification

Objective: To identify latent factors of immune variation that stratify septic patients into endotypes with distinct outcomes and molecular drivers.

Data Input: Single-cell multi-omics data (CITE-seq: RNA + 50 surface proteins) from peripheral blood mononuclear cells (PBMCs) of 40 sepsis patients (day 1) and 10 healthy controls.

MOFA+ Workflow & Analysis:

[Diagram: MOFA+ Analysis Workflow for Sepsis Stratification. Multi-omics data (scRNA + ADT) → data preparation and input matrices → train MOFA+ model → factor analysis and interpretation → biological validation and endotyping. Key outputs: factor overview (factor vs. variance), factor loadings (key features per factor), and factor values (sample/cell scores).]

Protocol 3.1: MOFA+ Model Training

  • Data Preprocessing: Generate normalized count matrices for each modality. For RNA, use log(CP10K+1). For antibody-derived tags (ADT), use centered log-ratio (CLR) transformation.
  • Create MOFA Object: Use create_mofa() function in R/Python, specifying the two data views ("RNA", "ADT"). Specify sample (patient ID) and group (optional, e.g., outcome) metadata.
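  • Model Training & Dimensionality: Run run_mofa() with default training options. Use automatic relevance determination to prune irrelevant factors. As a starting point, 10-15 factors are often sufficient; the total variance explained depends strongly on data type and noise level and is usually much lower for sparse single-cell data than for bulk data, so judge the model by variance explained per factor and view rather than by a fixed threshold.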
  • Model Evaluation: Inspect the model convergence (ELBO plot) and variance explained per view (R²).
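
The two normalizations named in the preprocessing step can be written out for a single cell as follows. This is a minimal sketch: the counts are invented, and the CLR shown is the textbook form with a pseudocount of 1, which is an assumption (toolkits such as Seurat implement slightly different CLR variants).

```python
import math


def log_cp10k(counts):
    """log(CP10K + 1): scale one cell's counts to 10,000 total, then log1p."""
    total = sum(counts)
    if total == 0:
        return [0.0 for _ in counts]
    return [math.log1p(c / total * 1e4) for c in counts]


def clr(counts, pseudocount=1.0):
    """Centered log-ratio across the proteins measured in one cell."""
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


# Hypothetical raw counts for one cell.
rna_cell = [0, 3, 12, 85]          # four genes, 100 UMIs in total
adt_cell = [5, 120, 40, 0, 900]    # five surface proteins

rna_norm = log_cp10k(rna_cell)
adt_norm = clr(adt_cell)

print([round(v, 3) for v in rna_norm])
print([round(v, 3) for v in adt_norm])
```

CLR values sum to zero within a cell, so each protein is expressed relative to that cell's overall antibody signal, which dampens differences in staining or capture efficiency between cells.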

Protocol 3.2: Factor Interpretation & Patient Stratification

  • Identify Biological Factors: Correlate factor values with sample metadata (e.g., SOFA score, survival). Plot factor values per patient.
  • Annotate Factors via Loadings: Extract top-weighted genes (RNA loadings) and proteins (ADT loadings) for each factor. Perform pathway enrichment (e.g., Reactome) on top gene loadings.
  • Define Patient Endotypes: Summarize the per-cell factor values for each patient (e.g., mean or median per factor), then perform unsupervised clustering (k-means or hierarchical) on the resulting matrix of patient-level factor scores. This yields candidate molecular endotypes (e.g., "Hyper-inflammatory," "Immunosuppressed," "Intermediate").
  • Validate Endotypes: Compare clinical outcomes (28-day mortality, organ failure) across endotypes using Kaplan-Meier and logistic regression.
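
The endotyping step can be sketched as two operations: averaging per-cell factor values within each patient, then clustering the patient-level scores. The example below uses a plain k-means (Lloyd's algorithm) with fixed starting centroids so the result is reproducible; the factor values, patient IDs, and the choice of two clusters are all hypothetical, and a real analysis would use an established clustering library and assess cluster stability.

```python
from collections import defaultdict


def patient_means(cell_factors, cell_patient):
    """Average per-cell factor values within each patient."""
    sums, counts = {}, defaultdict(int)
    for values, patient in zip(cell_factors, cell_patient):
        acc = sums.setdefault(patient, [0.0] * len(values))
        for k, v in enumerate(values):
            acc[k] += v
        counts[patient] += 1
    return {p: [s / counts[p] for s in acc] for p, acc in sums.items()}


def kmeans(points, init_centroids, n_iter=20):
    """Lloyd's algorithm with caller-supplied initial centroids."""
    centroids = [list(c) for c in init_centroids]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assign each point to its nearest centroid (squared Euclidean distance).
        labels = [
            min(range(len(centroids)),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Move each centroid to the mean of its members; keep it if empty.
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical per-cell values for two factors, with the patient each cell came from.
cell_factors = [[2.0, -0.5], [1.6, -0.3], [1.9, 0.1], [2.3, -0.2],
                [-0.4, 1.8], [-0.6, 2.2], [-0.2, 1.5], [-0.5, 1.9]]
cell_patient = ["P1", "P1", "P2", "P2", "P3", "P3", "P4", "P4"]

means = patient_means(cell_factors, cell_patient)
patients = list(means)
points = [means[p] for p in patients]
labels, _ = kmeans(points, init_centroids=[points[0], points[2]])
print(dict(zip(patients, labels)))
```

Averaging is only one way to summarize cells per patient; medians or cell-type-specific summaries are alternatives when a few outlier cells or shifts in composition would dominate the mean.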

Table 1: Example MOFA+ Output - Sepsis Immune Dysregulation Factors

| Factor | % Variance Explained (RNA / ADT) | Top Gene Loadings (Pathway) | Top Protein Loadings | Clinical Correlation | Proposed Biology |
|---|---|---|---|---|---|
| Factor 1 | 22% / 18% | S100A8/9, IL1B, CXCL8 (Inflammation) | CD64, CD11b, CD62L(lo) | + SOFA Score | Myeloid Activation & Emergency Granulopoiesis |
| Factor 2 | 15% / 12% | PDCD1, LAG3, TOX (Exhaustion) | PD-1, Tim-3, HLA-DR(lo) | + Secondary Infection | Global T & NK Cell Exhaustion |
| Factor 3 | 8% / 20% | MKI67, TOP2A, BIRC5 (Cell Cycle) | CD38, CD71 | - Age, + Recovery | Proliferative Immune Reconstitution |

Detailed Protocol: Integrating Time-Series scRNA-seq Data

Objective: To model the dynamic evolution of the immune response in sepsis survivors vs. non-survivors over time.

Data: scRNA-seq data from 30 patients at Days 1, 3, and 7 post-ICU admission.

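MOFA+ for Time-Series Analysis: MOFA+ treats each time point as a separate group. This allows identification of factors with variance that is consistent across groups (shared), specific to one time point (group-specific), or shared across a subset. Note that features are centered within each group, so the multi-group model compares sources of variability across time points rather than mean shifts between them.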

[Diagram] Time-Series scRNA-seq Data (Patients x 3 Time Points) → Define Groups: Day1, Day3, Day7 → Train Multi-Group MOFA+ Model → Plot Variance Explained per Group → Identify Factor Types. Factor types: Shared Factor (e.g., Chronic Inflammation; high variance in all groups), Day1-Specific Factor (e.g., Early Shock Response; variance only in one group), and Trajectory Analysis (connect factors across time).

MOFA+ for Multi-Group Time-Series Analysis

Protocol 4.1: Multi-Group Model Setup & Training

  • Data Structuring: Create a single RNA matrix encompassing all cells from all patients and time points. Add a sample metadata column holding the sample ID (e.g., PatientA_Day1, PatientA_Day3) and a separate group column holding the time point (Day1, Day3, Day7).
  • Group Definition: Pass the group labels to the create_mofa() function using the groups argument.
  • Training Options: Enable the scale_groups data option to account for global differences in variance between time points (scale_views does the same for views, not groups). Use a slightly higher number of factors (e.g., 20).
  • Variance Decomposition: After training, use plot_variance_explained(model, x="group", y="factor") to visualize how much variance each factor explains in each group (time point).
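What "variance explained per group" means can be illustrated with a small calculation. The sketch below computes R² = 1 - SS_res / SS_tot for one factor within each group after centering features per group. It is a simplified, standalone illustration, not the MOFA2 implementation; the function name and toy data are assumptions.

```python
def r2_per_group(Y, z, w, groups):
    """R^2 of a single factor per group.

    Y: samples x features matrix (list of lists)
    z: factor value per sample
    w: factor weight per feature
    groups: group label per sample
    """
    result = {}
    n_feat = len(w)
    for g in sorted(set(groups)):
        rows = [i for i, gi in enumerate(groups) if gi == g]
        # Center each feature within the group.
        means = [sum(Y[i][j] for i in rows) / len(rows) for j in range(n_feat)]
        ss_tot = ss_res = 0.0
        for i in rows:
            for j in range(n_feat):
                y = Y[i][j] - means[j]
                ss_tot += y * y
                ss_res += (y - z[i] * w[j]) ** 2
        result[g] = 1.0 - ss_res / ss_tot if ss_tot > 0 else 0.0
    return result

# Toy data: the factor reconstructs Day1 perfectly and is inactive on Day7.
w = [1.0, 2.0]
Y = [[-1.0, -2.0], [1.0, 2.0], [0.5, -0.5], [-0.5, 0.5]]
z = [-1.0, 1.0, 0.0, 0.0]
groups = ["Day1", "Day1", "Day7", "Day7"]

r2 = r2_per_group(Y, z, w, groups)
```

A factor with high R² in one group and near zero in the others is what the text calls group-specific.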

Protocol 4.2: Dynamic Factor Trajectory Analysis

  • Extract Sample Scores: Obtain the factor score for each sample (patient-time point combination).
  • Plot Longitudinal Trajectories: For factors of interest, plot the median factor score per time point, stratified by patient outcome (survivor/non-survivor). Use linear mixed-effects models to test for significant outcome-by-time interactions on factor scores.
  • Identify State Drivers: For factors that diverge over time between outcomes, analyze the evolution of their gene loadings. This pinpoints the specific transcriptional programs driving divergent recovery.
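The trajectory summary described above (median factor score per time point, stratified by outcome) can be sketched with the standard library. The record layout and values are hypothetical, and the mixed-effects test itself is outside the scope of this sketch.

```python
from collections import defaultdict
from statistics import median

def median_trajectories(records):
    """Median factor score per (outcome, day).

    records: iterable of (patient, day, outcome, factor_score) tuples.
    """
    buckets = defaultdict(list)
    for _patient, day, outcome, score in records:
        buckets[(outcome, day)].append(score)
    return {key: median(vals) for key, vals in buckets.items()}

def outcome_gap(trajectories, day, a="non-survivor", b="survivor"):
    """Difference in median score between two outcome groups on one day."""
    return trajectories[(a, day)] - trajectories[(b, day)]

# Hypothetical scores for one factor.
records = [
    ("P1", "Day1", "survivor", 0.8),
    ("P2", "Day1", "survivor", 1.0),
    ("P3", "Day1", "non-survivor", 1.2),
    ("P1", "Day7", "survivor", 0.1),
    ("P2", "Day7", "survivor", 0.3),
    ("P3", "Day7", "non-survivor", 1.4),
]

traj = median_trajectories(records)
gap_day1 = outcome_gap(traj, "Day1")
gap_day7 = outcome_gap(traj, "Day7")
```

A gap that widens between time points is the kind of divergence an outcome-by-time interaction term would then test formally.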

Table 2: Key Research Reagent Solutions for MOFA+-Integrated Sepsis Studies

| Item / Solution | Function in MOFA+ Sepsis Research | Example / Provider |
|---|---|---|
| High-Parameter Cytometry | Provides rich proteomic input data for MOFA+ integration (surface markers, signaling states). | BD FACSymphony, CyTOF (Standard BioTools, formerly Fluidigm) |
| CITE-seq Kits | Enables simultaneous measurement of RNA and surface protein (ADT) from single cells, giving ideal paired data for MOFA+. | BioLegend TotalSeq, 10x Genomics Feature Barcoding |
| Fixed RNA Profiling Assays | Allows profiling of samples with temporal or spatial separation, preserving sample alignment for MOFA+. | 10x Genomics Chromium Fixed RNA Profiling (Flex), Parse Biosciences Evercode |
| Cell Hashing Reagents | Multiplex samples, reducing batch effects and ensuring patient/time-matched cells across omics layers. | BioLegend TotalSeq Hashtag antibodies, 10x Genomics CellPlex |
| MOFA+ Software Package | Core statistical framework for multi-omics integration and factor analysis. | R/Python package (biofam.github.io/MOFA2) |
| Immune Reference Atlases | Provide prior knowledge for interpreting MOFA+ factor loadings (e.g., cell-type signatures). | DICE, Human Cell Atlas, ImmGen |
| Pathway Analysis Tools | Functional annotation of top gene loadings from MOFA+ factors. | fgsea (R), Enrichr, Ingenuity Pathway Analysis |

Application Notes for Immune Cell Heterogeneity in Sepsis Research

Multi-Omics Factor Analysis+ (MOFA+) is a statistical framework for the integration of multi-omics datasets. In the context of sepsis research, a disease characterized by a dysregulated host response to infection leading to life-threatening organ dysfunction, MOFA+ is instrumental for disentangling the sources of immune cell heterogeneity. Sepsis induces profound and complex changes across cellular transcriptional states, surface protein expression, and secreted signaling molecules. By integrating compatible data modalities, MOFA+ can identify coordinated patterns of variation (factors) across these layers, revealing novel patient endotypes, drivers of immunosuppressive or hyperinflammatory states, and potential therapeutic targets.

The successful application of MOFA+ hinges on the proper preprocessing and formatting of input data types. The following modalities are directly compatible and highly relevant for sepsis immunology.

Table 1: MOFA+-Compatible Data Types for Sepsis Immunology

| Data Type | Measured Features | Typical Scale | Key Insight for Sepsis | Preprocessing for MOFA+ |
|---|---|---|---|---|
| scRNA-seq | Gene expression (mRNA) | Single-cell | Cell-type-specific transcriptional programs, novel subtypes, trajectory inference. | Counts → log-normalization (e.g., log1p(CP10K)). Filter lowly expressed genes; select highly variable genes. |
| CITE-seq | mRNA + Surface Proteins | Single-cell | Paired transcriptomic and proteomic (20-200+ markers) view of cell identity and state. | RNA: as above. ADT (proteins): centered log-ratio (CLR) normalization to handle the compositional nature of ADT counts. CLR does not remove ambient or background signal; that needs a dedicated method (e.g., dsb). |
| CyTOF | Surface & Intracellular Proteins | Single-cell (high-dimensional) | Deep immunophenotyping (40-50+ markers), signaling pathway activity (phospho-proteins). | Arcsinh transformation (cofactor=5). Downsampling or aggregation may be required for large cohorts. |
| Bulk/Spatial Proteomics | Soluble Proteins (cytokines, analytes) | Bulk tissue or plasma | Systemic inflammatory response, organ-specific signatures, biomarker discovery. | Log-transformation. Appropriate scaling (e.g., Z-score) across samples. |

Detailed Experimental Protocols

Protocol 1: Generation of a CITE-seq Dataset from Septic Patient PBMCs

Objective: To generate a paired single-cell transcriptome and surface proteome profile from Peripheral Blood Mononuclear Cells (PBMCs) of septic patients and controls for MOFA+ integration.

Materials: See "Research Reagent Solutions" below.

Procedure:

  • PBMC Isolation: Draw whole blood into heparin tubes. Dilute blood 1:1 with PBS. Layer over Ficoll-Paque PLUS density gradient medium. Centrifuge at 400-500 × g for 30-35 minutes at room temperature (brake off). Collect the PBMC interface layer. Wash twice with PBS + 0.04% BSA.
  • Cell Staining with Antibody-Derived Tags (ADTs): Resuspend up to 1×10^6 cells in 50µl of PBS + 0.04% BSA. Add a pre-titrated cocktail of TotalSeq antibodies. Incubate for 30 minutes on ice in the dark. Wash cells three times with cold PBS + 0.04% BSA.
  • Single-Cell Partitioning and Library Preparation: Count and assess viability (trypan blue). Load cells, gel beads, and reagents onto the 10x Genomics Chromium chip according to the Chromium Next GEM Single Cell 5' v2 protocol. Within each droplet, mRNA and ADT oligos from the same cell are captured by the same barcoded gel bead, so both libraries share one cell barcode.
  • cDNA & ADT Library Construction: Follow the manufacturer's protocol for reverse transcription, cDNA amplification, and library construction. Construct separate gene expression (cDNA) and feature (ADT) libraries using distinct index primers.
  • Sequencing: Pool libraries. Sequence on an Illumina platform. Recommended depth: ≥20,000 reads/cell for gene expression, ≥5,000 reads/cell for ADTs.

Protocol 2: Phospho-Protein Profiling by CyTOF for Signaling States in Sepsis

Objective: To quantify phosphorylation states of key signaling proteins (e.g., pSTAT, pERK, pS6) in immune cell subsets from septic patients.

Procedure:

  • Cell Stimulation & Fixation: Aliquot 2-3×10^6 PBMCs per condition. Stimulate with relevant cytokines (e.g., IL-6, IFNγ, LPS) or leave unstimulated for 15 minutes at 37°C. Immediately add 1× MaxPar Cell Staining Buffer (CSB) and fix cells with 1.6% formaldehyde for 10 minutes at RT.
  • Barcoding: Use a palladium-based barcoding kit (e.g., Cell-ID 20-Plex Pd) to pool samples, reducing technical variability. Permeabilize cells with ice-cold methanol and store at -80°C overnight.
  • Antibody Staining: Centrifuge to remove methanol. Wash twice with CSB. Block with Fc receptor block. Incubate with a pre-conjugated metal-tagged antibody cocktail (surface markers + intracellular phospho-targets) for 1 hour at RT.
  • DNA Staining and Acquisition: Wash cells. Resuspend in intercalator solution (e.g., 125nM Cell-ID Intercalator-Ir in PBS) to label DNA. Acquire on a CyTOF mass cytometer, calibrating daily with EQ beads.
  • Data Processing: Normalize data using bead-based normalization. Debarcode samples. Gate on single, nucleated (DNA intercalator-positive) immune cells; excluding dead cells additionally requires a viability stain (e.g., cisplatin). Preprocess for MOFA+: apply arcsinh transformation (cofactor=5), then export median marker intensity per cell population (e.g., CD4+ T cells, monocytes) per sample.
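The final data-processing step (arcsinh transform with cofactor 5, then a median per population and sample) can be sketched as follows. The event tuple layout and the intensities are hypothetical; gating and debarcoding are assumed to have happened upstream.

```python
import math
from collections import defaultdict
from statistics import median

def arcsinh_transform(value, cofactor=5.0):
    """Arcsinh transform commonly applied to mass cytometry intensities."""
    return math.asinh(value / cofactor)

def median_by_population(events, cofactor=5.0):
    """Median transformed intensity per (sample, population).

    events: iterable of (sample_id, population, raw_intensity) for one marker.
    """
    buckets = defaultdict(list)
    for sample_id, population, raw in events:
        buckets[(sample_id, population)].append(arcsinh_transform(raw, cofactor))
    return {key: median(vals) for key, vals in buckets.items()}

# Hypothetical pSTAT3 intensities for gated events from one sample.
events = [
    ("S1", "Monocytes", 50.0),
    ("S1", "Monocytes", 5.0),
    ("S1", "Monocytes", 0.0),
    ("S1", "CD4 T", 10.0),
]

summary = median_by_population(events)
```

The resulting table has one row per sample and one column per population-marker pair, which is the sample-level shape needed to pair CyTOF with other patient-level views.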

Diagrams

[Diagram] Septic Patient & Control Cohorts feed four modalities, each with its own preprocessing: scRNA-seq (log-normalization, variable genes); CITE-seq (RNA log-normalization, ADT CLR); CyTOF (arcsinh transform); Bulk Proteomics of plasma cytokines (log-transform, Z-score). All four → MOFA+ Model Training → Latent Factors (Drivers of Heterogeneity) → Analysis & Interpretation: patient endotypes, multi-omic drivers, biomarker discovery.

Title: MOFA+ Integration Workflow for Sepsis Multi-omics Data

[Diagram] Pathogen PAMPs (e.g., LPS) → TLR4 Receptor → MyD88 Adaptor → NF-κB Activation → Pro-inflammatory Cytokine Release (IL-6, TNFα, IL-1β) → pSTAT3 Signaling → Immunosuppressive Response (IL-10, TGFβ). Cytokine release also feeds MOFA+ Factor 1 'Hyperinflammation'; Exhaustion Marker (PD-1) → Immunosuppressive Response.

Title: Key Sepsis Signaling Pathways Captured by Multi-omics


The Scientist's Toolkit: Research Reagent Solutions

| Item | Supplier Example | Function in Sepsis Multi-Omics |
|---|---|---|
| Ficoll-Paque PLUS | Cytiva | Density gradient medium for isolation of viable PBMCs from septic blood. |
| TotalSeq Antibodies | BioLegend | Antibody-derived tags (ADTs) for CITE-seq, enabling simultaneous surface protein detection with 10x Genomics. |
| Chromium Next GEM Kit 5' v2 | 10x Genomics | Reagents for single-cell partitioning and library prep of 5' gene expression and ADT libraries. |
| Cell-ID 20-Plex Pd Barcoding Kit | Standard BioTools | Palladium-based barcoding kit for multiplexing up to 20 CyTOF samples, reducing batch effects. |
| MaxPar Metal-Conjugated Antibodies | Standard BioTools | Antibodies conjugated to rare-earth metals for CyTOF, targeting surface markers and phospho-epitopes. |
| LEGENDplex Human Inflammation Panel | BioLegend | Bead-based immunoassay for quantifying 13 inflammatory cytokines in plasma/serum for bulk proteomics. |
| MOFA+ R/Python Package | GitHub (bioFAM) | Core software tool for multi-omics integration and factor analysis. |
| Seurat R Toolkit | Satija Lab | Primary tool for preprocessing, normalization, and analysis of scRNA-seq and CITE-seq data prior to MOFA+. |

Step-by-Step Workflow: Applying MOFA+ to Your Sepsis Immunomics Dataset

This protocol constitutes the foundational Stage 1 of a comprehensive thesis applying Multi-Omics Factor Analysis plus (MOFA+) to deconvolute immune cell heterogeneity in sepsis. The precise characterization of patient-specific immune states—ranging from hyperinflammation to immunoparalysis—is confounded by significant technical noise inherent in clinical sample processing. This stage details the standardized preprocessing and rigorous quality control (QC) pipeline essential for generating reliable, high-quality single-cell RNA sequencing (scRNA-seq) and bulk proteomic data from septic patient peripheral blood mononuclear cells (PBMCs). Robust data from this stage is a prerequisite for subsequent MOFA+ integration and the identification of latent factors driving patient stratification and outcomes.

Sample Acquisition & Initial Processing Protocol

Materials and Reagent Setup

  • Whole Blood Collection: Collect 20-30 mL of peripheral blood into sodium heparin or EDTA vacutainers from septic patients (meeting Sepsis-3 criteria) within 24 hours of ICU admission, and from matched healthy controls. Process within 2 hours of draw.
  • PBMC Isolation: Density gradient medium (e.g., Ficoll-Paque PLUS). Prepare Leucosep tubes with 15 mL of room-temperature density gradient medium, positioned below the porous barrier according to the manufacturer's instructions.
  • Cell Viability & Counting: Trypan Blue (0.4%) or AO/PI staining solution. Automated cell counter or hemocytometer.
  • Cryopreservation Medium: 90% Fetal Bovine Serum (FBS) + 10% DMSO, kept cold until use. Have a controlled-rate freezing apparatus, or an isopropanol freezing container for the -80°C freezer, ready.

Step-by-Step Protocol

  • Blood Dilution: Dilute blood 1:1 with room-temperature Dulbecco's Phosphate-Buffered Saline (DPBS), without Ca2+/Mg2+.
  • Density Gradient Centrifugation:
    • Carefully layer 25mL of diluted blood over 15mL of Ficoll-Paque in a Leucosep tube.
    • Centrifuge at 800 × g for 20 minutes at 20°C, with the brake OFF.
    • Aspirate the plasma layer above the porous barrier. Transfer the mononuclear cell layer at the interface to a new 50mL conical tube.
  • PBMC Washing:
    • Wash cells with 50mL of DPBS. Centrifuge at 350 × g for 10 minutes at 20°C.
    • Aspirate supernatant. Resuspend pellet in 10mL of Red Blood Cell Lysis Buffer (e.g., ACK). Incubate for 3-5 minutes at RT.
    • Quench with 40mL of DPBS and centrifuge at 350 × g for 5 minutes.
    • Repeat DPBS wash twice more.
  • Cell Counting & Viability Assessment:
    • Resuspend final pellet in 5mL of DPBS + 0.04% BSA.
    • Mix 10µL of cell suspension with 10µL of Trypan Blue. Count live (unstained) and dead (blue) cells.
  • Cryopreservation:
    • Pellet required number of cells. Resuspend gently in cold cryopreservation medium at 5-10 × 10^6 cells/mL.
    • Aliquot 1mL into cryovials. Place vials in a controlled-rate freezer, cooling at -1°C/min to -80°C, then transfer to liquid nitrogen vapor phase for long-term storage.
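The counting and aliquoting arithmetic in the steps above can be sketched as a small helper. It assumes a standard hemocytometer in which each large square holds 10^-4 mL, the 1:1 Trypan Blue dilution (factor 2) and 5 mL resuspension volume from the protocol, and hypothetical counts.

```python
def hemocytometer_summary(live_counts, dead_counts, dilution_factor=2, volume_ml=5.0):
    """Concentration, viability, and yield from per-square hemocytometer counts.

    live_counts / dead_counts: cells counted in each large square.
    Each large square is assumed to hold 1e-4 mL.
    """
    mean_live = sum(live_counts) / len(live_counts)
    live_per_ml = mean_live * dilution_factor * 1e4
    total_live = sum(live_counts)
    total_dead = sum(dead_counts)
    viability = total_live / (total_live + total_dead)
    return {
        "live_cells_per_ml": live_per_ml,
        "viability": viability,
        "total_live_cells": live_per_ml * volume_ml,
    }

def cryovials(total_live_cells, cells_per_vial=5e6):
    """Number of full 1 mL vials at the chosen density (5-10 x 10^6 cells/mL)."""
    return int(total_live_cells // cells_per_vial)

# Hypothetical counts from four large squares.
stats = hemocytometer_summary([100, 110, 90, 100], [5, 5, 5, 5])
n_vials = cryovials(stats["total_live_cells"])
```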

Quality Control Metrics & Thresholds

All quantitative QC data must pass the following thresholds prior to downstream analysis.

Table 1: Mandatory QC Metrics for scRNA-seq Data (10x Genomics Platform)

| Metric | Target Range | Failure Action |
|---|---|---|
| Estimated Number of Cells | Close to the targeted cell recovery (only a fraction of the loaded cells is captured) | Check cell concentration and viability |
| Median Genes per Cell | > 1,500 for PBMCs | Filter out low-quality cells; investigate sample handling |
| Median UMI Counts per Cell | > 3,000 for PBMCs | Filter out low-quality cells |
| Mitochondrial Gene Percentage | < 10% (Healthy), < 20% (Septic) | Filter high-% cells; indicates apoptosis/stress |
| Ribosomal Protein Gene Percentage | 5-20% | Outliers may indicate technical artifacts |
| Doublet Rate (Scrublet Estimate) | < 5% | Remove predicted doublets |
| Total Genes Detected (Library Complexity) | > 20,000 | Sample may be undersequenced; increase depth |

Table 2: Mandatory QC Metrics for Bulk Proteomics (Olink/LC-MS)

| Metric | Target | Failure Action |
|---|---|---|
| Sample Intensity CV (Internal Controls) | < 10% | Check sample handling and assay procedure |
| Sample Detection Rate | > 85% of assays above LOD | Re-run if low; indicates poor sample quality |
| Inter-plate Control CV | < 15% | Normalize across plates using controls |
| Sample-to-Sample Correlation | R > 0.9 for replicates | Identify and remove outliers |

Computational Preprocessing Workflow

scRNA-seq Processing with Cell Ranger & Seurat

  • Demultiplexing & Alignment: Run cellranger mkfastq and cellranger count (GRCh38 reference) to generate raw feature-barcode matrices.
  • Initial Seurat Object Creation: Load data into Seurat (v5). Create object, retaining cells with unique feature counts between 500 and 6000 and <20% mitochondrial counts.
  • Normalization & Scaling: Normalize data using SCTransform (recommended for heterogeneity) or LogNormalize. Regress out effects of mitochondrial percentage and cell cycle score (using CellCycleScoring).
  • High-Variance Feature Selection: Identify 2000-3000 highly variable features for downstream dimensionality reduction.
  • Doublet Removal: Apply scDblFinder or DoubletFinder to identify and remove computational doublets.
  • Batch Effect Evaluation: Use FastMNN or Harmony integration if strong batch effects (e.g., processing day) are observed via PCA/UMAP visualization.
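The cell-level filter described in the Seurat step can be written as a simple predicate. This sketch uses the thresholds quoted above (500-6000 detected features, <20% mitochondrial counts); treating the feature bounds as inclusive is an assumption, and the per-cell dictionaries are hypothetical.

```python
def passes_qc(n_features, pct_mito, min_features=500, max_features=6000, max_pct_mito=20.0):
    """True if a cell falls inside the feature-count window and below the mito cutoff."""
    return min_features <= n_features <= max_features and pct_mito < max_pct_mito

def filter_cells(cells):
    """Keep cells that pass QC; each cell is a dict with 'n_features' and 'pct_mito'."""
    return [c for c in cells if passes_qc(c["n_features"], c["pct_mito"])]

# Hypothetical per-cell QC metrics.
cells = [
    {"barcode": "AAAC", "n_features": 1800, "pct_mito": 4.2},   # passes
    {"barcode": "AAAG", "n_features": 320, "pct_mito": 3.0},    # too few features
    {"barcode": "AACT", "n_features": 7200, "pct_mito": 5.0},   # likely doublet
    {"barcode": "AAGT", "n_features": 2100, "pct_mito": 31.5},  # stressed/dying
]

kept = filter_cells(cells)
kept_barcodes = [c["barcode"] for c in kept]
```

For septic samples it can be worth inspecting the removed cells before discarding them, since stressed populations may carry disease signal.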

Proteomics Data Processing

  • Normalization: For Olink data, apply intra- and inter-plate normalization using NPX Manager software. For LC-MS, use median or quantile normalization.
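  • Imputation: For values below detection limit, use a left-censored imputation method (e.g., impute.QRILC from R's imputeLCMD package, which expects log-scale data) or replace with LOD/√2 on the original scale.
  • Log Transformation: Apply log2 transformation to LC-MS intensities to stabilize variance. Olink NPX values are already on a log2 scale and should not be log-transformed again.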
  • Batch Correction: Apply ComBat or limma removeBatchEffect if technical batches are identified via PCA.
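The simpler of the two imputation options (LOD/√2 substitution followed by log2) can be sketched as follows. This assumes raw-scale intensities with missing values encoded as None and a single LOD per analyte; the values are hypothetical.

```python
import math

def impute_below_lod(values, lod):
    """Replace missing (None) or sub-LOD values with LOD / sqrt(2)."""
    fill = lod / math.sqrt(2)
    return [fill if v is None or v < lod else v for v in values]

def log2_transform(values):
    """log2 of strictly positive raw-scale values."""
    return [math.log2(v) for v in values]

# Hypothetical raw intensities for one analyte with LOD = 2.0.
raw = [None, 0.5, 8.0]
imputed = impute_below_lod(raw, lod=2.0)
logged = log2_transform(imputed)
```

Substitution by a constant shrinks the variance of heavily censored analytes, which is one reason distribution-based methods are often preferred when many values fall below the LOD.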

The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Materials for Sample Prep & QC

| Item | Function/Benefit | Example Product |
|---|---|---|
| Ficoll-Paque PLUS | Density gradient medium for gentle, high-yield PBMC isolation. | Cytiva, #17144002 |
| Leucosep Tubes | Centrifuge tubes with porous barrier for streamlined PBMC isolation, minimizing platelet contamination. | Greiner Bio-One, #227290 |
| ACK Lysing Buffer | Ammonium-Chloride-Potassium buffer for efficient RBC lysis with low leukocyte damage. | Gibco, #A1049201 |
| DMSO (Cell Culture Grade) | Cryoprotectant for viable long-term cell storage. | Sigma-Aldrich, #D2650 |
| Trypan Blue Solution (0.4%) | Vital dye for distinguishing live (excluded) from dead (stained) cells in counting. | Gibco, #15250061 |
| Chromium Next GEM Chip G | Microfluidic chip for partitioning single cells and barcoded beads in 10x Genomics 3' v3.1 workflows. | 10x Genomics, #1000127 |
| Single Cell 3' GEM, Library & Gel Bead Kit v3.1 | Reagents for generating barcoded scRNA-seq libraries. | 10x Genomics, #1000121 |
| Olink Target 96/384 Inflammation Panel | Multiplex immunoassay for precise quantification of 92 inflammation-related proteins from low sample volume. | Olink, #95305 |
| Cell Ranger Analysis Software | End-to-end analysis pipeline for demultiplexing, barcode processing, and UMI counting of 10x data. | 10x Genomics (Free) |

Visualization of the Preprocessing & QC Workflow

[Diagram] Whole Blood Collection (Septic & Healthy) → PBMC Isolation (Density Gradient) → Cell Count & Viability QC (fail: repeat isolation; pass: Cryopreservation) → scRNA-seq (10x Genomics) and Bulk Proteomics (Olink/LC-MS) → Computational Processing (Cell Ranger, Seurat for scRNA-seq; Normalization, Imputation for proteomics) → Filtering & QC (Genes/Cell, MT%, Doublets for scRNA-seq; Detection Rate, CV, Correlation for proteomics). Fail: return to the assay. Pass: Curated Quality-Controlled Matrices (Ready for MOFA+).

Title: Septic Sample Preprocessing and QC Workflow Diagram

[Diagram] Key Signaling Pathways Impacting Sepsis Immune Heterogeneity. Four pathways feed MOFA+ Integration (identifies latent factors linking pathways to patient outcomes): TLR/NF-κB Pathway (hyperinflammatory state; key proteins MYD88, TRAF6, NFKB1); JAK-STAT Pathway (immunoparalysis state; key proteins STAT1, STAT3, SOCS3); Type I Interferon Response (antiviral state; key proteins IRF7, ISG15); Antigen Presentation (dysfunctional response; key proteins HLA-DR, CIITA).

Title: Key Immune Pathways Integrated by MOFA+ in Sepsis

Within the broader thesis investigating immune cell heterogeneity in sepsis using multi-omics integration, Stage 2 focuses on the critical construction of the Multi-Omics Factor Analysis plus (MOFA+) model. This stage involves the strategic setting of key parameters that determine the model's ability to extract biologically meaningful latent factors from complex data sets (e.g., transcriptomics, proteomics, cytometry from septic patient PBMCs). Proper configuration of factors, likelihoods, and sparsity is paramount for generating interpretable results that can elucidate patient-specific immune dysregulation.

Key Parameter Definitions & Rationale

Number of Factors

The number of latent factors (num_factors) defines the dimensionality of the latent space. Over-specification leads to noise modeling; under-specification misses biological signal.

  • Recommended Setting: Start with 15-25 factors for complex immune cell heterogeneity studies. The model can automatically prune irrelevant factors during training if the drop_factor_threshold training option is set.
  • Thesis Context: In sepsis, factors may correspond to distinct immune cell states (e.g., immunosuppressive monocyte, hyperactivated T-cell), metabolic shifts, or response to specific pathogens.

Likelihoods

Likelihoods specify the statistical distribution for each data view, ensuring proper modeling of different data types.

  • Critical for: Handling dropouts in single-cell RNA-seq, bounded CyTOF data, or binary mutation data.

Sparsity

Sparsity encourages the model to assign loadings of zero for most features on most factors, enhancing interpretability by linking each factor to a small, defined set of omics features.

  • Mechanism: Implemented by combining Automatic Relevance Determination (ARD) with spike-and-slab priors on the weights.
  • Benefit: In sepsis research, this reveals key driver genes/proteins per immune signature.
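The sparsity mechanism can be written compactly. The following is a sketch of the usual spike-and-slab formulation with an ARD precision, using our own notation (m indexes the view, d the feature, k the factor); it is meant as orientation rather than a restatement of the exact model specification.

```latex
w^{m}_{dk} = \hat{w}^{m}_{dk}\, s^{m}_{dk}, \qquad
\hat{w}^{m}_{dk} \sim \mathcal{N}\!\left(0,\ 1/\alpha^{m}_{k}\right), \qquad
s^{m}_{dk} \sim \mathrm{Bernoulli}\!\left(\theta^{m}_{k}\right)
```

Here a large precision \alpha^{m}_{k} shrinks all weights of factor k in view m toward zero (factor-level pruning per view), while a small \theta^{m}_{k} sets most individual weights exactly to zero (feature-level sparsity).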

Table 1: Recommended MOFA+ Parameters for Sepsis Multi-Omics Integration

| Parameter | Description | Recommended Setting for Sepsis Studies | Justification & Impact |
|---|---|---|---|
| num_factors | Number of latent factors to model. | 15-25 (initial). Use automatic pruning. | Balances complexity and signal capture. Pruning removes factors explaining <2-3% variance. |
| likelihoods | Statistical distribution per data view. | "gaussian" for log-normalized bulk RNA-seq and protein data; "poisson" for raw count data (use cautiously); "bernoulli" for binary mutation/CHIP data. | Correct likelihood prevents bias. Gaussian is robust for most transformed assays. |
| spikeslab_weights | Spike-and-slab prior on weights (feature-level sparsity per factor). | TRUE (default). | Critical for interpretability. Identifies key discriminatory omics features per immune phenotype. |
| ard_factors | ARD prior on factors (group-wise factor activity). | TRUE when the model has multiple groups. | Lets a factor be switched off in groups where it explains no variance. |
| ard_weights | ARD prior on weights (view-wise factor activity). | TRUE (recommended with multiple views). | Works in tandem with spike-and-slab; switches a factor off in views it does not explain. |
| drop_factor_threshold | Training option: minimum fraction of variance for factor retention. | 0.02, i.e., 2% (range: 0.005-0.03). | Prunes factors explaining negligible variance, focusing on biologically meaningful drivers. |

Experimental Protocol: Building a MOFA+ Model for Sepsis CyTOF & scRNA-seq Data

Objective: To integrate matched peripheral blood single-cell RNA-seq and CyTOF (surface protein) data from septic patients and controls to identify coordinated immune cell programs.

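Materials: Pre-processed data matrices (samples x features) for each view, with the same samples in every view.

Procedure:

  • Data Preparation: Ensure each data view is a samples x features matrix and that rows refer to the same samples across views. scRNA-seq and CyTOF measure different cells, so cell-level matrices cannot be paired directly; aggregate each modality to a shared unit such as patient or patient-by-cell-type (e.g., pseudobulk expression, median marker intensity). Remove features with near-zero variance.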
    • scRNA-seq: Log-normalize counts (e.g., log1p(CP10K)). Top 3000-5000 highly variable genes.
    • CyTOF: Arcsinh transform with cofactor=5. Use all major lineage and functional markers (~40).
  • Create MOFA Object & Set Parameters:

  • Prepare & Train the Model:

  • Model Inspection & Factor Pruning:

    • Plot total variance explained (plot_variance_explained(mofa_trained)).
    • Inspect factor correlations (plot_factor_cor(mofa_trained)).
    • Prune factors explaining less than the set variance threshold (e.g., 2%).
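The pruning rule in the last step can be sketched as a filter over per-view variance explained. This assumes a factor is dropped only when it falls below the threshold in every view; the function name and R² values are hypothetical.

```python
def prune_factors(r2_by_factor, threshold=0.02):
    """Return names of factors that reach the threshold in at least one view.

    r2_by_factor: {factor_name: {view_name: fraction_of_variance_explained}}
    """
    return [
        name
        for name, per_view in r2_by_factor.items()
        if any(r2 >= threshold for r2 in per_view.values())
    ]

# Hypothetical variance explained (as fractions) for two views.
r2_by_factor = {
    "Factor1": {"RNA": 0.22, "CyTOF": 0.18},
    "Factor2": {"RNA": 0.005, "CyTOF": 0.03},   # active only in CyTOF
    "Factor3": {"RNA": 0.004, "CyTOF": 0.01},   # below threshold everywhere
}

kept = prune_factors(r2_by_factor, threshold=0.02)
```

Keeping view-specific factors such as Factor2 matters here, because a factor that is weak in RNA can still capture protein-level signal.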

Signaling Pathway & Workflow Diagrams

[Diagram] Multi-omics Data (scRNA-seq, CyTOF, Plasma Proteomics) → Data Pre-processing (Normalization, HVG selection) → Set Key Parameters: Factors, Likelihoods, Sparsity → Build MOFA+ Model (Train using VB inference) → Model Diagnostics (Variance Explained, Factor Cor) → Downstream Analysis: Factor vs. Clinical Metadata, Driver Feature Identification → Interpretable Latent Factors (Immune Cell Programs in Sepsis). Core parameter decisions: Number of Factors (initial: 15-25); Likelihoods (Gaussian, Poisson); Sparsity (TRUE; spike-and-slab).

Diagram Title: MOFA+ Model Building Workflow for Sepsis Multi-omics

[Diagram] Latent Factor Z → (loadings) → Weight Matrix W_rna → (models) → scRNA-seq View (Gaussian likelihood); Latent Factor Z → (loadings) → Weight Matrix W_cytof → (models) → CyTOF View (Gaussian likelihood). A Sparsity Prior (spike-and-slab / ARD) regulates both weight matrices.

Diagram Title: MOFA+ Graphical Model with Sparsity

The Scientist's Toolkit: Research Reagent Solutions

Table 2: Essential Reagents & Tools for Sepsis Multi-omics MOFA+ Analysis

| Item | Function in Protocol | Example Product/Software (Non-exhaustive) |
|---|---|---|
| Single-Cell RNA-seq Kit | Generate transcriptome view from PBMCs. | 10x Genomics Chromium Next GEM Single Cell 3' Kit. |
| CyTOF Antibody Panel | Metal-isotope-tagged antibodies for deep immunophenotyping. | Maxpar Direct Immune Profiling Assay (Standardized Panel). |
| Cell Hashing/Oligo-tagged Antibodies | Multiplex samples for scRNA-seq to reduce batch effects. | BioLegend TotalSeq-C Anti-Human Hashtag Antibodies. |
| Viability Stain | Distinguish live/dead cells prior to sequencing/CyTOF. | Cisplatin (for CyTOF), Propidium Iodide or DAPI (for flow). |
| MOFA2 Software | Core R package for model building and analysis. | MOFA2 (Bioconductor). |
| Multi-omics Pre-processing Pipeline | Standardize data from raw files to input matrices. | Cell Ranger (10x), FlowJo/Cytobank (CyTOF), Seurat/Scanpy. |
| High-Performance Computing (HPC) Resource | Train MOFA+ models on many factors/features. | Local Linux cluster or cloud instance (e.g., AWS, GCP). |

Application Notes

In the context of applying MOFA+ to immune cell heterogeneity in sepsis, Stage 3 is critical for deriving a robust, interpretable model. This phase determines whether the latent factors capture biologically relevant sources of variation, such as differences in patient outcomes, infection sources, or dynamic immune responses, rather than technical noise.

Key Considerations for Sepsis Research:

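  • Convergence and Stability: Convergence means the variational optimization has reached a stable ELBO; it does not by itself show that the factors are biologically meaningful. To check that the multi-omics data (e.g., transcriptomics, proteomics, metabolomics from PBMCs) consistently point to the same immune axes, re-train with different random seeds and compare the resulting factors. Non-convergence may suggest high patient-to-patient heterogeneity or confounding batch effects common in ICU cohorts.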
  • Factor Interpretation: Post-training, factors are correlated with sample metadata. In sepsis, we anticipate factors associated with clinical severity (SOFA/APACHE II scores), immune phenotypes (e.g., immunosuppressive vs. hyperinflammatory), or outcomes (survivor vs. non-survivor).
  • Diagnostic Failures: Poor convergence or model fit can reveal data issues, such as dominant batch effects from multi-center sampling or highly sparse metabolomics data, which require strategic pre-processing.

Quantitative Diagnostics Table:

| Diagnostic Metric | Target Value | Biological Interpretation in Sepsis | Common Issue & Remedy |
|---|---|---|---|
| Evidence Lower Bound (ELBO) | Must increase and stabilize over iterations. | Indicates the model is successfully integrating omics layers to explain immune variation. | Stagnation may require increased num_factors or review of data scaling. |
| Delta ELBO (Convergence Threshold) | Below a small tolerance (e.g., < 0.01% change per iteration); in MOFA2 the tolerance is set through convergence_mode ("fast", "medium", "slow"). | Model has found a stable representation of multi-omics immune states. | Failure to converge may indicate extreme heterogeneity; subset analysis by patient group may be needed. |
| Variance Explained (R²) | Factor-wise: >1% per factor. Total: model should capture a significant portion of biological variance. | Quantifies how much of the immune cell heterogeneity each factor explains. A "sepsis severity" factor should explain variance in key inflammatory genes. | Low variance explained suggests strong unmodelled noise (e.g., cellular stress signatures); consider cell-type deconvolution as a covariate. |
| Factor Correlations | Factors should be largely uncorrelated. | High correlation suggests redundant factors. | Reduce num_factors. |
| Kullback–Leibler (KL) Divergence | Should stabilize; high values indicate poor regularization. | Measures prior-posterior divergence per factor. Stabilization indicates well-regularized latent spaces. | Spiking KL for a factor suggests it models noise; increase sparsity settings. |
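The ELBO-based diagnostics in the table can be illustrated with a short check on a recorded ELBO trace. This sketch measures the change relative to the previous value and uses an illustrative tolerance of 0.01%; the exact reference value and tolerance used by a given MOFA+ release may differ, and the trace is hypothetical.

```python
def is_non_decreasing(elbo):
    """True if the ELBO never decreases between recorded iterations."""
    return all(b >= a for a, b in zip(elbo, elbo[1:]))

def converged_at(elbo, tol_percent=0.01):
    """Index of the first iteration whose relative ELBO change (in %) is below tol_percent.

    Returns None if the trace never meets the tolerance.
    """
    for i in range(1, len(elbo)):
        delta_percent = abs(elbo[i] - elbo[i - 1]) / abs(elbo[i - 1]) * 100.0
        if delta_percent < tol_percent:
            return i
    return None

# Hypothetical ELBO trace recorded once per iteration.
elbo_trace = [-1000.0, -800.0, -790.0, -789.95, -789.949]

monotone = is_non_decreasing(elbo_trace)
stop_iteration = converged_at(elbo_trace)
```

A trace that is monotone but never meets the tolerance usually just needs more iterations, whereas a trace that decreases points to a numerical or data problem.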

Experimental Protocols

Protocol 1: MOFA+ Model Training for Sepsis Multi-Omics Data

Objective: Train a MOFA+ model on integrated multi-omics data from septic patient peripheral blood mononuclear cells (PBMCs) to identify latent factors of immune heterogeneity.

Materials:

  • Pre-processed multi-omics data matrices (e.g., RNA-seq, CyTOF, Olink proteomics).
  • Sample metadata table (Clinical scores, outcome, day post-onset).
  • R environment (v4.0+) with MOFA2 package installed.

Procedure:

  • MOFA Object Creation:

  • Data Options Configuration:

  • Model Options Configuration:

  • Training Options Configuration (Critical for Convergence):

  • Model Training:

Protocol 2: Comprehensive Convergence Diagnostics

Objective: Assess model training success and stability.

Procedure:

  • ELBO Trajectory Plot:

  • Calculate Delta ELBO:

  • Variance Explained Calculation:

  • Factor Correlation Analysis:
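The factor correlation check can be sketched as a pairwise Pearson correlation over factor values, flagging pairs above a cutoff. The cutoff of 0.5 and the factor values are assumptions chosen for illustration.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlated_pairs(factors, cutoff=0.5):
    """List (factor_a, factor_b, r) for pairs with |r| above the cutoff."""
    pairs = []
    for a, b in combinations(sorted(factors), 2):
        r = pearson(factors[a], factors[b])
        if abs(r) > cutoff:
            pairs.append((a, b, r))
    return pairs

# Hypothetical factor values for four samples.
factors = {
    "F1": [1.0, 2.0, 3.0, 4.0],
    "F2": [2.0, 4.0, 6.0, 8.5],    # nearly a rescaled copy of F1
    "F3": [1.0, -1.0, -1.0, 1.0],  # unrelated to F1
}

redundant = correlated_pairs(factors)
```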

Protocol 3: Biological Interpretation in Sepsis Context

Objective: Correlate latent factors with clinical metadata to generate hypotheses.

Procedure:

  • Integrate Sample Metadata:

  • Correlation Plotting:

  • Feature Inspection: For a factor correlated with mortality, extract top-weighted features.

    Perform pathway enrichment analysis (e.g., using fgsea) on these genes.
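The feature inspection step can be sketched as ranking features by absolute weight and splitting them by sign, which also yields the signed ranking that rank-based enrichment tools take as input. The gene names and weights are illustrative values, not results.

```python
def top_features(weights, n=3):
    """Names of the n features with the largest absolute weight."""
    return sorted(weights, key=lambda g: abs(weights[g]), reverse=True)[:n]

def split_by_sign(weights, names):
    """Separate selected features into positively and negatively weighted sets."""
    positive = [g for g in names if weights[g] > 0]
    negative = [g for g in names if weights[g] < 0]
    return positive, negative

def ranked_list(weights):
    """All features ordered from most positive to most negative weight."""
    return sorted(weights, key=lambda g: weights[g], reverse=True)

# Hypothetical weights of one factor on five genes.
weights = {"S100A8": 0.9, "HLA-DRA": -0.8, "IL1B": 0.5, "ACTB": 0.01, "CD74": -0.6}

top = top_features(weights, n=3)
up, down = split_by_sign(weights, top)
ranking = ranked_list(weights)
```

The sign only has meaning relative to the factor: samples with high factor values have higher levels of the positively weighted features and lower levels of the negatively weighted ones.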

Visualizations

[Diagram] Pre-processed Multi-Omics Data → Set Training Options (Convergence Mode, Factors) → Model Training (run_mofa) → Convergence Diagnostics (ELBO Plot, Delta ELBO). Not converged: Adjust Parameters or Review Data, then return to Set Training Options. Converged: Biological Interpretation (Variance Exp., Correlation) → Trained MOFA Model for Downstream Analysis.

Model Training & Diagnostics Workflow

[Diagram] Transcriptomics and Proteomics contribute to Factor 1 'Immunosuppression', Factor 2 'Inflammatory Response', and Factor N; Metabolomics contributes to Factor 1 and Factor N. Factor 1 → High SOFA Score, Mortality; Factor 2 → Day 3 Post-Onset.

MOFA+ Links Omics to Clinical Sepsis Features

The Scientist's Toolkit: Research Reagent Solutions

| Item / Reagent | Function in MOFA+ Sepsis Analysis |
|---|---|
| MOFA2 R Package | Core software for multi-omics factor analysis and model training. |
| Seurat or SingleCellExperiment | For initial processing and QC of single-cell or bulk transcriptomics data prior to MOFA+ input. |
| Olink Target 96/384 Panels | Multiplex immunoassays for high-throughput, validated plasma protein biomarker quantification. |
| Maxpar Antibodies (for CyTOF) | Metal-tagged antibodies for deep immune phenotyping via mass cytometry. |
| RNeasy Kits (Qiagen) | Reliable RNA extraction from PBMCs for subsequent RNA-seq library prep. |
| CLIA-grade Clinical Metadata Database | Structured collection of SOFA/APACHE scores, outcomes, and timelines for robust factor correlation. |
| fgsea R Package | Fast gene set enrichment analysis for interpreting factor weights biologically. |
| High-Performance Computing (HPC) Cluster | Essential for training MOFA+ models on large, multi-omics sepsis cohorts in a reasonable time. |

Following the identification of latent factors through MOFA+ in a multi-omics sepsis dataset (e.g., transcriptomics, proteomics, cytometry), Stage 4 focuses on biological and clinical interpretation. This stage bridges statistical abstraction with tangible biology by correlating MOFA+ factors with annotated immune cell frequencies (from cytometry or deconvolution) and key clinical parameters (e.g., SOFA score, survival, infection source). The goal is to translate factors into hypotheses regarding immune cell dysregulation and patient stratification in sepsis.

Key Quantitative Data from Factor Interpretation

The following tables summarize typical correlation outputs from a MOFA+ analysis of sepsis multi-omics data.

Table 1: Top Factor-Immune Cell Subset Correlations (Example)

MOFA+ Factor | Immune Cell Subset (Source: CyTOF/Flow) | Correlation (r) | p-value (adjusted) | Proposed Biological Interpretation
Factor 1 | Monocytic Myeloid-Derived Suppressor Cells (M-MDSCs) | +0.85 | 1.2e-10 | Myeloid Suppression & Immunoparalysis
Factor 2 | Classical CD14++ Monocytes | -0.72 | 3.5e-07 | Depletion of Inflammatory Monocytes
Factor 3 | CD8+ Effector Memory T Cells | +0.68 | 2.1e-06 | T Cell Exhaustion Signature
Factor 4 | Neutrophils (CD66b+/CD16+) | +0.91 | 5.0e-12 | Neutrophil Activation & NETosis
Factor 5 | Regulatory T Cells (Tregs) | +0.61 | 8.7e-05 | Immunosuppressive Regulation

Table 2: Factor-Clinical Feature Associations

MOFA+ Factor | Clinical Feature | Association Metric | p-value | Clinical Interpretation
Factor 1 (M-MDSC) | 28-Day Mortality | Hazard Ratio: 2.34 [1.5-3.6] | 0.001 | High factor score predicts mortality
Factor 2 (Monocyte Depletion) | Sequential Organ Failure Assessment (SOFA) Score | Spearman's ρ: +0.65 | 4.0e-05 | Correlates with organ dysfunction
Factor 4 (Neutrophil) | Source of Infection (Gram-negative vs. Gram-positive) | t-test: t=4.1, df=45 | 0.0002 | Higher in Gram-negative sepsis
Factor 5 (Treg) | Secondary Infection Rate | Odds Ratio: 3.1 [1.8-5.3] | 0.002 | Predicts nosocomial infection risk

Experimental Protocols for Validation

Protocol 3.1: Flow Cytometry Validation of Latent Factor-Associated Immune Cell Subsets

Purpose: To experimentally quantify the immune cell subsets whose frequencies are strongly associated with specific MOFA+ factors (e.g., M-MDSCs, Tregs).

Materials: See "Research Reagent Solutions" below.

Procedure:

  • Sample Preparation: Thaw frozen PBMCs from septic patients and matched controls in complete RPMI medium. Count cells and assess viability.
  • Surface Marker Staining:
    a. Prepare antibody cocktail in FACS buffer (PBS + 2% FBS). For M-MDSCs: anti-CD11b, CD14, CD15, HLA-DR; for Tregs: anti-CD3, CD4, CD25, CD127.
    b. Add 100µL of cell suspension (1x10^6 cells) to a V-bottom plate. Pellet and resuspend in 50µL antibody cocktail.
    c. Incubate for 30 minutes at 4°C in the dark. Wash twice with FACS buffer.
  • Fixation and Permeabilization (for intracellular markers like FoxP3 for Tregs):
    a. Fix cells with FoxP3 Transcription Factor Fixation/Permeabilization buffer for 30 min at 4°C.
    b. Wash with 1X Permeabilization Buffer.
    c. Stain intracellularly with anti-FoxP3 antibody (30 min, 4°C). Wash.
  • Acquisition and Analysis:
    a. Acquire data on a flow cytometer (e.g., BD Fortessa). Collect ≥100,000 events per sample.
    b. Analyze using FlowJo: gate on live singlets, then identify subsets (e.g., HLA-DRlowCD14+ for M-MDSCs; CD3+CD4+CD25+CD127lowFoxP3+ for Tregs).
    c. Correlate subset frequency with the corresponding MOFA+ factor scores using Spearman's rank correlation.

Protocol 3.2: Functional Validation of an Immunosuppressive Factor (e.g., Factor 1)

Purpose: To test the functional immunosuppressive capacity of cell subsets associated with a high-scoring factor (e.g., M-MDSCs from high Factor 1 patients).

Procedure:

  • Isolation of M-MDSCs: Sort HLA-DRlowCD14+ cells from patient PBMCs using a FACS sorter (purity >90%).
  • T Cell Suppression Assay:
    a. Label CD3+ T cells (isolated from healthy donor PBMCs by magnetic negative selection) with CellTrace Violet.
    b. Co-culture labeled T cells (2x10^4) with titrated numbers of sorted patient M-MDSCs (M-MDSC:T cell ratios of 1:2, 1:4, 1:8) in a 96-well round-bottom plate. Include wells with labeled T cells alone as the reference condition.
    c. Stimulate T cells with anti-CD3/CD28 beads.
    d. After 72-96 hours, harvest cells and analyze by flow cytometry for:
       i. Proliferation: Dilution of CellTrace Violet.
       ii. Activation: Surface expression of CD69, CD25.
    e. Calculate % T cell suppression: [1 - (proliferated T cells in co-culture / proliferated T cells alone)] * 100.
  • Correlation Analysis: Correlate the in vitro suppression index from each patient's M-MDSCs with the patient's Factor 1 score from MOFA+.
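The suppression formula from the assay can be checked with a short worked example. The sketch below applies it to hypothetical proliferation percentages at three M-MDSC:T cell ratios. All numbers are invented for illustration, and `percent_suppression` is a hypothetical helper name.

```python
def percent_suppression(prolif_coculture, prolif_alone):
    """[1 - (proliferated in co-culture / proliferated alone)] * 100."""
    if prolif_alone <= 0:
        raise ValueError("T cells alone must show proliferation > 0")
    return (1.0 - prolif_coculture / prolif_alone) * 100.0


# Hypothetical % of divided T cells (CellTrace Violet diluted)
prolif_t_cells_alone = 80.0
prolif_by_ratio = {"1:2": 20.0, "1:4": 40.0, "1:8": 60.0}

suppression = {
    ratio: percent_suppression(value, prolif_t_cells_alone)
    for ratio, value in prolif_by_ratio.items()
}
for ratio, value in suppression.items():
    print(f"M-MDSC:T cell {ratio}: {value:.1f}% suppression")
```

A dose-dependent decrease in suppression as M-MDSCs are titrated out supports a genuine suppressive effect. The per-patient value at a fixed ratio is what would be correlated with the Factor 1 score.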

The Scientist's Toolkit: Research Reagent Solutions

Item (Catalog Example) | Function in This Context
Human PBMCs (from sepsis cohorts) | Primary cellular material for multi-omics analysis and validation.
Anti-human CD14, HLA-DR, CD11b, CD15 antibodies | Surface staining for identification of myeloid-derived suppressor cell (MDSC) subsets via flow cytometry.
Anti-human CD3, CD4, CD25, CD127, FoxP3 antibodies | Staining panel for identification and quantification of regulatory T cells (Tregs).
FoxP3 / Transcription Factor Staining Buffer Set | Permeabilization and fixation for intracellular staining of key transcription factors (e.g., FoxP3).
CellTrace Violet Cell Proliferation Kit | Fluorescent dye to track and quantify division of T cells in functional suppression assays.
Human T Cell Activation/Expansion Kit (anti-CD3/CD28 beads) | Polyclonal stimulation of T cells for functional co-culture assays.
MOFA+ R/Bioconductor Package (v1.8+) | Primary tool for multi-omics factor analysis and extracting factor weights/scores.
High-dimensional Flow Cytometer (e.g., Cytek Aurora) | Instrument for deep immunophenotyping to validate cell subsets associated with latent factors.

Visualization Diagrams

Diagram 1: Stage 4 Interpretation Workflow

[Diagram: MOFA+ model outputs (latent factor scores and weights; multi-omics input matrix) feed four interpretation steps: 1. correlation analysis → 2. biological annotation → 3. clinical association → 4. hypothesis generation → biologically and clinically meaningful factors. External immune cell frequencies are correlated at step 2; clinical metadata (e.g., SOFA, outcome) are associated at step 3.]

Diagram 2: Validation of an Immunosuppressive Factor

[Diagram: high Factor 1 score (M-MDSC signature) → isolate M-MDSCs (FACS sort) → functional assay: co-culture with healthy T cells → readout: T cell proliferation and activation → statistical link: correlate suppression index with Factor 1 score → validated mechanism: Factor 1 = functional immunosuppression.]

Application Notes: Interpreting MOFA+ Factors in Sepsis Immunology

Following the application of MOFA+ to multi-omics data (e.g., scRNA-seq, ATAC-seq, proteomics) from sepsis patient immune cells, the model outputs latent factors that capture coordinated variance across modalities. This stage translates these statistical factors into biological insights.

Key Analytical Goals:

  • Driver Gene Identification: Link factor loadings to specific genes/motifs driving heterogeneity.
  • Pathway & Program Annotation: Determine which biological pathways and pre-defined gene sets (e.g., inflammatory response, metabolic reprogramming, anergy) are enriched in factor-weighted genes.
  • Cellular Program Attribution: Map factors back to cell populations (using factor values per sample/cell) to define cellular states prevalent in specific clinical phenotypes (e.g., septic shock vs. uncomplicated sepsis).

Quantitative Data Summary:

Table 1: Representative Downstream Analysis Output for a Hypothetical MOFA+ Model on Sepsis PBMCs (3 Factors Shown).

Factor | Variance Explained (RNA / ATAC) | Top Driver Genes (RNA Loadings) | Top Pathway Enrichment (FDR <0.05) | Associated Clinical Phenotype (Correlation)
Factor 1 | 12% / 8% | S100A8, S100A9, IL1B, CXCL8 | GO:0006954 Inflammatory Response (FDR=1.2e-10); Hallmark: TNFα Signaling via NF-κB (FDR=3.5e-8) | Positive correlation with SOFA score (r=0.72)
Factor 2 | 9% / 5% | MT-CO1, MT-ND4, NDUFA4, COX7A1 | GO:0022900 Electron Transport Chain (FDR=2.1e-9); Hallmark: Oxidative Phosphorylation (FDR=6.7e-7) | Negative correlation with mortality (r=-0.61)
Factor 3 | 7% / 10% | TOX, LAG3, TIGIT, PDCD1 | GO:0031295 T Cell Costimulation (FDR=4.8e-6) | Positive correlation with duration of ICU stay (r=0.58)

Experimental Protocols

Protocol 1: Annotation of Factors using Gene Set Enrichment Analysis (GSEA).

Objective: To test if genes ranked by MOFA+ loadings for a given factor show statistically significant enrichment in known biological pathways.

Materials: See "The Scientist's Toolkit" below.

Procedure:

  • Extract Loadings: For the target factor, extract the feature loadings for the RNA modality using get_weights(model, views="RNA").
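  • Rank Genes: Sort genes by their signed loadings in descending order so that the direction of enrichment is preserved. If genes are ranked by absolute loading instead, all scores are non-negative and a one-sided enrichment test is needed (e.g., scoreType="pos" in fgsea).
  • Prepare Gene Set Database: Download relevant gene sets (e.g., MSigDB Hallmarks, KEGG, custom sepsis/immunology sets) in .gmt format.
  • Run Pre-Ranked GSEA: Use the fgsea R package or GSEA software. Input the ranked gene list and gene set database. Set parameters: minSize=15, maxSize=500. A permutation count (nperm=10000) applies only to the simple permutation procedure (fgseaSimple); the default multilevel fgsea procedure does not take nperm.
  • Interpret Results: Filter results for an absolute Normalized Enrichment Score (|NES|) > 1.5 and False Discovery Rate (FDR) < 0.1. Leading-edge analysis identifies core enriched genes.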
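The ranking and filtering steps around the enrichment run can be sketched in a few lines. This is a minimal illustration in plain Python and does not run GSEA itself. The weights and the enrichment results are invented, and `rank_genes` and `filter_enrichment` are hypothetical helper names. The `signed` switch shows how a signed ranking and an absolute-value ranking order the same genes differently.

```python
def rank_genes(weights, signed=True):
    """Order genes by weight, descending. signed=True keeps direction;
    signed=False ranks by magnitude only."""
    if signed:
        key = lambda g: weights[g]
    else:
        key = lambda g: abs(weights[g])
    return sorted(weights, key=key, reverse=True)


def filter_enrichment(results, nes_cutoff=1.5, fdr_cutoff=0.1):
    """Keep pathways with |NES| above the cutoff and FDR below the cutoff."""
    return [
        r for r in results
        if abs(r["NES"]) > nes_cutoff and r["FDR"] < fdr_cutoff
    ]


# Hypothetical RNA weights for one factor
weights = {"S100A8": 0.82, "IL1B": 0.40, "HLA-DRA": -0.75, "CD74": -0.30}
signed_rank = rank_genes(weights, signed=True)
abs_rank = rank_genes(weights, signed=False)

# Hypothetical enrichment output (illustrative values only)
results = [
    {"pathway": "HALLMARK_INFLAMMATORY_RESPONSE", "NES": 2.1, "FDR": 0.001},
    {"pathway": "HALLMARK_OXIDATIVE_PHOSPHORYLATION", "NES": -1.8, "FDR": 0.04},
    {"pathway": "HALLMARK_MYOGENESIS", "NES": 1.2, "FDR": 0.30},
]
kept = [r["pathway"] for r in filter_enrichment(results)]
print(signed_rank, abs_rank, kept)
```

With the signed ranking, strongly negative genes such as HLA-DRA sit at the bottom of the list, so a pathway can be reported as depleted (negative NES) rather than being mixed with the positively weighted genes.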

Protocol 2: Linking Factors to Cellular Programs via Factor Value Plotting.

Objective: To visualize the association between latent factors and cell-type or sample-level metadata.

Procedure:

  • Extract Factor Values: Obtain factor values for each sample (or cell, if using single-cell data) using get_factors(model).
  • Integrate with Metadata: Merge the factor values matrix with sample metadata (e.g., cell type annotation, patient clinical group, SOFA score).
  • Statistical Testing: For categorical metadata (e.g., cell type), perform Kruskal-Wallis test to assess if factor values differ between groups. For continuous metadata (e.g., SOFA), compute Spearman's rank correlation.
  • Visualization:
    • Boxplots: Plot factor values stratified by key metadata (e.g., cell type). Use ggplot2 in R or seaborn in Python.
    • Scatter Plots: Plot factor values against continuous clinical variables.
    • UMAP Overlay: For single-cell data, create a UMAP of cells and color points by their factor values to visualize how the learned cellular program is distributed across the embedding.
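For the categorical test in the statistical testing step, the Kruskal-Wallis H statistic can be computed by hand as a sanity check on library output. The sketch below is a simplified version without tie correction and without a p-value, which would come from a chi-square distribution with (number of groups - 1) degrees of freedom. The factor values per cell type are invented.

```python
def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_h(groups):
    """Kruskal-Wallis H statistic (no tie correction) for a list of groups."""
    pooled = [v for g in groups for v in g]
    ranks = average_ranks(pooled)
    n_total = len(pooled)
    weighted_sum, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        weighted_sum += rank_sum ** 2 / len(g)
        start += len(g)
    return 12.0 / (n_total * (n_total + 1)) * weighted_sum - 3.0 * (n_total + 1)


# Hypothetical factor values for three cell types
factor_by_cell_type = {
    "monocytes": [1.0, 2.0, 3.0],
    "T cells": [4.0, 5.0, 6.0],
    "NK cells": [7.0, 8.0, 9.0],
}
h_stat = kruskal_h(list(factor_by_cell_type.values()))
print(round(h_stat, 3))
```

In a real analysis a statistics library that applies the tie correction and returns the p-value is preferable; the manual version only makes the calculation explicit.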

Visualization

[Diagram: MOFA+ model output → 1. extract loadings and rank features, which feeds both 2. functional enrichment (GSEA, over-representation test) and 3. integration of factor values with metadata → 4. statistical association (tests and correlation). Steps 2 and 4 converge on biological interpretation: driver genes, pathways, and cellular programs.]

Downstream Analysis Workflow from MOFA+ Output

Pathway Enriched in Sepsis Factor: TLR4/NF-κB

The Scientist's Toolkit: Key Research Reagent Solutions

Table 2: Essential Resources for Downstream Analysis of MOFA+ Models in Immunology.

Item | Function / Purpose | Example Product / Resource
GSEA Software | Performs gene set enrichment analysis on ranked gene lists. | Broad Institute GSEA (v4.3) or fgsea R package.
Gene Set Databases | Collections of curated biological pathways and signatures for annotation. | MSigDB (Hallmarks, C7), KEGG, Reactome, custom sepsis gene sets.
Single-Cell Analysis Suite | For integrating factor values with cell metadata and visualization. | Seurat (R), Scanpy (Python).
Metadata Management Tool | Critical for associating factor values with clinical/cellular phenotypes. | Structured clinical data tables, pandas (Python), tidyverse (R).
Visualization Libraries | Creates publication-quality plots of factor interpretations. | ggplot2 (R), matplotlib/seaborn (Python), ComplexHeatmap (R).
Functional Enrichment Tools | Web-based tools for quick validation of enrichment results. | Enrichr, DAVID, Metascape.

Solving Real-World Challenges: Optimizing MOFA+ for Noisy, Clinical Sepsis Data

Application Notes: Integrating Sepsis Single-Cell Datasets with MOFA+

Sepsis is characterized by a dysregulated and highly heterogeneous immune response, making its study via single-cell genomics both essential and challenging. Research combining multiple cohorts is critical for robust biomarker discovery but introduces significant technical (batch) effects and confounding biological variability. MOFA+ (Multi-Omics Factor Analysis) is a statistical framework designed for the integration of multi-view data, capable of disentangling shared biological signals from dataset-specific technical artifacts and unwanted biological variation.

Key Quantitative Insights on Variability in Sepsis Studies

Table 1: Sources of Variability in Typical Sepsis Single-Cell Studies

Variability Type | Primary Source | Typical Impact (% Variance) | MOFA+ Factor Classification
Technical Batch | Sequencing lane, processing date, reagent lot | 10-40% | Dataset-specific factor(s)
Patient Biological | Genetic background, comorbidities, age, sex | 30-60% | Patient-specific factor(s)
Sepsis Subtype Biology | Immune phenotype (e.g., immunosuppressed, hyperinflammatory) | 15-35% | Shared factor(s) across datasets
Cell Type Proportion | Differences in immune cell composition between patients | 20-50% | Can be captured by cell-type-specific factor loadings

Table 2: MOFA+ Model Diagnostics for Sepsis Data Integration

Model Parameter / Check | Recommended Setting for Sepsis | Rationale
Number of Factors | Auto-detection (≥15 suggested) | Captures multiple layers of biological and technical heterogeneity.
Variance Explained Threshold | 2-5% per factor (min) | Filters noise, focuses on meaningful sources of variation.
Batch Effect Correction | Use "Group" argument for dataset ID; do NOT center groups. | Explicitly models dataset as a covariate without removing inter-dataset biology.
Key Output | Factor 1 (e.g., Major sepsis vs. control split) | Shared across views (RNA, ATAC, etc.) and datasets.
Key Output | Factor 2+ (e.g., Neutrophil activation, T cell exhaustion) | May be shared or dataset-specific.
Key Output | Final Factors (High-variance, patient-specific noise) | Modeled as private patient effects.

Experimental Protocols

Protocol 1: Preprocessing and Input Data Preparation for MOFA+

  • Data Acquisition: Obtain single-cell RNA-seq (scRNA-seq) count matrices and metadata from ≥2 independent sepsis cohorts (e.g., public repositories: GSE167363, GSE212109).
  • Independent Normalization: For each dataset separately, perform standard preprocessing (QC, normalization, log1p transformation) using Seurat or Scanpy. Do not integrate or correct batches at this stage.
  • Common Feature Selection: Identify the union of highly variable genes (HVGs) from each dataset. Intersect this union with a pan-immune gene list to retain biologically relevant features, yielding ~5,000-10,000 shared features.
  • Data Object Creation: Create an h5mu (MuData) file or a list of matrices where each "view" is the expression matrix of the shared features, and each "group" is a distinct patient cohort.
  • Metadata Alignment: Ensure a unified metadata table with columns for: sample_id (patient), group (dataset origin), clinical_status (septic/control), outcome, and key demographics.
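The common feature selection step reduces to a few set operations. The sketch below is a minimal plain-Python illustration with invented gene lists. `select_shared_features` is a hypothetical helper, and the optional `measured_by_cohort` argument, which additionally requires each gene to be measured in every cohort, is an assumption added here rather than a step stated in the protocol.

```python
def select_shared_features(hvgs_by_cohort, immune_panel, measured_by_cohort=None):
    """Union of per-cohort HVGs, intersected with a pan-immune gene panel.
    Optionally keep only genes measured in every cohort."""
    union = set()
    for genes in hvgs_by_cohort.values():
        union |= set(genes)
    selected = union & set(immune_panel)
    if measured_by_cohort is not None:
        for genes in measured_by_cohort.values():
            selected &= set(genes)
    return sorted(selected)


# Hypothetical HVG lists per cohort and a hypothetical immune panel
hvgs = {
    "cohort_A": ["S100A8", "IL1B", "ACTB"],
    "cohort_B": ["IL1B", "CD3E", "GAPDH"],
}
panel = ["S100A8", "IL1B", "CD3E", "CD19"]
measured = {
    "cohort_A": ["S100A8", "IL1B", "ACTB", "CD3E"],
    "cohort_B": ["IL1B", "CD3E", "GAPDH"],
}

features = select_shared_features(hvgs, panel)
features_strict = select_shared_features(hvgs, panel, measured)
print(features, features_strict)
```

Taking the union rather than the intersection of HVGs keeps genes that are variable in only one cohort, which matters when cohorts differ in sepsis severity or cell composition.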

Protocol 2: MOFA+ Model Training and Evaluation

  • Model Initialization:

  • Model Options & Training:

  • Factor Diagnostics:

    • Plot variance explained per factor (plot_variance_explained).
    • Identify factors correlated with the group covariate (batch effects) vs. clinical_status (biological signal).
    • Inspect factor weights (loadings) to annotate factors (e.g., "Neutrophil Factor", "Lymphocyte Factor", "Dataset-3 Batch Effect").

Protocol 3: Downstream Analysis on Corrected Data

  • Factor Value Extraction: Extract the matrix of factor values (samples x factors).
  • Regress Out Undesired Variance: Select factors identified as technical batch or unwanted patient-specific noise. Use their values as covariates in a linear model to adjust the original normalized expression data, creating a "residual" matrix.
  • Integrated Clustering: Use the batch-corrected residual matrix for downstream analysis (PCA, clustering, UMAP) across all cohorts simultaneously to identify conserved cell states.
  • Differential Analysis: Perform differential expression or pathway analysis on factor-adjusted residuals to identify sepsis-specific signatures robust to batch and patient variability.
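The regress-out step can be illustrated for a single gene and a single batch factor using ordinary least squares. This is a deliberately reduced sketch: with several unwanted factors a multiple regression over all of them would be fitted instead, normally with a linear algebra library. The factor values and expression values are invented, and `regress_out` is a hypothetical helper name.

```python
def regress_out(expression, factor):
    """Residuals of a simple linear regression of expression on one factor.
    The returned residuals are centered (mean zero)."""
    n = len(factor)
    if n != len(expression):
        raise ValueError("expression and factor must have equal length")
    f_mean = sum(factor) / n
    f_centered = [f - f_mean for f in factor]
    f_ss = sum(f * f for f in f_centered)
    if f_ss == 0:
        raise ValueError("factor has zero variance")
    y_mean = sum(expression) / n
    slope = sum(fc * (y - y_mean) for fc, y in zip(f_centered, expression)) / f_ss
    return [y - y_mean - slope * fc for y, fc in zip(expression, f_centered)]


# Hypothetical batch factor values (two cohorts) and one gene's expression
batch_factor = [-1.0, -1.0, 1.0, 1.0]
gene_expression = [2.0, 3.0, 6.0, 7.0]

residuals = regress_out(gene_expression, batch_factor)
print(residuals)

# After adjustment the residuals carry no linear association with the factor
leftover = sum(r * f for r, f in zip(residuals, batch_factor))
print(abs(leftover) < 1e-9)
```

Any biological signal that is collinear with a regressed factor is removed along with it, so factors should be regressed out only when their weights and metadata associations identify them as technical.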

Visualization

[Diagram: Input (multiple sepsis cohorts): Cohort A and Cohort B scRNA-seq, together with technical and biological variability, form the grouped data input. MOFA+ integration model: dimensionality reduction identifies latent factors (shared biology; batch/noise factors). Output (disentangled variation): factors are interpreted to give robust sepsis signatures and regressed out to give a batch-adjusted feature matrix.]

Title: MOFA+ Workflow for Sepsis Data Integration

[Diagram: total variance in raw data is decomposed by the model into Factor 1 (shared biology, e.g., sepsis vs. control), Factor 2 (shared biology, e.g., neutrophil module), Factor N (dataset-specific batch effect), and further patient-specific (noise) factors.]

Title: MOFA+ Decomposes Total Data Variance

The Scientist's Toolkit

Table 3: Essential Research Reagents & Tools for Sepsis Single-Cell Analysis

Item / Resource | Function in Context | Example/Provider
10x Genomics Chromium | High-throughput single-cell RNA-seq library preparation. | Immune Profiling Solution
Cell Hashing Antibodies | Multiplex patient samples in one run, reducing technical batch effects. | BioLegend TotalSeq-A
Pan-immune Gene Panel | Curated list for feature selection, focusing analysis on immune-relevant biology. | MSigDB "Immune Signatures"
MOFA2 R/Python Package | Core tool for multi-group, multi-view data integration. | GitHub (bioFAM/MOFA2)
MuData / anndata | Interoperable data structure for storing multimodal single-cell data. | (muon-)scverse ecosystem
Seurat or Scanpy | Standard toolkits for initial single-cell data QC, normalization, and HVG selection. | Satija Lab / Theis Lab

Application Notes: Integrating Multi-Omics Data in Sepsis Immunomics

Within the broader thesis on applying MOFA+ to deconvolute immune cell heterogeneity in sepsis, selecting the correct number of latent factors (k) is the critical step that determines biological interpretability versus statistical noise. Under-fitting (too few factors) obscures genuine biological signal, collapsing distinct immune cell states. Over-fitting (too many factors) models technical noise, creating spurious, non-reproducible "cell states" that misdirect hypothesis generation.

Table 1: Quantitative Metrics for Optimal Factor Selection in MOFA+

Metric | Description | Ideal Value/Pattern in Optimal k | Interpretation in Sepsis Context
ELBO (Evidence Lower Bound) | Model evidence approximation. | Maximum or plateau. | Maximum integrated model likelihood for multi-omics (transcriptome, epigenome, proteome) sepsis data.
Variance Explained | Total variance captured per factor. | Last retained factor explains >1-2% variance. | Ensures factors represent meaningful biological signal beyond technical noise.
Factor Correlations | Correlation between factors. | Low absolute correlation (< 0.3). | Indicates capture of orthogonal sources of variation (e.g., neutrophil activation vs. T-cell exhaustion).
Overshrinkage | Percentage of features with zero variance. | <50% for major omics layers. | Confirms model is not collapsing; key sepsis response genes retain variance.
Reconstruction Error | Error in predicting held-out data. | Minimum error on test set. | Validates model generalizability to new septic patient cohorts.

Experimental Protocols

Protocol 1: Systematic Determination of Optimal k in MOFA+

Objective: To identify the number of latent factors that maximizes biological insight while minimizing over/under-fitting for integrated sepsis multi-omics data.

Materials: Pre-processed multi-omics matrices (e.g., scRNA-seq, scATAC-seq, CITE-seq) from peripheral blood mononuclear cells (PBMCs) of septic patients and controls.

Procedure:

  • Model Training Spectrum: Train separate MOFA+ models across a range of k (e.g., 5 to 30 factors).
  • Convergence Check: For each k, run multiple iterations with different random seeds. Ensure ELBO convergence. Discard non-converged runs.
  • Metric Calculation: For each converged model, calculate metrics in Table 1. Use the plot_variance_explained and plot_factor_cor functions.
  • Stability Analysis: Perform cross-validation by randomly holding out 20% of data. Plot reconstruction error vs. k. Alternatively, draw multiple random subsets of the samples, train a model on each, and assess factor reproducibility (e.g., via correlation of factor weights).
  • Biological Plausibility Assessment: For candidate k values (e.g., at the ELBO plateau), manually inspect top feature loadings per factor. Validate that factors map to known immune cell types (e.g., CD14+ monocytes, NK cells) and sepsis-relevant processes (e.g., interferon response, endotoxin tolerance).
  • Final Selection: Choose the smallest k within the ELBO plateau that yields interpretable, stable, and biologically plausible factors with minimal correlation.
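The final selection rule, "smallest k within the ELBO plateau", can be made explicit in code. In the sketch below the plateau is defined as any k whose ELBO lies within a small fraction (`plateau_tol`) of the observed ELBO range from the best value, and the last retained factor must explain at least `min_r2` percent variance. Both thresholds, the plateau definition, and all numbers are assumptions chosen for illustration. Stability and biological plausibility still need to be assessed separately.

```python
def select_k(elbo_by_k, last_factor_r2_by_k, plateau_tol=0.01, min_r2=1.0):
    """Smallest k on the ELBO plateau whose last factor explains >= min_r2 %.
    Returns None if no k satisfies both criteria."""
    best = max(elbo_by_k.values())
    span = best - min(elbo_by_k.values())
    for k in sorted(elbo_by_k):
        on_plateau = span == 0 or (best - elbo_by_k[k]) <= plateau_tol * span
        if on_plateau and last_factor_r2_by_k[k] >= min_r2:
            return k
    return None


# Hypothetical final ELBO per k and % variance explained by the last factor
elbo_by_k = {5: -5000.0, 10: -4200.0, 15: -4005.0, 20: -4000.0, 25: -4001.0}
last_factor_r2_by_k = {5: 6.0, 10: 3.5, 15: 1.8, 20: 0.7, 25: 0.3}

chosen_k = select_k(elbo_by_k, last_factor_r2_by_k)
print(chosen_k)
```

In this toy example k = 20 has the highest ELBO, but k = 15 is already on the plateau and its last factor still explains more than 1% of the variance, so the smaller model is preferred.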

Protocol 2: Validation via External Septic Shock Cohort

Objective: To externally validate the selected optimal k.

Materials: An independent multi-omics cohort of septic shock patients.

Procedure:

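  • Projection: Project the external data onto the model trained with the optimal k, for example by regressing the centered external data on the learned weight matrix to obtain factor values. Confirm which projection utilities your installed MOFA2 version provides before relying on a specific function name.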
  • Variance Assessment: Calculate the variance explained in the external data. A significant drop suggests over-fitting to the original cohort's noise.
  • Correlation Analysis: Correlate the factor values of the projected data with key clinical sepsis phenotypes (e.g., SOFA score, 28-day mortality). Biologically relevant factors should show significant correlations.

Visualization

[Diagram: pre-processed multi-omics data → train MOFA+ models (k = 5 to 30) → calculate metrics (ELBO, variance explained, factor correlation) → identify ELBO plateau and low factor correlation → biological plausibility assessment of candidate k. Fail: return to model training. Pass: select optimal k → external cohort validation.]

Title: Workflow for Selecting Optimal Number of Factors in MOFA+

[Diagram: Under-fitting (k too low): collapsed cell states, lost biological signal, low variance explained. Optimal fit: distinct cell states, high interpretability, generalizable model. Over-fitting (k too high): noise as 'factors', non-reproducible, poor external validation.]

Title: Consequences of Under-Fitting and Over-Fitting in MOFA+

The Scientist's Toolkit: Research Reagent Solutions

Table 2: Essential Materials for Sepsis Multi-Omics Analysis with MOFA+

Item | Function in Experiment
10x Genomics Chromium Single Cell Immune Profiling Solution | Enables simultaneous capture of transcriptome (GEX) and surface protein (CITE-seq) from single PBMCs, providing two key omics layers for MOFA+ integration.
Cell Ranger ARC | Processing pipeline for single-cell multi-omics data (e.g., ATAC + GEX). Generates count matrices essential as input for MOFA+.
Seurat (v5+) or Scanpy | Toolkits for initial QC, filtering, and basic clustering of single-cell data prior to MOFA+ integration.
MOFA+ (R/Python package) | Core tool for multi-omics factor analysis. Performs dimensionality reduction and identifies latent factors driving variation across omics layers.
Sepsis patient PBMC samples (with controls) | Primary biological material. Should be processed rapidly to preserve cell viability and RNA integrity for reliable multi-omics profiling.
CITE-seq Antibody Panel (Human Immune Cell Phenotyping) | Pre-designed antibody panels against CD3, CD14, CD16, CD19, etc., allow protein-derived cell type annotation to validate MOFA+ factors.
Harmony or BBKNN | Optional tools for batch correction that can be applied prior to MOFA+, especially when integrating data from multiple experimental runs or sepsis cohorts.

Within the broader thesis on MOFA+ application in sepsis immune cell heterogeneity research, a central challenge is the meaningful integration of multi-omics data (e.g., scRNA-seq, CyTOF, proteomics) across complex patient cohorts. Technical batch effects and biologically irrelevant variation often obscure true disease-relevant signals. This protocol details an optimization strategy for leveraging known sample covariates—such as clinical severity scores (SOFA, APACHE II) and sample source (blood, tissue)—to guide the MOFA+ integration process. This "guided integration" enhances model interpretability by ensuring latent factors align with biological and clinical axes of variation, rather than technical confounders.

Application Notes: The Role of Covariates in Sepsis MOFA+ Models

In sepsis, heterogeneity stems from patient demographics, infection source, pathogen, and evolving organ dysfunction. Directly modeling these known variables mitigates their confounding effect, allowing the model to isolate novel, latent sources of inter-patient immune variation.

  • SOFA Score as a Covariate: Used to guide factor identification associated with organ failure severity. This prevents severity-associated expression patterns from being dispersed across multiple technical factors.
  • Sample Source as a Covariate: Crucial when integrating peripheral blood mononuclear cells (PBMCs) with tissue-derived immune cells (e.g., from spleen biopsies). Guides the model to explicitly model source-specific effects, revealing tissue-resident vs. circulatory immune programs.
  • Time Point as a Covariate: In longitudinal sepsis studies, guiding by days post-admission stabilizes the model to identify recovery versus persistent inflammation trajectories.

Table 1: Impact of Covariate-Guided Integration on Model Performance

Metric | Unguided MOFA+ Model | SOFA & Source-Guided Model | Interpretation
Variance Explained (R²) by Factor 1 | 12% (Multi-omics) | 18% (Multi-omics) | Guided model captures more coherent biological signal in top factor.
Alignment of Top Factor with SOFA | Moderate (r=0.45) | High (r=0.82) | Guided factor strongly correlates with clinical severity.
Number of Factors Needed | 10 | 8 | Guided integration reduces number of factors needed to explain same total variance.
Batch Effect (Source) Residual Variance | 15% in Factor 3 | <5% in any factor | Sample source variation is successfully modeled as a covariate, not a latent factor.

Experimental Protocols

Protocol 3.1: Pre-processing and Covariate Matrix Preparation

Objective: Prepare a normalized multi-omics data list and a corresponding sample covariates matrix for MOFA+.

Materials:

  • Seurat object (scRNA-seq) or FlowSOM object (CyTOF).
  • Clinical metadata table (CSV format).
  • R environment (v4.2+) with packages: MOFA2, tidyverse, Seurat.

Procedure:

  • Data Normalization: For each modality, generate a features-by-samples matrix.
    • scRNA-seq: From a Seurat object, extract the log-normalized counts matrix (GetAssayData(slot = "data"); in Seurat v5, use layer = "data" instead of slot).
    • CyTOF: Use arcsinh-transformed (co-factor=5) marker intensity matrix.
    • Proteomics: Use log-transformed LFQ intensity values.
  • Covariate Table Construction: Create a data.frame where rows are samples (matching column names in data matrices) and columns are covariates.
    • Encode categorical variables (e.g., Source: Blood=0, Tissue=1).
    • Scale continuous variables (e.g., SOFA_score) to zero mean and unit variance.
  • MOFA Object Creation:
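The covariate table construction step (encoding categorical variables, scaling continuous ones, and matching sample order to the data matrices) can be sketched in plain Python. The MOFA object creation itself is not shown here. Sample IDs, SOFA scores, and the `build_covariates` helper are hypothetical, and the sample standard deviation is used for scaling as one reasonable choice.

```python
import statistics


def zscore(values):
    """Scale values to zero mean and unit (sample) standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    if sd == 0:
        raise ValueError("cannot scale a constant covariate")
    return [(v - mean) / sd for v in values]


def build_covariates(sample_ids, sofa, source, matrix_columns):
    """Per-sample covariates keyed by sample ID.
    Sample order must match the columns of the data matrices."""
    if list(sample_ids) != list(matrix_columns):
        raise ValueError("sample order does not match data matrix columns")
    source_codes = {"Blood": 0, "Tissue": 1}
    sofa_scaled = zscore(sofa)
    return {
        sid: {"SOFA_z": z, "Source": source_codes[src]}
        for sid, z, src in zip(sample_ids, sofa_scaled, source)
    }


# Hypothetical clinical metadata for five samples
sample_ids = ["P1", "P2", "P3", "P4", "P5"]
sofa_scores = [2, 4, 6, 8, 10]
sample_source = ["Blood", "Blood", "Tissue", "Blood", "Tissue"]
matrix_columns = ["P1", "P2", "P3", "P4", "P5"]  # column names of the omics matrices

covariates = build_covariates(sample_ids, sofa_scores, sample_source, matrix_columns)
for sid, row in covariates.items():
    print(sid, round(row["SOFA_z"], 3), row["Source"])
```

Failing loudly on a sample-order mismatch is intentional: silently misaligned covariates would still train, but every downstream factor-covariate association would be wrong.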

Protocol 3.2: Training a Covariate-Guided MOFA Model

Objective: Train a MOFA+ model with sample covariates to guide factor inference.

Procedure:

  • Set Model Options: Configure the model to include covariates.

  • Prepare Covariates: Specify covariates in the model.

  • Build and Train the Model:

Protocol 3.3: Interpreting Covariate-Linked Factors

Objective: Identify factors associated with guided covariates and perform downstream analysis.

Procedure:

  • Factor-Covariate Correlation:

  • Differential Analysis on Factor Weights: Identify features driving covariate-associated factors.

  • Pathway Enrichment: Perform Gene Ontology (GO) enrichment on top weighted genes using the clusterProfiler package.

Visualizations

[Diagram: Guided Integration Workflow for Sepsis MOFA+. 1. Input preparation: scRNA-seq matrix, CyTOF matrix, and covariates (SOFA, source) → create MOFA object. 2. Guided model training: set covariates in model → train MOFA+ model. 3. Output and interpretation: latent factors (covariate-aligned) → feature weights per factor → pathway enrichment.]

[Diagram: Covariate Impact on Latent Factor Resolution. Unguided integration: Factor 1 (mixed signal), Factor 2 (batch effect), Factor 3 (weak SOFA link). Covariate-guided integration: Factor 1 (strong SOFA link, informed by the SOFA covariate), Factor 2 (source effect, informed by the source covariate), Factor 3 (novel biology).]

The Scientist's Toolkit

Table 2: Essential Reagents & Resources for Guided MOFA+ Analysis in Sepsis

Item / Resource | Function / Description | Example / Provider
MOFA2 R Package | Core software for multi-omics factor analysis with covariate support. | Bioconductor (MOFA2)
Seurat | Pre-processing and analysis of single-cell RNA-seq data for input matrix generation. | CRAN / Satija Lab
FlowSOM / CATALYST | Pre-processing and clustering of CyTOF data for cell-type-specific matrix creation. | Bioconductor
Clinical Metadata Table | Structured CSV file containing sample-matched SOFA scores, source, batch, demographics. | Essential lab record.
High-Performance Computing (HPC) Node | MOFA+ model training is computationally intensive; requires adequate RAM and multi-core CPU. | Local cluster or cloud (AWS, GCP).
ggplot2 & pheatmap | R packages for visualizing factor values, weights, and correlations. | CRAN
clusterProfiler | R package for functional enrichment analysis of top-weighted genes from MOFA+ factors. | Bioconductor

Dealing with Missing Data and Sparse Matrices in Cytokine or Surface Protein Measurements

Within the broader thesis on applying MOFA+ to elucidate immune cell heterogeneity in sepsis research, handling cytokine and surface protein data presents a significant challenge. High-dimensional single-cell technologies (e.g., mass cytometry, flow cytometry, multiplex immunoassays) generate datasets rife with missing values and inherent sparsity. This sparsity arises from technical dropouts, detection limits, and genuine biological absence. Proper management is critical for downstream multi-omics factor analysis (MOFA+) to avoid bias and extract biologically meaningful latent factors driving sepsis pathogenesis.

Table 1: Types of Missing Data in Cytokine/Surface Protein Measurements
| Type of Missingness | Description | Common Cause in Immune Profiling | Impact on MOFA+ |
| --- | --- | --- | --- |
| Missing Completely at Random (MCAR) | Missingness unrelated to observed or unobserved data. | Technical errors, random pipetting failure. | Minimal bias if handled properly, but reduces statistical power. |
| Missing at Random (MAR) | Missingness depends on observed data. | A low-abundance cytokine is more likely to be missing if overall cell signal is low. | Can introduce bias if ignored; model-based imputation can help. |
| Missing Not at Random (MNAR) | Missingness depends on the unobserved value itself. | Protein level is below instrument detection limit (left-censoring). | Most problematic; requires specific modeling (e.g., censored likelihood). |
| Structural Missing (True Zero) | Genuine biological absence of a feature. | Protein not expressed in a given cell type or state. | Should be distinguished from technical dropouts; informative for the model. |

Preprocessing and Imputation Protocols

Protocol 3.1: Data Preprocessing and QC for Cytokine Arrays

Objective: To prepare a cytokine concentration matrix (samples x cytokines) from multiplex assays (e.g., Luminex, Olink) for MOFA+ integration.

  • Raw Data Normalization: Apply plate-specific normalization using background subtraction and positive control scaling.
  • Log Transformation: Log-transform (base 2 or ln) all concentration values to stabilize variance.
  • Missing Value Flagging: Identify values below the assay's lower limit of detection (LLOD). Flag these as potentially MNAR.
  • Initial Filtering: Remove cytokines with >50% missingness (including LLOD) across the cohort.
  • Output: A normalized, log-transformed matrix with annotated missing values.
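The steps of Protocol 3.1 can be sketched in plain Python. This is a minimal illustration, not a vendor pipeline: the cytokine names, concentrations, and LLOD values are hypothetical, plate normalization is assumed to have been done already, and values below the LLOD are stored as missing (`None`) while a separate flag matrix records that they are potentially MNAR.

```python
import math

# Hypothetical concentrations (pg/mL); None marks a failed measurement.
LLOD = {"IL6": 2.0, "TNF": 1.5, "IL10": 4.0}
raw = {
    "IL6":  [150.0, 1.2, 38.0, 9.5],
    "TNF":  [12.0, 20.0, None, 3.1],
    "IL10": [1.0, 2.5, 3.9, 60.0],
}

def preprocess(raw, llod, max_missing=0.5):
    """Log2-transform, flag values below the LLOD, and drop sparse cytokines."""
    values, below_llod = {}, {}
    for name, conc in raw.items():
        # True where a value was measured but falls below the detection limit.
        flags = [c is not None and c < llod[name] for c in conc]
        logged = [None if (c is None or f) else math.log2(c)
                  for c, f in zip(conc, flags)]
        frac_missing = sum(v is None for v in logged) / len(logged)
        if frac_missing <= max_missing:  # remove cytokines with >50% missing
            values[name], below_llod[name] = logged, flags
    return values, below_llod

values, below_llod = preprocess(raw, LLOD)
```

Here IL10 is dropped because three of its four values fall below the LLOD, while IL6 and TNF are kept with their gaps annotated.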
Protocol 3.2: Single-Cell Surface Protein Data Handling (CITE-seq/Flow Cytometry)

Objective: To process single-cell surface protein counts (e.g., ADT from CITE-seq) for MOFA+ integration with transcriptomics.

  • Background Correction: For CITE-seq data, use dsb (Denoised and Scaled by Background) or centered log-ratio (CLR) normalization.
  • Doublet Removal: Identify and remove cell multiplets using tools like scDblFinder.
  • Sparse Matrix Format: Maintain data in a sparse matrix format (e.g., .mtx, dgCMatrix) to efficiently store abundant zero counts.
  • Feature Selection: Retain antibodies with detectable expression in >1% of cells in at least one sample group.
  • Missing Value Coding: Code missing protein measurements as NA. Do not conflate zero counts (no antibody capture) with missing.
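As a rough sketch of the CLR and feature-selection steps in Protocol 3.2, the following plain-Python example normalizes hypothetical ADT counts. It assumes a pseudocount of 1 and applies CLR per cell across antibodies; some toolkits apply it per antibody across cells instead, so the margin is a choice to check against your pipeline. The 1% detection filter is applied to a single sample group for brevity.

```python
import math

def clr(counts, pseudocount=1.0):
    """Centered log-ratio for one cell: log(x + 1) minus the mean log(x + 1)."""
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

def detected_fraction(counts_per_cell, j):
    """Fraction of cells with a non-zero count for antibody j."""
    return sum(cell[j] > 0 for cell in counts_per_cell) / len(counts_per_cell)

# Hypothetical ADT counts: rows are cells, columns are antibodies.
antibodies = ["CD14", "HLA-DR", "IsoCtrl"]
cells = [[120, 30, 0], [5, 200, 0], [0, 80, 0], [60, 10, 0]]

# Retain antibodies detected in more than 1% of cells.
keep = [j for j in range(len(antibodies)) if detected_fraction(cells, j) > 0.01]
kept_names = [antibodies[j] for j in keep]
normalized = [clr([cell[j] for j in keep]) for cell in cells]
```

Each normalized row sums to zero by construction, which is a quick sanity check after CLR.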
Protocol 3.3: Model-Based Imputation for MAR/MCAR Data using softImpute

Objective: Impute missing values in a cytokine matrix where missingness is assumed to be MAR/MCAR.

  • Installation: Install the softImpute R package.
  • Matrix Preparation: Input your normalized, log-transformed matrix.
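  • Rank Selection: Select the rank (rank.max) and regularization (lambda) by cross-validation: hold out a random subset of observed entries, fit softImpute over a grid of settings, and keep the setting with the lowest error on the held-out entries. lambda0() gives the upper end of a sensible lambda grid.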
  • Imputation: Run softImpute with the selected rank and lambda parameter (regularization) to complete the matrix.
  • Reintegration: Use the completed matrix as input for MOFA+. Note: This is not recommended for MNAR data.

MOFA+ Specific Integration Strategy

MOFA+ inherently handles missing values by using a probabilistic framework. The following protocol details the optimal setup for sparse immune data.

Protocol 4.1: MOFA+ Model Configuration for Sparse Cytokine/Protein Data

Objective: Configure a MOFA+ model that appropriately models different types of missingness.

  • Data Input: Provide each modality (e.g., RNA, cytokines, surface proteins) as a list of matrices. Samples must be aligned across modalities.
  • Likelihood Specification:
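    • For log-transformed cytokine concentrations with MNAR values (below the LLOD): MOFA2 offers "gaussian", "poisson", and "bernoulli" likelihoods and has no censoring argument. Use likelihood = "gaussian" and either leave sub-LLOD values as NA or substitute a fixed value (e.g., the log-transformed LLOD/2), then check that the factors are stable under that choice. A true censored likelihood requires a custom model outside MOFA2.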
    • For surface protein counts: Use a "poisson" likelihood for raw counts or "gaussian" for normalized, transformed data.
    • For RNA expression: Use a "gaussian" likelihood for log-counts.
  • Model Training: Train with run_mofa(), setting convergence_mode = "slow" in the training options for complex, sparse data.
  • Data Overview: Use the plot_data_overview() function to visualize which samples are present or missing in each view.

[Diagram: Raw cytokine matrix (concentrations or counts) → Protocol 3.1 (normalize, log-transform, flag LLOD); raw single-cell protein counts → Protocol 3.2 (CLR/dsb, sparse format). Protocol 3.1 output passes through optional imputation (e.g., softImpute) for MAR/MCAR data, or goes directly to MOFA+ setup for MNAR data; Protocol 3.2 output goes directly to MOFA+ setup (likelihoods and MNAR handling) → trained MOFA+ model (latent factors) → downstream analysis and factor interpretation.]

Title: Workflow for Sparse Immune Data in MOFA+

The Scientist's Toolkit: Research Reagent Solutions

Table 2: Essential Tools for Managing Sparse Immune Data
| Item / Reagent | Function / Purpose | Example Product / Package |
| --- | --- | --- |
| Multiplex Immunoassay Kits | Simultaneously measure 30+ cytokines/chemokines from limited sample volume. | Bio-Plex Pro Human Cytokine 48-plex, Olink Target 96 |
| Cell Hashing Antibodies | Multiplex samples in single-cell protocols, reducing batch effects and identifying doublets. | BioLegend TotalSeq-C Antibodies |
| dsb R Package | Normalizes and denoises CITE-seq/REAP-seq ADT data using background droplets. | dsb (CRAN) |
| softImpute R Package | Performs matrix completion via nuclear norm regularization for imputation of MAR data. | softImpute (CRAN) |
| MOFA+ R/Python Package | Integrates multi-omics data with built-in handling of missing values and sparsity. | MOFA2 (Bioconductor) |
| Censored Likelihood Model | Explicitly models MNAR data (e.g., values below LLOD) within a factor analysis framework. | Not built into MOFA2 (likelihoods: gaussian, poisson, bernoulli); requires a custom or external model |
| Sparse Matrix Objects | Efficient storage and computation for datasets with >90% zeros. | R: dgCMatrix, Python: scipy.sparse.csr_matrix |

Advanced Consideration: Signaling Pathway Analysis from Imputed Data

After MOFA+ factor extraction, inferred latent factors can be used to reconstruct complete data for pathway analysis. This is particularly useful for cytokine signaling networks.
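The reconstruction idea can be illustrated without any MOFA-specific code: a factor model approximates the data as the product of factor values Z (samples x factors) and weights W (features x factors). The sketch below uses a hypothetical toy model, assumes the data were centered so no intercept term is needed, and keeps observed values while filling only the gaps.

```python
def reconstruct(factors, weights):
    """Return the samples x features matrix Z W^T from factors Z and weights W."""
    return [[sum(z_k * w_k for z_k, w_k in zip(z, w)) for w in weights]
            for z in factors]

def fill_missing(observed, predicted):
    """Keep observed entries; use model predictions only where data are None."""
    return [[p if o is None else o for o, p in zip(obs_row, pred_row)]
            for obs_row, pred_row in zip(observed, predicted)]

# Hypothetical toy model: 3 samples, 2 factors, 2 cytokines.
Z = [[1.0, 0.0], [0.5, 2.0], [-1.0, 1.0]]       # samples x factors
W = [[2.0, 1.0], [0.0, -1.0]]                   # features x factors
Y = [[2.1, None], [None, -2.2], [-1.9, -0.8]]   # observed data with gaps

Y_hat = reconstruct(Z, W)
Y_complete = fill_missing(Y, Y_hat)
```

Pathway scores computed on such a completed matrix inherit the model's assumptions, so it is worth reporting which entries were observed and which were reconstructed.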

Title: From Imputed Data to Pathway Analysis

Validation Protocol

Protocol 7.1: Validating Imputation and Integration Results

Objective: To ensure that data handling did not introduce artificial signals.

  • Downshift Experiment: Artificially introduce additional missingness (MCAR) into a complete subset of data. Apply the imputation/MOFA+ pipeline and compare the correlation between original and imputed values.
  • Negative Control Features: Include non-informative proteins/cytokines in the analysis. Their weights in MOFA+ factors should be near zero.
  • Biological Replication: Check if the MOFA+ factors separate known biological conditions (e.g., septic shock vs. healthy) and if relevant cytokine loadings align with known biology.
  • Comparison to Ground Truth: Where possible, compare imputed values for key cytokines with measurements from a validated, orthogonal assay (e.g., ELISA).
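The masking ("downshift") experiment in the first step can be sketched as follows. The imputer here is a deliberately simple column-mean stand-in, an assumption made so the example is self-contained; in practice you would pass in your softImpute or MOFA+ based reconstruction. The toy matrix and the 20% masking fraction are also illustrative.

```python
import random

def pearson(x, y):
    """Pearson correlation; returns NaN if either vector is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    denom = (sxx * syy) ** 0.5
    return sxy / denom if denom > 0 else float("nan")

def column_mean_impute(matrix):
    """Stand-in imputer: replace None with the mean of the observed column values.
    Assumes every column keeps at least one observed value."""
    cols = list(zip(*matrix))
    means = [sum(v for v in c if v is not None) / sum(v is not None for v in c)
             for c in cols]
    return [[means[j] if v is None else v for j, v in enumerate(row)]
            for row in matrix]

def masking_experiment(complete, impute, frac=0.2, seed=0):
    """Hide a random fraction of entries (MCAR), impute, and compare to truth."""
    rng = random.Random(seed)
    cells = [(i, j) for i in range(len(complete)) for j in range(len(complete[0]))]
    masked = rng.sample(cells, int(frac * len(cells)))
    holed = [row[:] for row in complete]
    for i, j in masked:
        holed[i][j] = None
    imputed = impute(holed)
    truth = [complete[i][j] for i, j in masked]
    guess = [imputed[i][j] for i, j in masked]
    return pearson(truth, guess), len(masked)

# Hypothetical complete subset: 6 samples x 3 cytokines on different scales.
complete = [[float(i + 10 * j) for j in range(3)] for i in range(6)]
r, n_masked = masking_experiment(complete, column_mean_impute)
```

Repeating the experiment over several seeds and masking fractions gives a more stable picture than a single run.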

Within the broader thesis on applying MOFA+ (Multi-Omics Factor Analysis) to dissect immune cell heterogeneity in sepsis, large-scale cohort studies present unique computational challenges. The integration of high-dimensional data (e.g., scRNA-seq, CyTOF, proteomics) from hundreds of patients demands strategies for efficient data handling, model training, and interpretation. These application notes provide targeted protocols to ensure scalability, reproducibility, and performance.

Data Preprocessing & Dimensionality Reduction

Prior to MOFA+ modeling, effective preprocessing reduces computational load without sacrificing biological signal.

Protocol 1.1: Feature Selection for High-Dimensional Assays

Objective: Reduce the number of input features for each omics layer to the most informative 5,000.

  • For scRNA-seq Data: Model the mean-variance trend with scran (modelGeneVar()) and select the top 5,000 genes with the highest biological component of variance (getTopHVGs()).
  • For CyTOF/Proteomics Data: Apply an arcsinh transformation (cofactor=5). Select markers based on a) pre-defined biological importance and b) the top features by population variance.
  • Sample Filtering and Imputation: Exclude samples with >50% missing data per modality. Impute remaining missing values per feature using k-nearest neighbors (k=10) within each batch.
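The CyTOF step of this protocol, an arcsinh transform with cofactor 5 followed by variance-based marker selection, can be sketched in plain Python. The marker names and intensities are hypothetical, and only the variance criterion is shown; markers chosen for pre-defined biological importance would be added to the selection by hand.

```python
import math
from statistics import pvariance

def arcsinh_transform(matrix, cofactor=5.0):
    """Apply asinh(x / cofactor) to every intensity value."""
    return [[math.asinh(v / cofactor) for v in row] for row in matrix]

def top_variable_features(matrix, names, n_top):
    """Rank features (columns) by population variance and keep the n_top highest."""
    variances = {name: pvariance(col) for name, col in zip(names, zip(*matrix))}
    return sorted(names, key=lambda n: variances[n], reverse=True)[:n_top]

# Hypothetical intensities: rows are samples, columns are markers.
markers = ["CD3", "CD14", "HLA-DR", "CD45"]
intensities = [
    [5.0, 400.0, 250.0, 900.0],
    [300.0, 2.0, 20.0, 950.0],
    [280.0, 4.0, 15.0, 870.0],
    [3.0, 350.0, 30.0, 910.0],
]

transformed = arcsinh_transform(intensities)
selected = top_variable_features(transformed, markers, n_top=2)
```

In this toy data CD45 is bright in every sample and therefore uninformative after transformation, whereas CD14 and CD3 vary most and are retained.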

Quantitative Performance Impact:

Table 1: Preprocessing Impact on Runtime & Memory

| Step | Input Size (Example) | Output Size | Approx. Runtime | RAM Requirement |
| --- | --- | --- | --- | --- |
| Raw scRNA-seq (Cells x Genes) | 50k cells x 30k genes | 50k cells x 30k genes | - | High (20+ GB) |
| After Feature Selection | 50k cells x 30k genes | 50k cells x 5k genes | 15 min | Moderate (8 GB) |
| MOFA+ Model Training (10 Factors) | 500 samples x 3 omics | Converged model | 2 hours | 12 GB |
| Without Feature Selection | 500 samples x 3 omics | Model failed to converge | >24 hours | Out of Memory |

Optimized MOFA+ Training Workflow

A structured training protocol prevents common bottlenecks.

Protocol 2.1: Staged Model Training for Large Cohorts

Objective: Achieve a stable, converged MOFA+ model for 500+ patients with 3 omics layers.

  • Data Object Creation: Use create_mofa function. Ensure sample names are identical across omics matrices.
  • Model Options:
    • Set scale_views = TRUE in the data options so that each view is scaled to unit variance and high-variance data types do not dominate.
    • Use likelihoods appropriate to data: "gaussian" for log-normalized counts, "poisson" for raw counts.
  • Staged Training Run:
    • Stage 1 (Low Precision): Set num_factors=15 (overestimate) in the model options and seed=123, maxiter=5 in the training options, pass them to prepare_mofa, and train with run_mofa. This is a fast, low-commitment run.
    • Stage 2 (Evaluate): Inspect plot_variance_explained. If some factors explain little variance (<2%), reduce num_factors for the next run.
    • Stage 3 (High Precision): Using the refined factor number (e.g., 10), run final training with convergence_mode="slow", maxiter=10000, and startELBO=1. Monitor ELBO convergence.
  • Model Caching: Save the fully trained model as an .rds file for downstream analysis.
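The evaluation stage reduces to a simple rule: keep a factor only if it explains at least 2% of variance in some view. A minimal sketch of that rule is below; the variance-explained numbers are hypothetical stand-ins for what you would read off the variance decomposition of the low-precision run, and the function name is illustrative rather than part of any package.

```python
def select_active_factors(r2_per_factor, threshold=2.0):
    """Keep factors explaining at least `threshold` percent of variance in any view."""
    return [name for name, by_view in r2_per_factor.items()
            if max(by_view.values()) >= threshold]

# Hypothetical variance explained (%) per factor and view from a low-precision run.
r2 = {
    "Factor1": {"rna": 12.5, "protein": 8.3},
    "Factor2": {"rna": 7.1, "protein": 15.2},
    "Factor3": {"rna": 0.4, "protein": 2.6},
    "Factor4": {"rna": 1.1, "protein": 0.9},
    "Factor5": {"rna": 0.2, "protein": 0.1},
}

active = select_active_factors(r2)
num_factors_next_run = len(active)
```

Using the maximum across views keeps factors that are specific to one modality, which is often where cytokine- or protein-driven biology shows up.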

[Diagram: Raw multi-omic data (500 samples x 3 layers) → Protocol 1.1 feature selection and filtering → create MOFA object (scale_views=TRUE) → low-precision run (15 factors, maxiter=5) → evaluate variance explained and drop redundant factors → high-precision run (10 factors, maxiter=10000) → cached model (.rds) for downstream analysis.]

Diagram Title: Staged MOFA+ Training Workflow for Large Cohorts

Downstream Analysis & Interpretation

Efficient extraction of factors and associations is key.

Protocol 3.1: Batch-Corrected Factor Extraction

Objective: Extract factors while accounting for technical cohort (e.g., sequencing batch).

  • Use get_factors(model, as.data.frame=TRUE) to extract factors.
  • Merge with sample metadata.
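  • Fit a linear mixed model with Batch as a random effect (e.g., lmer): Factor ~ Clinical_Phenotype + Age + Gender + (1 | Batch), and test the Clinical_Phenotype coefficient for association.
  • If batch-corrected factor values are needed for plotting or other tests, take the residuals from a model that contains Batch only; residuals from the full model would also have the phenotype effect removed.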
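As a simplified illustration of batch correction by residuals, the sketch below subtracts each batch's mean factor value, which is what the residuals of a batch-only fixed-effect model (Factor ~ Batch) amount to. This is an assumption-laden stand-in for a mixed model: the factor values and batch labels are hypothetical, and no covariates are included.

```python
from collections import defaultdict

def center_within_batch(values, batches):
    """Subtract each batch's mean factor value (residuals of Factor ~ Batch)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for v, b in zip(values, batches):
        sums[b] += v
        counts[b] += 1
    means = {b: sums[b] / counts[b] for b in sums}
    return [v - means[b] for v, b in zip(values, batches)]

# Hypothetical Factor 1 values with a strong shift between two batches.
factor1 = [1.0, 2.0, 3.0, 11.0, 12.0, 13.0]
batch = ["A", "A", "A", "B", "B", "B"]

corrected = center_within_batch(factor1, batch)
```

If clinical phenotype is unevenly distributed across batches, this kind of centering also removes real biology, so check the phenotype-by-batch table before relying on it.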

The Scientist's Toolkit

Table 2: Key Research Reagent & Computational Solutions

| Item | Function & Rationale |
| --- | --- |
| MOFA+ (R/Python Package) | Core tool for multi-omics integration. Identifies latent factors driving variation across all data types. |
| scran (R Package) | Provides robust, fast variance estimation for scRNA-seq feature selection, critical for input size reduction. |
| Seurat (R Package) | Alternative for scRNA-seq preprocessing, cell annotation, and can be used to generate input matrices for MOFA+. |
| Harmony (R Package) | For batch integration prior to MOFA+ if severe batch effects are known. Can be run on PCs from each omic. |
| High-Performance Computing (HPC) Cluster | Essential for large runs. Use SLURM job arrays to train multiple models (e.g., with different factor counts) in parallel. |
| RDS / HDF5 File Formats | RDS for saving R objects (trained models). HDF5 back-end (rhdf5) for storing massive omics matrices on disk, not in RAM. |
| Conda/Docker Environments | Ensure computational reproducibility by freezing package versions and OS dependencies for the entire analysis pipeline. |

[Diagram: Large sepsis cohort (Sepsis-2/3, controls) → multi-omic profiling (scRNA-seq, plasma proteomics, metabolomics) → preprocessed and aligned data matrices → MOFA+ model (latent factors 1...N) → deconvolution of immune cell heterogeneity → association with clinical outcomes and endotypes; the MOFA+ model also links directly to clinical outcomes.]

Diagram Title: MOFA+ in Sepsis Immune Heterogeneity Thesis

Benchmarking Success: Validating MOFA+ Findings Against Alternative Sepsis Analysis Tools

Application Notes

The study of immune cell heterogeneity in sepsis requires the integration of multi-omics data (e.g., scRNA-seq, surface protein, chromatin accessibility). This analysis compares four factor discovery/integration methods within the context of a thesis on MOFA+ application in sepsis research.

Table 1: Method Comparison for Multi-omics Factor Discovery

| Feature | PCA | NMF | Seurat Integration | MOFA+ |
| --- | --- | --- | --- | --- |
| Core Objective | Maximize variance in a single data set. | Find parts-based, non-negative representation. | Align shared cell states across datasets. | Identify latent factors explaining variance across multiple omics. |
| Data Types | Single matrix. | Single non-negative matrix. | Multiple matrices (same feature type). | Multiple matrices (different feature types). |
| Integration Type | Not applicable. | Not applicable. | Horizontal (same features, different cells/batches/conditions). | Vertical (same cells, different omics) or Group (different groups, same omics). |
| Factor Interpretation | Linear combinations of all features (global). | Additive, parts-based combinations. | Shared nearest neighbors graph. | Sparse; factors can be active in subsets of omics and groups. |
| Handling Sparsity | Poor. | Moderate (implicitly encourages sparsity). | Good (via graph-based methods). | Explicitly modeled (sparsity priors). |
| Variance Decomposition | Per dataset only. | Per dataset only. | Not directly provided. | Quantifies % of variance explained per factor, per view, per group. |
| Output for Sepsis | Major transcriptional programs. | Co-regulated gene modules. | A unified cell embedding correcting for batch. | Latent factors linking, e.g., transcriptomic module X to protein Y, specific to a patient group. |

Table 2: Illustrative Quantitative Output (Simulated Sepsis Data)

| Latent Factor | % Variance Explained (Transcriptome) | % Variance Explained (Proteome) | Top Feature Loadings (Transcriptome) | Association with Clinical Group |
| --- | --- | --- | --- | --- |
| Factor 1 | 12.5% | 8.3% | S100A8, S100A9, IL1B | High in septic shock |
| Factor 2 | 7.1% | 15.2% | HLA-DRA, CD74, CIITA | Low in non-survivors |
| Factor 3 | 5.3% | 1.8% | MS4A1, CD79A | B cell signature, stable across groups |
| Factor 4 | 3.4% | 4.9% | PDCD1, LAG3, HAVCR2 | High in immunosuppressed phase |

Experimental Protocols

Protocol 1: Multi-omics Data Preprocessing for MOFA+

  • Input Data: Prepare m data matrices (views) for n shared cells/samples. For sepsis: View 1: scRNA-seq counts (genes x cells), View 2: ADT-derived surface protein counts (proteins x cells).
  • Normalization:
    • scRNA-seq: Log-normalize (e.g., log1p(CP10K)). Highly variable gene selection recommended.
    • Protein: Centered log-ratio (CLR) normalization.
  • Group Definition: Create a sample metadata column defining groups (e.g., Septic_Shock, Sepsis, Control).
  • MOFA Object Creation: Use create_mofa() function, specifying data matrices and groups.
  • Model and Training Options: Set model options (e.g., number of factors, sparsity priors) and training options (e.g., convergence mode, seed).

Protocol 2: Comparative Analysis Workflow

  • Baseline PCA: Perform PCA on the log-normalized scRNA-seq matrix separately. Identify top PCs.
  • Baseline NMF: Apply NMF (e.g., using NMF R package) to the non-negative scRNA-seq matrix. Determine optimal rank k.
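  • Seurat CCA Integration: Split the scRNA-seq data into one Seurat object per batch or sample. Use FindIntegrationAnchors() (reduction = "cca") and IntegrateData() to correct for batch/technique effects. Anchor-based integration relies on shared features, so ADT data are integrated separately rather than anchored against RNA.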
  • MOFA+ Training: Run run_mofa() on the prepared object. Use default ELBO convergence criteria.
  • Downstream Correlation: For all methods, correlate derived factors/embeddings with clinical metadata (e.g., SOFA score, survival). Validate key factors via known marker expression.

The Scientist's Toolkit: Key Research Reagent Solutions

| Item | Function in Sepsis Multi-omics Research |
| --- | --- |
| 10x Genomics Feature Barcoding | Enables simultaneous capture of transcriptome and surface protein (e.g., Immune Profile panel) from the same single cell. |
| Cell Hashing Antibodies (TotalSeq) | Allows sample multiplexing, reducing batch effects and costs in patient cohort studies. |
| Cell Fixation & Permeabilization Kits | Preserve cells for sorting/transport and enable intracellular protein staining if required. |
| MOFA+ R Package | The primary tool for Bayesian multi-omics factor analysis and variance decomposition. |
| Seurat R Toolkit | Standard for single-cell analysis, providing PCA and CCA-based integration functions (NMF requires a separate package such as NMF). |
| scran R Package | Provides robust methods for normalization and highly variable gene detection prior to factor analysis. |

Visualizations

[Diagram: Multi-omics input (scRNA-seq and ADT) feeds four parallel methods, PCA (single-view), NMF (single-view), Seurat CCA (multi-batch), and MOFA+ (multi-view/group), all converging on factor/embedding analysis.]

Title: Comparative Analysis Workflow for Factor Discovery

[Diagram: Inputs (RNA matrix, protein matrix, clinical groups) → MOFA+ model (Bayesian framework) → outputs (latent factors, weights per view, variance explained). Variance explained is quantified per factor (Factors 1-3), per view (RNA, protein), and per group (Groups 1-2).]

Title: MOFA+ Model Input and Output Structure

[Diagram: MOFA+ Factor 1 (high in septic shock) explains 12.5% of variance in the transcriptome view (high loadings: S100A8/9, IL1B) and 8.3% of variance in the proteome view (CD14 high loading, TREM1 medium loading).]

Title: MOFA+ Factor Links Features Across Omics

Application Notes

This protocol details the biological validation of Multi-Omics Factor Analysis+ (MOFA+) models within the context of sepsis research, focusing on immune cell heterogeneity. MOFA+ is a powerful unsupervised integration tool that decomposes multi-omics datasets into a set of latent factors and corresponding loadings. A primary challenge lies in interpreting these statistically derived factors biologically. This document provides a framework for correlating MOFA+ factors with flow cytometry-defined immune cell populations and clinical outcome measures to establish biological and clinical relevance.

Key Workflow: Following MOFA+ model training and factor selection, identified factors are correlated with:

  • High-dimensional flow cytometry data (e.g., manually gated population frequencies, cytometry-derived biomarkers like cellular activation states).
  • Clinical outcome parameters from septic patients (e.g., SOFA score, 28-day mortality, secondary infection rate).

Successful correlation validates that MOFA+ factors capture biologically meaningful immune cell variation with direct clinical implications, moving from statistical abstraction to mechanistic insight.

Protocols

Protocol 1: Post-MOFA+ Analysis for Biological Correlation

Objective: To extract and prepare MOFA+ factor values for downstream correlation with external biological and clinical data.

Materials:

  • Trained MOFA+ model (R object).
  • R environment with MOFA2 package installed.
  • Corresponding sample metadata table.

Methodology:

  • Factor Value Extraction: Use the get_factors(model) function to extract the factor values, returned as one samples x factors matrix per group.
  • Sample Alignment: Ensure the row order of the factor matrix perfectly matches the sample order in the flow cytometry and clinical data tables. Use the sample IDs in the metadata for merging.
  • Dataframe Construction: Create a unified data frame with columns: Sample_ID, Factor1, Factor2, ... FactorN, and subsequent columns for flow and clinical variables.
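The alignment and merging steps can be sketched as an inner join on Sample_ID that also reports samples present in only one table. The sample IDs, factor values, and metadata columns below are hypothetical, and plain dictionaries stand in for whatever data frame library you use.

```python
def build_unified_table(factors, metadata):
    """Inner-join factor values and metadata on Sample_ID; report unmatched IDs."""
    shared = [sid for sid in factors if sid in metadata]
    rows = [{"Sample_ID": sid, **factors[sid], **metadata[sid]} for sid in shared]
    unmatched = sorted(set(factors) ^ set(metadata))
    return rows, unmatched

# Hypothetical factor values and metadata, deliberately in different orders.
factors = {
    "P01": {"Factor1": 0.8, "Factor2": -1.2},
    "P02": {"Factor1": -0.3, "Factor2": 0.5},
    "P03": {"Factor1": 1.4, "Factor2": 0.1},
}
metadata = {
    "P02": {"SOFA_day3": 6, "HLA_DR_low_pct": 41.0},
    "P01": {"SOFA_day3": 11, "HLA_DR_low_pct": 63.5},
    "P04": {"SOFA_day3": 4, "HLA_DR_low_pct": 22.0},
}

table, unmatched = build_unified_table(factors, metadata)
```

Joining by ID rather than by row position avoids silent misalignment, and a non-empty `unmatched` list is a prompt to check sample naming before any correlation is run.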

Protocol 2: High-Dimensional Flow Cytometry Correlation

Objective: To quantify the relationship between MOFA+ factors and immune cell population frequencies or states.

Materials:

  • Processed flow cytometry data (CSV file) containing pre-computed population frequencies (% of parent or live cells) or Median Fluorescence Intensity (MFI) values for activation markers.
  • Unified correlation dataframe from Protocol 1.
  • Statistical software (R, Python, or Prism).

Methodology:

  • Data Integration: Merge the flow cytometry data with the MOFA+ factor dataframe by Sample_ID.
  • Correlation Analysis: For each factor and each flow cytometry variable of interest, perform a non-parametric correlation test (e.g., Spearman's rank correlation) due to potential non-normal distributions common in biological data.
    • In R: cor.test(df$FactorX, df$FlowVariable, method="spearman")
  • Multiple Testing Correction: Apply Benjamini-Hochberg False Discovery Rate (FDR) correction across all tested correlations for a given factor.
  • Visualization: Generate scatter plots for significant correlations (e.g., FDR < 0.05) with a regression line.
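To make the correlation and correction steps concrete, here is a dependency-free sketch of Spearman's rank correlation (with average ranks for ties) and the Benjamini-Hochberg adjustment. The factor values, flow variable, and p-values are hypothetical; in practice the p-values would come from a test such as cor.test.

```python
def ranks(values):
    """Average ranks (1-based); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return r

def spearman(x, y):
    """Spearman's rho as the Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value to the smallest
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical data: Factor 1 values and % HLA-DR-low monocytes for five patients.
rho = spearman([0.1, 0.5, 0.9, 1.3, 2.0], [12.0, 30.0, 25.0, 48.0, 60.0])
fdr = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
```

The adjusted values depend on how many tests enter the correction, so the set of tested flow variables should be fixed before looking at the results.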

Protocol 3: Clinical Outcome Association

Objective: To assess the association between MOFA+ factors and patient clinical outcomes.

Materials:

  • Clinical data table with parameters: Sequential Organ Failure Assessment (SOFA) score (continuous), 28-day survival status (binary), demographics, source of infection.
  • Unified correlation dataframe from Protocol 1.

Methodology:

  • Continuous Outcomes (e.g., SOFA score): Perform linear regression or Spearman correlation.
    • lm(SOFA_day3 ~ Factor1 + Age + Gender, data=df)
  • Binary Outcomes (e.g., 28-day mortality): Perform logistic regression.
    • glm(Mortality_28day ~ Factor1 + APACHEII, data=df, family="binomial")
  • Survival Analysis: For time-to-event data (e.g., survival time), use Cox Proportional Hazards regression.
    • coxph(Surv(time, status) ~ Factor1, data=df)
  • Model Adjustment: Always include relevant clinical covariates (e.g., age, severity score at admission) in multivariate models to control for confounding.
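For intuition about what the logistic model estimates, the sketch below fits an unadjusted model of mortality on a single factor by Newton-Raphson and converts the slope to an odds ratio. It is an illustration only: the data are hypothetical, no covariates such as age or APACHE II are included, and a real analysis would use an established routine such as glm, which also reports standard errors.

```python
import math

def fit_logistic(x, y, n_iter=25):
    """Fit logit P(y=1) = b0 + b1*x by Newton-Raphson; returns (b0, b1)."""
    b0 = b1 = 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p            # gradient w.r.t. intercept
            g1 += (yi - p) * xi     # gradient w.r.t. slope
            h00 += w                # entries of the 2x2 information matrix
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Hypothetical factor values and 28-day mortality (1 = died) for ten patients.
factor_values = [-1.5, -1.0, -0.5, -0.2, 0.0, 0.3, 0.6, 1.0, 1.4, 1.8]
died = [0, 0, 0, 1, 0, 0, 1, 1, 0, 1]

b0, b1 = fit_logistic(factor_values, died)
odds_ratio = math.exp(b1)  # odds multiplier per one-unit increase in the factor
```

The odds ratio is per unit of the factor, so its size depends on how factor values are scaled; standardizing factors before modeling makes estimates comparable across factors.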

Data Presentation

Table 1: Exemplary Significant Correlations between MOFA+ Factor 1 and Flow Cytometry Variables

| Flow Cytometry Variable | Cell Population / Marker | Spearman's ρ | P-value | FDR-adjusted P-value |
| --- | --- | --- | --- | --- |
| CD14+ HLA-DRlow/%Monocytes | Immunosuppressive Monocytes | 0.72 | 3.1e-05 | 0.0004 |
| CD8+ PD-1+/%CD8 T cells | Exhausted CD8+ T Cells | 0.68 | 1.2e-04 | 0.0008 |
| mDC_Lineage-/%Live | Myeloid Dendritic Cell Frequency | -0.61 | 0.0012 | 0.0060 |
| CD4_IFNγ_MFI | Th1 Functionality | -0.55 | 0.0041 | 0.0150 |

Table 2: Association of MOFA+ Factor 2 with Clinical Outcomes in Sepsis Cohort (N=75)

| Clinical Outcome | Statistical Test | Effect Estimate (95% CI) | P-value | Interpretation |
| --- | --- | --- | --- | --- |
| SOFA Score (Day 3) | Linear Regression (β) | 1.85 (0.92 to 2.78) | 0.0002 | Higher Factor 2 → Higher Organ Dysfunction |
| 28-Day Mortality | Logistic Regression (OR) | 3.45 (1.68 to 7.10) | 0.001 | Higher Factor 2 → 3.45x Odds of Death |
| Time to Secondary Infection | Cox Regression (HR) | 2.10 (1.25 to 3.52) | 0.005 | Higher Factor 2 → 2.1x Hazard Rate |

Visualizations

[Diagram: Multi-omics input data (RNA-seq, ATAC-seq, etc.) → MOFA+ model training and factor extraction → latent factors (samples x factors matrix). In the biological and clinical validation stage, factor values are combined with flow cytometry data (e.g., cell frequencies, MFI) and clinical outcomes (e.g., SOFA, mortality) in statistical correlation and association analysis → validated biological factor with clinical prognosis.]

Workflow for Correlating MOFA+ Factors with Experimental & Clinical Data

[Diagram: MOFA+ Factor 1 (high value) is interpreted as the immune state 'Myeloid Dysfunction & T Cell Exhaustion'. Supporting flow cytometry phenotypes: ↑ CD14+ HLA-DRlow monocytes, ↑ CD8+ PD-1+ CD8 T cells, ↓ mDC frequency. Associated clinical outcomes: ↑ SOFA score, ↑ 28-day mortality, ↑ secondary infection.]

Biological Interpretation of a Validated MOFA+ Factor in Sepsis

The Scientist's Toolkit

| Research Reagent / Material | Function in Validation Workflow |
| --- | --- |
| MOFA2 (R/Python Package) | Core tool for building the multi-omics integration model and extracting latent factors for validation. |
| Pre-conjugated Flow Cytometry Antibody Panels | For staining immune cell surface/intracellular markers (e.g., CD14, HLA-DR, PD-1, lineage markers) to generate validation data. |
| Viability Dye (e.g., Zombie Aqua) | Critical for excluding dead cells during flow cytometry analysis, ensuring data quality. |
| Flow Cytometry Standard (FCS) Files & Analysis Software (FlowJo, Cytobank) | To process raw cytometry data, perform gating, and export population frequencies/MFI for correlation. |
| Clinical Database (REDCap, etc.) | Secure repository for patient SOFA scores, mortality, and other outcomes needed for association studies. |
| R Statistical Environment with tidyverse, survival, lme4 packages | For performing Spearman correlation, linear/logistic/Cox regression, and multiple testing correction. |
| High-Quality Nucleic Acid Extraction Kits | For generating the RNA/DNA input for the original omics assays (RNA-seq, ATAC-seq) fed into MOFA+. |
| Single-Cell Multi-Omics Platforms (Optional) | e.g., CITE-seq, to simultaneously measure transcriptomics and surface protein (antibody-derived tags) for orthogonal validation. |

This Application Note details the protocol for applying Multi-Omics Factor Analysis (MOFA+) to a published sepsis atlas dataset. Within the broader thesis of 'MOFA+ Application in Immune Cell Heterogeneity Sepsis Research', this case study demonstrates how to disentangle the complex sources of variation—including patient-specific effects, immune cell population shifts, and inflammatory signaling states—from bulk or single-cell multi-omics data. The goal is to derive actionable biological factors and generate testable hypotheses for therapeutic intervention.

Key Published Datasets for Re-analysis

The following table summarizes relevant published datasets suitable for MOFA+ re-analysis.

Table 1: Published Sepsis Multi-omics Datasets for Re-analysis

| Dataset Reference | Data Types | Cohort (n) | Key Available Features | Public Accession |
| --- | --- | --- | --- | --- |
| Reyes et al., Sci. Immunol., 2020 | Bulk RNA-seq, Cell surface protein (CITE-seq), Clinical metadata | Sepsis: 29, Healthy: 15 | Whole blood immunophenotyping, severity scores | GSE167363 |
| Scicluna et al., Nat. Commun., 2017 | Bulk whole-blood RNA-seq, Clinical data | Sepsis: 306 (discovery), 216 (validation) | Transcriptional endotypes (Mars1/Mars2), mortality | E-MTAB-4451 |
| Sepsis Atlas (GSE65682) | Bulk RNA-seq | Sepsis: 479, Healthy: 42 | Large cohort with longitudinal sampling, outcomes | GSE65682 |
| COVID-19 as Sepsis Model (Wilk et al., Nat. Med., 2021) | scRNA-seq, scATAC-seq, Surface protein | Critically Ill: 7, Mild: 6, Healthy: 4 | Paired single-cell multi-omics, immune cell states | PRJNA656838 |

MOFA+ Re-analysis Protocol: A Step-by-Step Guide

Experimental Workflow Diagram

[Diagram: 1. Data fetch and pre-processing → (formatted data matrices) → 2. MOFA+ model training → (trained model, latent factors) → 3. Factor analysis and interpretation → (annotated factors and weights) → 4. Downstream integration.]

Diagram Title: MOFA+ Re-analysis Workflow for Sepsis Data

Detailed Protocol Steps

Step 1: Data Acquisition & Pre-processing
  • Download Data: Access raw count matrices and metadata from repositories (e.g., GEO, ArrayExpress) using accession codes in Table 1.
  • Quality Control: Filter low-quality samples/genes. For RNA-seq: keep genes with >10 counts in >10% of samples. Normalize (e.g., variance stabilizing transformation for bulk, log1p for scRNA-seq).
  • Data Integration: Align multiple data views (e.g., RNA, ATAC) using common sample/cell identifiers. Handle missing data explicitly (MOFA+ supports missing views).
  • Format for MOFA+: Create a MultiAssayExperiment R object or a list of individual matrices (features x samples), one per data modality.
Step 2: MOFA+ Model Training
  • Model Creation: model <- create_mofa(data_object)
  • Data Options: Set scaling options per view and group (e.g., scale_views): data_opts <- get_default_data_options(model)
  • Model Options: Define the number of factors, likelihoods, and sparsity priors: model_opts <- get_default_model_options(model)
  • Training Options: Set convergence mode, seed, GPU usage, verbose output, and stochastic inference (for large data): train_opts <- get_default_training_options(model); train_opts$seed <- 42
  • Run Training: model <- prepare_mofa(model, data_options=data_opts, model_options=model_opts, training_options=train_opts); model_trained <- run_mofa(model)
  • Model Selection: Train several models with different seeds and keep the one with the best ELBO; set drop_factor_threshold so that factors explaining negligible variance are removed during training.

Table 2: MOFA+ Training Parameters for Sepsis Data

| Parameter | Recommended Setting | Rationale |
| --- | --- | --- |
| Number of Factors | 10-15 (initial) | Sufficient to capture clinical, batch, and biological heterogeneity. |
| Likelihoods | "gaussian" (normalized data), "poisson" (counts) | Match data distribution. |
| Convergence Mode | "fast" (initial), "slow" (final) | Balance speed vs. precision. |
| Drop Factor Threshold | 0.02-0.03 | Remove factors explaining negligible variance. |
Step 3: Factor Interpretation & Annotation
  • Variance Decomposition: Plot variance explained per factor and view: plot_variance_explained(model_trained)
  • Factor Correlation: Correlate factors with clinical metadata (e.g., SOFA score, survival, infection source).

  • Feature Weight Analysis: Extract top-weighted genes/motifs per factor per view. Perform pathway enrichment (e.g., using fgsea on gene weights).

  • Factor Annotation: Synthesize correlations and top features to label each factor biologically (e.g., "Factor 1: Neutrophil Activation & Severity").
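The feature weight step amounts to ranking features by the absolute value of their weight and keeping the sign for interpretation. A minimal sketch with hypothetical weights for one factor in the RNA view is shown below; the function name is illustrative, and the weights would normally be exported from the trained model.

```python
def top_features(weights, n=3):
    """Return the n features with the largest absolute weight, keeping the sign."""
    ranked = sorted(weights.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return ranked[:n]

# Hypothetical weights for one factor in the RNA view.
factor1_rna = {
    "S100A8": 0.91, "S100A9": 0.88, "HLA-DRA": -0.75,
    "IL1B": 0.62, "CD74": -0.40, "ACTB": 0.02,
}

top = top_features(factor1_rna, n=3)
positive = [gene for gene, w in top if w > 0]  # increase with the factor
negative = [gene for gene, w in top if w < 0]  # decrease with the factor
```

Separating positive and negative weights helps when labeling a factor, since a factor with high S100A8/S100A9 and low HLA-DRA reads differently from one driven by a single direction.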
Step 4: Downstream Biological Integration
  • Cell State Deconvolution: Use factor values in conjunction with cell-type-specific gene signatures (e.g., from CIBERSORTx) to infer shifts in immune cell composition.
  • Trajectory Inference: Apply factors to pseudotime analysis (e.g., with Monocle3) to map patient progression.
  • Hypothesis Generation: Identify key driver genes at the intersection of high-weight features and dysregulated pathways.

Example Signaling Pathway from MOFA+ Factor Weights

A factor that correlates strongly with mortality may implicate specific inflammatory pathways through its top-weighted genes.

[Diagram: TLR4 activation recruits MyD88 → MyD88 activates NF-κB → NF-κB transcribes IL1B, IL6, TNF → expression of these genes drives the association with organ dysfunction.]

Diagram Title: Inflammatory Signaling Pathway Implicated by a MOFA+ Factor

The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Research Reagents & Tools for Sepsis MOFA+ Analysis

| Item / Reagent | Function in Analysis | Example/Provider |
| --- | --- | --- |
| R/Bioconductor | Core statistical computing and MOFA+ package environment. | R 4.3+, BiocManager |
| MOFA+ Package | Primary tool for multi-omics factor analysis. | bioconductor.org/packages/MOFA2 |
| SingleCellExperiment / MultiAssayExperiment | Data containers for organizing multi-omics inputs. | Bioconductor Packages |
| fgsea / clusterProfiler | Perform Gene Set Enrichment Analysis on factor weights. | Bioconductor Packages |
| CIBERSORTx | Deconvolute cell-type proportions from bulk RNA-seq using signature matrices. | cibersortx.stanford.edu |
| Seurat | (If using single-cell data) Pre-processing, clustering, and integration. | satijalab.org/seurat |
| Custom Sepsis Signature Gene Sets | Curated lists for immune cell states, endotoxin response, etc. | MSigDB, literature-derived |
| High-Performance Computing (HPC) Resources | Essential for training models on large cohorts or single-cell data. | Local cluster or cloud (AWS, GCP) |

Application Notes & Protocols

Thesis Context: This protocol is a component of a broader thesis investigating immune cell heterogeneity in sepsis using the Multi-Omics Factor Analysis+ (MOFA+) framework. A critical step involves validating the stability and reproducibility of the identified latent factors and biological signatures across independent patient cohorts to ensure robust, translatable findings.

I. Protocol: Multi-Cohort Stability Analysis for MOFA+ Models

Objective: To assess the robustness of MOFA+ models by evaluating the consistency of latent factors (LFs) across multiple independent sepsis cohorts.

Materials & Experimental Setup:

  • Data: Pre-processed, batch-corrected multi-omics data (e.g., transcriptomics, proteomics, methylation) from at least two independent sepsis cohorts (e.g., publicly available ICU cohorts like MARS, SPROUT, or in-house collections).
  • Software: R/Python with MOFA2 package, statistical packages for clustering (e.g., stats, factoextra) and correlation (e.g., Hmisc).

Procedure:

  • Independent Model Training: Train separate MOFA+ models on each cohort using identical parameters (e.g., number of factors, sparsity options, convergence criteria).
  • Factor Matching: Compare factors across models based on:
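    • Sample Score Similarity: Compare the factor values (sample-level scores) of samples matched by clinical phenotype (e.g., septic shock vs. non-septic shock). Independent cohorts share no samples, so the comparison is made at the phenotype level rather than sample by sample.
    • Feature Weight Concordance: For each LF, calculate the correlation of gene/feature weights (top 500 weighted features) between cohorts.
  • Stability Metric Calculation: Generate a Factor Stability Matrix. For each LF in Cohort A, compute its maximum absolute correlation with any LF in Cohort B (the sign of a factor is arbitrary, so a strong negative correlation also indicates a match).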

Data Presentation: Table 1: Stability Matrix of Latent Factors (LFs) Between Cohort A and Cohort B

| Cohort A LF | Max Correlation with Cohort B LF (LF ID) | Matched Phenotype Association |
|---|---|---|
| LF1 | 0.92 (LF2) | Septic Shock Severity |
| LF2 | 0.87 (LF1) | Lymphocyte Dysfunction |
| LF3 | 0.45 (LF5) | Metabolic Reprogramming |
| LF4 | 0.91 (LF4) | Neutrophil Activation |
| LF5 | 0.39 (N/A) | (Cohort-specific artifact) |

Interpretation: LFs with cross-cohort correlations >0.8 are considered highly reproducible. LFs with correlations <0.5 may represent cohort-specific technical variation or unique biology requiring further scrutiny.
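The stability calculation can be sketched as follows. This is a minimal illustration under stated assumptions: factor weights are held in plain dictionaries (hypothetical toy values), all shared features are used rather than the top 500, and the "intermediate" label for correlations between 0.5 and 0.8 is an assumption, since the protocol defines only the two outer thresholds.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 0.0 if sx == 0 or sy == 0 else cov / (sx * sy)


def stability_matrix(weights_a, weights_b):
    """For each factor in cohort A, find the cohort B factor with the highest
    absolute weight correlation over the features both cohorts share."""
    result = {}
    for fa, wa in weights_a.items():
        best_name, best_r = None, 0.0
        for fb, wb in weights_b.items():
            shared = sorted(set(wa) & set(wb))
            if len(shared) < 3:
                continue  # too few shared features for a meaningful correlation
            r = abs(pearson([wa[g] for g in shared], [wb[g] for g in shared]))
            if best_name is None or r > best_r:
                best_name, best_r = fb, r
        result[fa] = (best_name, best_r)
    return result


def reproducibility_label(r):
    """Thresholds from the protocol; the middle band is an assumed label."""
    if r > 0.8:
        return "highly reproducible"
    if r < 0.5:
        return "requires scrutiny"
    return "intermediate"


# Hypothetical weights: cohort B's LF2 is a sign-flipped, rescaled copy of cohort A's LF1.
cohort_a = {"LF1": {"g1": 1.0, "g2": 0.5, "g3": -0.5, "g4": 0.0}}
cohort_b = {
    "LF1": {"g1": 0.1, "g2": -0.3, "g3": 0.2, "g4": 0.4},
    "LF2": {"g1": -2.0, "g2": -1.0, "g3": 1.0, "g4": 0.0},
}
matches = stability_matrix(cohort_a, cohort_b)
print(matches, reproducibility_label(matches["LF1"][1]))
```

Taking the absolute value matters here: the matched factor has the opposite sign in Cohort B, which a signed maximum would miss.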

II. Protocol: Reproducibility Assessment of Biological Signatures

Objective: To validate the biological interpretation of reproducible LFs through independent pathway analysis and deconvolution.

Procedure:

  • Pathway Enrichment Concordance: Perform over-representation analysis (ORA) or gene set enrichment analysis (GSEA) on the gene weights from matched LFs (e.g., Cohort A LF1 & Cohort B LF2). Use databases like Reactome, MSigDB Hallmarks.
  • Cell Type Deconvolution Validation: Use the transcriptomic feature weights of an "immune cell" LF as a signature matrix input for an independent deconvolution tool (e.g., CIBERSORTx, MuSiC) applied to bulk RNA-seq from a third validation cohort.
  • Correlation with External Clinical Biomarkers: Correlate the sample-level factor values of validated LFs with orthogonal clinical laboratory values (e.g., plasma IL-6, lactate, monocyte HLA-DR expression via flow cytometry) available in the independent cohorts.

Data Presentation: Table 2: Reproducibility of Top Enriched Pathways for Matched LF1/LF2 Across Cohorts

| Pathway Name (MSigDB Hallmark) | Cohort A FDR | Cohort B FDR | Concordance Status |
|---|---|---|---|
| INFLAMMATORY RESPONSE | 2.5E-12 | 3.7E-09 | High |
| INTERFERON GAMMA RESPONSE | 1.8E-08 | 4.1E-06 | High |
| OXIDATIVE PHOSPHORYLATION | 0.67 | 0.72 | Not Significant |
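One way the concordance status in Table 2 could be assigned is sketched below. The FDR cutoff of 0.05 and the "Discordant" label for pathways significant in only one cohort are assumptions; the protocol does not state either. The sketch also ignores enrichment direction, which a fuller check would compare as well.

```python
def concordance_status(fdr_a, fdr_b, alpha=0.05):
    """Classify a pathway by whether it passes the FDR cutoff in each cohort."""
    sig_a, sig_b = fdr_a < alpha, fdr_b < alpha
    if sig_a and sig_b:
        return "High"
    if not sig_a and not sig_b:
        return "Not Significant"
    return "Discordant"  # assumed label: significant in one cohort only


# FDR values taken from Table 2.
table2 = {
    "INFLAMMATORY RESPONSE": (2.5e-12, 3.7e-09),
    "INTERFERON GAMMA RESPONSE": (1.8e-08, 4.1e-06),
    "OXIDATIVE PHOSPHORYLATION": (0.67, 0.72),
}
statuses = {name: concordance_status(a, b) for name, (a, b) in table2.items()}
print(statuses)
```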

Visualizations

Cohort A Multi-omics Data → MOFA+ Model A
Cohort B Multi-omics Data → MOFA+ Model B
MOFA+ Model A and MOFA+ Model B → Factor Matching (Loadings & Weights) → Stability Matrix (Table 1) → Biological Validation (Pathways, Deconvolution) → Set of Robust, Reproducible Latent Factors

Workflow for Multi-Cohort Robustness Assessment

Start: Evaluate Latent Factor
Q1: Cross-cohort correlation > 0.8? Yes → Q2. No → REQUIRES INVESTIGATION.
Q2: Pathway enrichment concordant? Yes → Q3. No → REQUIRES INVESTIGATION.
Q3: Links to orthogonal clinical biomarkers? Yes → ROBUST FACTOR (include in final model). No → REQUIRES INVESTIGATION.
REQUIRES INVESTIGATION: Check for batch effects or unique biology.
LIKELY ARTIFACT / NOISE: Consider removal.

Logic Tree for Assessing Factor Robustness
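The logic tree can be written as a short function, which makes the order of the checks explicit. This is a sketch only: the function and argument names are hypothetical, and the two boolean inputs stand for judgments made in Protocol II.

```python
def assess_factor(cross_cohort_r, pathways_concordant, linked_to_biomarkers,
                  threshold=0.8):
    """Walk the three questions of the logic tree in order."""
    if cross_cohort_r <= threshold:
        return "REQUIRES INVESTIGATION: cross-cohort correlation too low"
    if not pathways_concordant:
        return "REQUIRES INVESTIGATION: pathway enrichment not concordant"
    if not linked_to_biomarkers:
        return "REQUIRES INVESTIGATION: no link to orthogonal clinical biomarkers"
    return "ROBUST FACTOR"


# Correlations from Table 1; the boolean answers are illustrative.
print(assess_factor(0.92, True, True))   # LF1
print(assess_factor(0.39, True, True))   # LF5
```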

The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Materials for Sepsis Multi-Omics Robustness Studies

| Item / Reagent | Function & Application Note |
|---|---|
| MOFA2 (R/Python Package) | Core tool for multi-omics integration. Use for training separate models on each cohort. |
| CIBERSORTx | Independent digital cytometry tool for validating immune cell heterogeneity signatures from MOFA+ LFs. |
| MSigDB Hallmark Gene Sets | Curated biological pathways for enrichment analysis to confirm interpretative reproducibility. |
| Batch Effect Correction Tools (ComBat, Harmony) | Critical for pre-processing cohorts independently before MOFA+ analysis to mitigate technical confounders. |
| Flow Cytometry Antibody Panel (e.g., HLA-DR, CD14, CD64) | For orthogonal validation of monocyte/immune cell states predicted by MOFA+ factors. |
| Multiplex Immunoassay (e.g., Luminex) | To measure plasma cytokine levels (IL-6, IL-10, etc.) for correlation with inflammatory LFs. |

Application Notes

This document provides guidance for researchers in immunology and sepsis on selecting the Multi-Omics Factor Analysis+ (MOFA+) framework for multi-view data integration. Its application is contextualized within a thesis investigating immune cell heterogeneity in sepsis to identify novel prognostic biomarkers and therapeutic targets.

Key Strengths of MOFA+

  • Interpretable Latent Factors: MOFA+ models multi-omics data as a linear combination of latent factors that capture shared and specific sources of variation. This is critical for sepsis research, where factors may correspond to distinct immune cell states, pathways, or patient subtypes driving heterogeneity.
  • Handling Sparsity and Noise: It is robust to common challenges in biological data (e.g., dropouts in scRNA-seq, missing values in proteomics), making it suitable for complex clinical datasets from septic patients.
  • Flexibility in Data Types: Integrates continuous (gene expression, cytokine levels), binary (mutation status), and count (ATAC-seq peaks) data by assigning each view an appropriate likelihood (Gaussian, Bernoulli, or Poisson), without need for pre-harmonization.
  • Model Selection & Dimensionality: Automatic relevance determination priors prune factors that explain little variance, and the evidence lower bound (ELBO) provides a criterion for comparing trained models. Together these help limit overfitting, a key advantage when analyzing high-dimensional data from limited patient cohorts.
  • Downstream Analysis Ready: Outputs are directly amenable to association with clinical traits (e.g., SOFA score, survival), pathway enrichment, and driver feature identification.

Recognized Limitations of MOFA+

  • Linear Assumption: The linearity of the model may fail to capture complex, non-linear interactions prevalent in immune signaling networks.
  • Factor Interpretability: While factors are interpretable, their biological meaning is not automatic and requires expert-driven validation—a resource-intensive step in sepsis studies.
  • Scalability: For extremely large single-cell datasets (>100k cells), computational runtime and memory can become limiting compared to some deep learning alternatives.
  • Temporal Dynamics: Standard MOFA+ does not explicitly model time-series data, a common design in longitudinal sepsis studies tracking immune reconstitution.

Decision Framework: MOFA+ vs. Other Methods

The following table summarizes quantitative and qualitative comparisons to guide method selection.

Table 1: Comparative Analysis of Multi-Omics Integration Methods

| Method | Core Approach | Best for When Your Sepsis Study Requires... | Key Limitation vs. MOFA+ |
|---|---|---|---|
| MOFA+ | Probabilistic, factor analysis | Interpretable latent drivers; mixed data types; handling missing data. | N/A (Baseline) |
| WNN (Seurat) | Weighted nearest neighbors | Primary goal is cell clustering/annotation from scRNA+scATAC; cellular resolution. | Less global view of shared factors; limited to cell-level data. |
| Structural Equation Models | Causal path modeling | Testing predefined causal hypotheses between omics layers and clinical outcome. | Requires strong prior knowledge; less exploratory. |
| Deep Learning (e.g., DCA) | Non-linear autoencoders | Capturing complex, non-linear interactions; denoising data. | "Black-box" nature reduces interpretability of integrated axes. |
| Regularized Canonical Correlation Analysis (rCCA) | Maximizing correlation | Focusing strictly on correlations between two predefined omics views. | Hard to scale >2 views; factors are not necessarily shared across all views. |
| Integration-Driven Clustering (e.g., COCOS) | Joint clustering | Directly assigning patients/cells to clusters without factor decomposition. | Loss of continuous variation information crucial for heterogeneity spectra. |

Experimental Protocols

Protocol: MOFA+ Workflow for Sepsis PBMC Multi-Omics Data

Objective: To integrate transcriptomic (RNA-seq) and epigenomic (ATAC-seq) data from septic patient PBMCs to identify coordinated immune regulatory programs.

Materials:

  • Input Data: Matrices (cells/patients x features) for each assay. Example: scRNA-seq (log-normalized counts) and scATAC-seq (peak accessibility counts) from the same patient cohort.
  • Software: R (≥4.0) with MOFA2 package installed.

Procedure:

  • Data Preparation:
    • Format: Create a list where each element is a matrix corresponding to an omics view. Ensure samples (cells/patients) are aligned across views (matched design).
    • Feature Selection: Reduce dimensionality per view to manage runtime and noise.
      • RNA-seq: Select top 5,000 highly variable genes.
      • ATAC-seq: Select top 25,000 most accessible peaks.
    • Center Data: Center features to have zero mean.
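  • MOFA Object Creation & Training: Create the MOFA object from the list of matrices, set the data, model, and training options, then train the model (in the MOFA2 R package: create_mofa(), prepare_mofa(), run_mofa()).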

  • Model Inspection & Downstream Analysis:

    • Variance Explained: Use plot_variance_explained(mofa_model) to assess contribution of factors to each view.
    • Factor-Trait Association: Correlate factor values with clinical traits (e.g., correlate_factors_with_covariates).
    • Feature Weights: Extract and interpret top-weighted genes/peaks per factor for biological insight (e.g., pathway enrichment on top RNA weights).
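The data preparation step above (feature selection followed by centering) can be sketched with the standard library alone. This is an illustration under stated assumptions, not MOFA2 code: matrices are lists of sample rows, the toy values are hypothetical, "highly variable" is approximated by raw per-feature variance, and "most accessible" by the per-peak sum.

```python
from statistics import mean, pvariance


def select_top_features(matrix, feature_names, k, score=pvariance):
    """Keep the k features (columns) with the highest score.

    score is applied to each column: pvariance for variable genes,
    sum for the most accessible peaks.
    """
    columns = list(zip(*matrix))
    scores = [score(col) for col in columns]
    keep = sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)[:k]
    keep.sort()  # preserve the original feature order
    reduced = [[row[j] for j in keep] for row in matrix]
    return reduced, [feature_names[j] for j in keep]


def center_features(matrix):
    """Subtract each feature's mean so every column has zero mean."""
    col_means = [mean(col) for col in zip(*matrix)]
    return [[v - m for v, m in zip(row, col_means)] for row in matrix]


# Hypothetical toy data: 3 samples x 3 features per view.
rna = [[1.0, 5.0, 2.0], [1.0, 7.0, 3.0], [1.0, 9.0, 1.0]]
genes = ["ACTB", "IL1B", "TNF"]
atac = [[0, 3, 1], [1, 4, 0], [0, 5, 1]]
peaks = ["peak1", "peak2", "peak3"]

rna_top, kept_genes = select_top_features(rna, genes, k=2)
rna_centered = center_features(rna_top)
_, kept_peaks = select_top_features(atac, peaks, k=1, score=sum)
print(kept_genes, rna_centered, kept_peaks)
```

In a real analysis k would be 5,000 genes and 25,000 peaks as listed above, and the selection would usually be done with a dedicated single-cell toolkit before building the list of views.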

Protocol: Validation via Flow Cytometry

Objective: Validate a latent factor identified as associated with monocyte dysregulation.

Materials:

  • PBMC samples from the same septic patient cohort.
  • Antibody panel: CD14, CD16, HLA-DR, PD-L1, CD86.
  • Flow cytometer and analysis software (e.g., FlowJo).

Procedure:

  • Stain fresh or cryopreserved PBMCs with the antibody panel.
  • Acquire data on flow cytometer, collecting ≥100,000 events per sample.
  • Gate live, single cells, then identify monocyte subsets (classical: CD14++CD16-, intermediate: CD14++CD16+, non-classical: CD14+CD16++).
  • Calculate the geometric mean fluorescence intensity (gMFI) of HLA-DR and PD-L1 on each subset.
  • Statistically correlate the gMFI of HLA-DR on classical monocytes with the sample-level values of the MOFA+ factor of interest using Spearman's rank correlation.
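The final correlation step can be sketched as below. Spearman's coefficient is computed as the Pearson correlation of average ranks, which handles ties. The gMFI and factor values are hypothetical, and the sketch returns only the coefficient; a library routine such as scipy.stats.spearmanr would be the usual choice when a p-value is needed.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 0.0 if sx == 0 or sy == 0 else cov / (sx * sy)


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-patient values: HLA-DR gMFI on classical monocytes
# and the sample-level values of the factor of interest.
hla_dr_gmfi = [5200, 3100, 1500, 900, 4100]
factor_values = [-1.2, -0.1, 0.8, 1.5, -0.6]

rho = spearman(hla_dr_gmfi, factor_values)
print(round(rho, 3))
```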

Visualizations

Inputs: scRNA-seq Matrix (Patients x Genes), scATAC-seq Matrix (Patients x Peaks), Clinical Traits (e.g., SOFA Score) → MOFA+ Model Training (Infer Latent Factors)
MOFA+ Model Training → Factor 1 (e.g., Myeloid Dysfunction), Factor 2 (e.g., T Cell Exhaustion), ..., Factor N
Factor 1 → Trait Association Analysis → Factor-Trait Correlation Plot
Factor 1 → Feature Weights (Genes/Peaks) → Pathway Enrichment Results

MOFA+ Analysis Workflow for Sepsis Multi-Omics

Start: Multi-omics Integration Goal
Q1: Primary need is interpretable, global latent factors? Yes → CHOOSE MOFA+. No → Q2.
Q2: Data includes >2 views or mixed data types (binary/continuous)? Yes → CHOOSE MOFA+. No → Q3.
Q3: Key aim is causal inference with strong priors? Yes → Consider Structural Equation Models. No → Q4.
Q4: Dataset is extremely large (>100k single cells)? Yes → Consider Deep Learning Methods (e.g., DCA). No → Consider WNN (Seurat) or rCCA.

Decision Tree for Choosing MOFA+
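For readers who prefer code to diagrams, the decision tree can be expressed as a function. The function and argument names are hypothetical; the branches and their order follow the four questions of the tree.

```python
def choose_method(needs_interpretable_factors, many_views_or_mixed_types,
                  causal_with_strong_priors, very_large_dataset):
    """Answer the four questions of the decision tree in order."""
    if needs_interpretable_factors or many_views_or_mixed_types:
        return "MOFA+"
    if causal_with_strong_priors:
        return "Structural Equation Models"
    if very_large_dataset:
        return "Deep Learning Methods (e.g., DCA)"
    return "WNN (Seurat) or rCCA"


# Example: a two-view study with no causal priors and a modest cell count.
print(choose_method(False, False, False, False))
```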

The Scientist's Toolkit

Table 2: Essential Research Reagents & Solutions for MOFA+ in Sepsis Immunology

| Item | Function in the Workflow | Example/Provider |
|---|---|---|
| CITE-seq Antibody Panel | Simultaneous surface protein quantification with transcriptomics at single-cell level. Provides a direct multi-modal view for integration. | TotalSeq (BioLegend), Feature Barcoding (10x Genomics) |
| Cell Hashing Reagents | Enables sample multiplexing, reducing batch effects, a critical pre-processing step for clean MOFA+ input. | Hashtag Antibodies (BioLegend) |
| Nuclei Isolation Kit | For generating high-quality nuclei from frozen tissue (e.g., spleen) for snRNA-/ATAC-seq in sepsis biobanks. | Nuclei EZ Lysis (Sigma), 10x Nuclei Isolation Kit |
| CRISPR Screen Library | To functionally validate MOFA+-identified driver genes in immune cell activation pathways. | sgRNA library (e.g., the genome-wide Brunello library, or a myeloid-focused sublibrary) |
| Cytokine Multiplex Assay | Profiles soluble immune mediators (serum/plasma) as an additional 'omics' view or for trait correlation. | Luminex xMAP Assay, Olink Target 96 |
| MOFA2 R Package | Core software for statistical modeling and analysis of multi-view data. | https://biofam.github.io/MOFA2/ |
| Seurat R Toolkit | For single-cell data pre-processing, WNN integration (comparison), and visualization of MOFA+ outputs. | https://satijalab.org/seurat/ |
| Harmonization Software | Batch correction prior to MOFA+ if strong technical confounding is present (use cautiously). | Harmony, ComBat |

Conclusion

MOFA+ represents a powerful, flexible framework for moving beyond descriptive catalogs of immune cells in sepsis towards a mechanistic understanding of the coordinated, multi-omic programs driving heterogeneity and patient outcomes. By effectively integrating disparate data types, it uncovers latent factors that correspond to novel cell states, dysfunctional pathways, and patient endotypes, providing a robust data-driven foundation for target identification. Future directions involve tighter integration with temporal data to model immune trajectory, application to clinical trial stratification, and the development of interpretable AI models built upon MOFA+-derived factors. For drug developers, this approach promises to deconvolve sepsis complexity, revealing precise, tractable intervention points for next-generation immunotherapies.