Cell2Net Model Interpretation: Transcription Factor Attribution#

This tutorial demonstrates how to interpret trained Cell2Net models by analyzing transcription factor (TF) contributions to gene expression predictions. Using Integrated Gradients attribution, we identify which transcription factors contribute most to the predicted expression of each target gene across different immune cell types.

Overview#

Cell2Net model interpretation reveals the regulatory logic learned during training by:

TF Attribution Analysis: Quantifying how much each transcription factor contributes to gene expression predictions
Cell Type Specificity: Analyzing TF importance across different PBMC cell populations
Regulatory Networks: Constructing TF-to-gene regulatory relationships from model interpretations
Biological Validation: Connecting computational predictions to known regulatory mechanisms

Biological Significance#

TF attribution analysis provides insights into:

Regulatory Hierarchies: Which TFs are master regulators vs. downstream effectors
Cell Type Programs: How different immune cells use distinct TF combinations
Therapeutic Targets: Key regulatory nodes for intervention strategies

Technical Framework#

We use the Integrated Gradients attribution method for the following reasons:

Model-agnostic: Works with any differentiable model architecture
Faithful attribution: Satisfies completeness and sensitivity axioms
Gradient-based: Efficiently computes feature importance through backpropagation
Baseline comparison: Measures importance relative to neutral reference state

import warnings
warnings.filterwarnings("ignore")

import os
import numpy as np
import mudata as md
import cell2net as cn
from tqdm import tqdm
import pandas as pd
md.set_options(pull_on_update=False)

<mudata._core.config.set_options at 0x7f4f9b24b860>

2. Setup File Paths and Output Directory#

Configure input and output directories for TF attribution analysis:

data_dir: Path to prepared multiome dataset with peak-to-gene associations
in_dir: Directory containing trained Cell2Net models from previous tutorial step
out_dir: New directory for storing TF attribution results and regulatory networks

This organized structure ensures:

Computational reproducibility: Clear tracking of model inputs and interpretation outputs
Result organization: Systematic storage of attribution matrices and TF-gene relationships
Analysis pipeline: Seamless integration with downstream regulatory network analysis

data_dir = "./02_prepare_data/mdata.h5mu"
in_dir = "./03_train_cell2net"
out_dir = "./05_to_gene"

os.makedirs(out_dir, exist_ok=True)

3. Load Multiome Dataset#

Load the prepared multiome dataset containing:

RNA expression: Gene expression measurements across PBMC cell types
ATAC accessibility: Chromatin accessibility profiles for regulatory regions
Peak-to-gene associations: Regulatory links between accessible peaks and target genes
Sequence information: DNA sequences around peaks for motif analysis
Cell annotations: Cell type labels for cell-type-specific TF analysis

This dataset provides the foundation for interpreting how Cell2Net models learned cell-type-specific regulatory relationships from sequence and accessibility features.

mdata_bulk = md.read_h5mu(data_dir)

4. Extract Target Gene List#

Extract the list of genes that have trained Cell2Net models available for interpretation:

Gene selection: Only genes with sufficient peak-to-gene associations and trained models
Regulatory focus: Genes with well-characterized regulatory regions in the PBMC dataset
Model availability: Ensures consistency between training and interpretation phases

Each gene in this list represents a regulatory target with learned Cell2Net models that can be interpreted to reveal transcription factor contributions and cell-type-specific regulatory mechanisms.

genes = mdata_bulk.uns['peak_to_gene']['gene'].unique().tolist()
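
As a minimal illustration of what this line does, the sketch below applies the same pandas calls to a small made-up table. Only the `gene` column name comes from the code above; the `peak` column, the peak labels, and the variable names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical peak-to-gene table: one row per (peak, gene) link.
# Only the "gene" column is taken from the tutorial; everything else is made up.
peak_to_gene = pd.DataFrame(
    {
        "peak": ["peak_1", "peak_2", "peak_3", "peak_4"],
        "gene": ["ISG15", "ISG15", "CLIC2", "ISG15"],
    }
)

# unique() drops repeated gene names and keeps the order of first appearance,
# so each gene is visited once in the attribution loop.
genes_demo = peak_to_gene["gene"].unique().tolist()
print(genes_demo)
```

Because a gene usually has several linked peaks, deduplicating here avoids running the attribution more than once per gene.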

5. Transcription Factor Attribution Analysis#

This is the main computational loop that performs TF attribution analysis for each trained Cell2Net model:

Model Loading and Setup#

For each gene, we:

Initialize Cell2Net model with the same architecture used during training
Load trained weights from the saved model checkpoints
Transfer to GPU for efficient gradient computation during attribution

Attribution Computation#

The cn.ip.tf_attr() function implements Integrated Gradients attribution:

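n_steps=100: Number of integration steps along the path from baseline to input; more steps give a closer approximation of the integral at higher compute cost
multiply_by_inputs=True: Scale attributions by input values for interpretability
batch_size=2: Process two samples per batch, which limits GPU memory use during gradient computation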

Technical Details#

Integrated Gradients Method:

Computes gradients along a straight path from baseline (neutral) to actual input
Satisfies the completeness axiom (attributions sum to the difference between the model output at the input and at the baseline)
Provides stable, robust feature importance scores
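
The points above can be made concrete with a toy example. The sketch below is not the Cell2Net implementation: it approximates Integrated Gradients with NumPy for a hand-written two-input function, and the function `f`, its gradient, and the all-zero baseline are assumptions chosen so that the completeness property is easy to check.

```python
import numpy as np

def integrated_gradients(grad_fn, x, baseline, n_steps=100):
    """Approximate Integrated Gradients with a midpoint Riemann sum."""
    # Interpolation coefficients at the midpoints of n_steps equal intervals.
    alphas = (np.arange(n_steps) + 0.5) / n_steps
    # Points on the straight path from baseline to x, shape (n_steps, n_features).
    path = baseline + alphas[:, None] * (x - baseline)
    # Average the gradient over the path.
    avg_grad = np.mean([grad_fn(p) for p in path], axis=0)
    # Scale the averaged gradient by the distance from the baseline.
    return (x - baseline) * avg_grad

# Toy differentiable "model": f(x) = x0^2 + 3 * x0 * x1
def f(x):
    return x[0] ** 2 + 3.0 * x[0] * x[1]

def f_grad(x):
    return np.array([2.0 * x[0] + 3.0 * x[1], 3.0 * x[0]])

x = np.array([1.0, 2.0])
baseline = np.zeros(2)  # assumed neutral reference state

attr = integrated_gradients(f_grad, x, baseline, n_steps=100)

# Completeness: attributions should sum to f(x) - f(baseline).
print(attr, attr.sum(), f(x) - f(baseline))
```

In a deep model the gradient comes from backpropagation instead of a hand-written function, and the choice of baseline strongly affects the resulting scores.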

Expression-Level Analysis:

Attributions are computed for each TF motif across all regulatory peaks
Results capture how TF expression contributes to target gene expression
Cell-type-specific analysis reveals how regulation depends on cellular context

Output Generation#

For each gene, we save:

Raw attributions (.npy): Full attribution matrices for detailed analysis
TF-gene relationships (.csv): Top 10 TFs per cell type (n_tfs=10), with the mean and standard deviation of their attribution scores
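
To show how a per-cell-type summary of this kind could be derived from a raw attribution matrix, here is a hedged pandas sketch. It is not the code behind `cn.ip.tf_to_gene`: the helper `top_tfs_per_group`, the assumed (samples x TFs) layout of the matrix, and the ranking by plain mean attribution are all assumptions, and the input values are random.

```python
import numpy as np
import pandas as pd

def top_tfs_per_group(attr, tf_names, groups, gene, n_tfs=10):
    """Hypothetical helper: summarize a (samples x TFs) matrix per group."""
    long = pd.DataFrame(attr, columns=tf_names)
    long["cell_type_v2"] = list(groups)
    # One row per (sample, TF) pair.
    long = long.melt(id_vars="cell_type_v2", var_name="tf", value_name="attr")
    # Mean and standard deviation of the attribution per cell type and TF.
    stats = (
        long.groupby(["cell_type_v2", "tf"])["attr"]
        .agg(mean_attr="mean", std_attr="std")
        .reset_index()
    )
    stats["gene"] = gene
    # Keep the n_tfs highest-scoring TFs within each cell type.
    top = (
        stats.sort_values(["cell_type_v2", "mean_attr"], ascending=[True, False])
        .groupby("cell_type_v2")
        .head(n_tfs)
    )
    cols = ["tf", "gene", "cell_type_v2", "mean_attr", "std_attr"]
    return top[cols].reset_index(drop=True)

# Random stand-in data: 6 samples, 4 TFs, 2 cell types.
rng = np.random.default_rng(0)
tf_names = ["KLF2", "ELF1", "MEF2C", "JUNB"]
groups = ["B cell"] * 3 + ["pDC"] * 3
attr = rng.normal(size=(6, 4))

summary = top_tfs_per_group(attr, tf_names, groups, gene="ISG15", n_tfs=2)
print(summary)
```

Keeping the raw `.npy` file alongside the summary means the top-N cutoff or the ranking rule can be changed later without rerunning the attribution.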

Biological Interpretation#

The attribution scores reveal:

Positive attributions: TFs whose input raises the predicted expression of the gene, consistent with an activating role
Negative attributions: TFs whose input lowers the predicted expression, consistent with repression or competition for binding
Cell-type specificity: Different TF importance across immune cell populations
Regulatory logic: How combinations of TFs work together to control genes

This systematic analysis reveals the regulatory code learned by Cell2Net models and provides mechanistic insights into immune cell gene regulation.

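for gene in tqdm(genes):
    npy_path = f"{out_dir}/{gene}.npy"
    csv_path = f"{out_dir}/{gene}.csv"

    # Skip genes whose outputs both exist, so an interrupted run can be resumed
    if os.path.exists(npy_path) and os.path.exists(csv_path):
        continue

    # Rebuild the model with the same architecture used during training
    model = cn.pd.model.Cell2Net(mdata=mdata_bulk,
                                 gene=gene,
                                 covariates=['total_counts_rna_log', 'total_counts_atac_log'])

    # Load the trained weights and move the model to the GPU
    model.load(dir_path=f"{in_dir}/model")
    model.to_device('cuda:0')

    # Integrated Gradients attribution for each TF
    tf_attr = cn.ip.tf_attr(model,
                            batch_size=2,
                            n_steps=100,
                            multiply_by_inputs=True)

    # Summarize the top 10 TFs per cell type, then save raw and summarized results
    df = cn.ip.tf_to_gene(model.mdata, tf_attr, groupby="cell_type_v2", n_tfs=10)
    np.save(npy_path, tf_attr)
    df.to_csv(csv_path, index=False)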

6. Compile TF-Gene Regulatory Networks#

Aggregate individual gene attribution results into a comprehensive regulatory network dataset:

Data Integration#

Load individual results: Read TF attribution CSV files for each analyzed gene
Concatenate datasets: Combine all TF-gene relationships into single dataframe
Network construction: Build comprehensive regulatory network from attribution scores

Network Structure#

The compiled dataset contains:

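tf: Transcription factor names from JASPAR2024 database
gene: Target gene symbol
cell_type_v2: PBMC cell type for which the attribution was summarized
mean_attr: Mean attribution score, the quantitative measure of TF importance for the gene
std_attr: Standard deviation of the attribution score

There is no separate rank column. In the output shown in this tutorial, rows within each gene and cell type run from highest to lowest mean_attr, so row order gives the TF ranking.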

This integrated network provides a systems-level view of transcription factor regulation across the PBMC immune system.

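# Read the per-gene summary tables written by the attribution loop
df_list = []
for gene in genes:
    df = pd.read_csv(f"{out_dir}/{gene}.csv")
    df_list.append(df)

# Variant with a fresh 0..N-1 index (not used in the steps below)
df_p2g = pd.concat(df_list).reset_index(drop=True)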

Consolidate Network Data#

Concatenate the per-gene tables into a single dataframe for analysis and visualization. The index is not reset here, so each gene's rows keep the row numbers from their own CSV file, which is why the index in the output below restarts for every gene.

df = pd.concat(df_list)
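
The difference between keeping and renumbering the index can be seen in a small sketch with two toy two-row frames. The frames and variable names are assumptions for illustration; only the `pd.concat` behavior is the point.

```python
import pandas as pd

# Two small per-gene tables, each with its own 0-based index.
a = pd.DataFrame({"tf": ["KLF2", "ELF1"], "gene": ["ISG15", "ISG15"]})
b = pd.DataFrame({"tf": ["ARID3A", "MAX"], "gene": ["CLIC2", "CLIC2"]})

# Default concat keeps each frame's own index, so labels repeat.
kept = pd.concat([a, b])

# ignore_index=True assigns a fresh 0..N-1 index instead.
renumbered = pd.concat([a, b], ignore_index=True)

print(kept.index.tolist())
print(renumbered.index.tolist())
```

A repeated index is harmless for display, but label-based lookups such as `kept.loc[0]` then return one row per gene, which is worth keeping in mind in downstream code.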

Inspect the Regulatory Network#

df

	tf	gene	cell_type_v2	mean_attr	std_attr
0	KLF2	ISG15	B cell	1.415844	0.468231
1	ELF1	ISG15	B cell	0.975179	0.298402
2	MEF2C	ISG15	B cell	0.851608	0.448217
3	IKZF3	ISG15	B cell	0.699268	0.239014
4	JUNB	ISG15	B cell	0.691353	0.323894
...	...	...	...	...	...
75	ARID3A	CLIC2	pDC	0.167850	0.062825
76	MAX	CLIC2	pDC	0.154308	0.050194
77	CREB3L2	CLIC2	pDC	0.151772	0.049553
78	MGA	CLIC2	pDC	0.150544	0.062157
79	JUNB	CLIC2	pDC	0.141468	0.071961

154160 rows × 5 columns
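
A table in this format can be queried with ordinary pandas filtering. The sketch below rebuilds only the ten rows visible in the output above and looks up the strongest TFs for one gene and cell type; the helper name `top_tfs` is an assumption, not part of Cell2Net.

```python
import pandas as pd

# The ten rows visible in the displayed output (first five and last five).
rows = [
    ("KLF2", "ISG15", "B cell", 1.415844, 0.468231),
    ("ELF1", "ISG15", "B cell", 0.975179, 0.298402),
    ("MEF2C", "ISG15", "B cell", 0.851608, 0.448217),
    ("IKZF3", "ISG15", "B cell", 0.699268, 0.239014),
    ("JUNB", "ISG15", "B cell", 0.691353, 0.323894),
    ("ARID3A", "CLIC2", "pDC", 0.167850, 0.062825),
    ("MAX", "CLIC2", "pDC", 0.154308, 0.050194),
    ("CREB3L2", "CLIC2", "pDC", 0.151772, 0.049553),
    ("MGA", "CLIC2", "pDC", 0.150544, 0.062157),
    ("JUNB", "CLIC2", "pDC", 0.141468, 0.071961),
]
net = pd.DataFrame(rows, columns=["tf", "gene", "cell_type_v2", "mean_attr", "std_attr"])

def top_tfs(net, gene, cell_type, n=3):
    """Hypothetical helper: strongest TFs for one gene in one cell type."""
    sel = net[(net["gene"] == gene) & (net["cell_type_v2"] == cell_type)]
    return sel.nlargest(n, "mean_attr")[["tf", "mean_attr", "std_attr"]]

isg15_b = top_tfs(net, "ISG15", "B cell")
print(isg15_b)
```

Comparing `mean_attr` against `std_attr` for each row gives a rough sense of how consistent a TF's contribution is within the cell type.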

7. Save Comprehensive Regulatory Network#

Export the complete TF-gene regulatory network to CSV format for downstream analysis and biological interpretation:

Saved Dataset Structure#

Comprehensive coverage: All TF-gene relationships across PBMC cell types
Quantitative scores: Attribution-based importance measures for each regulatory edge
Cell-type resolution: Context-specific regulatory relationships for each immune cell population
Standardized format: Ready for network analysis, visualization, and biological validation

Applications of the Regulatory Network#

Biological Discovery:

Master regulators: Identify TFs with broad regulatory influence across immune genes
Cell-type programs: Understand how different TFs drive cell-type-specific expression
Regulatory modules: Group genes by shared TF regulatory patterns
Disease mechanisms: Connect regulatory disruptions to immune disorders

Computational Analysis:

Network topology: Analyze regulatory network structure and connectivity
Pathway enrichment: Link TF targets to biological pathways and processes
Comparative analysis: Compare regulatory networks across conditions or diseases
Predictive modeling: Use regulatory relationships for gene expression prediction

Experimental Design:

Target prioritization: Select key TFs for experimental validation
Perturbation experiments: Design TF knockout/overexpression studies
Drug discovery: Identify regulatory nodes for therapeutic intervention
Biomarker development: Use TF activity signatures for cell state classification

This comprehensive regulatory network represents the interpretable regulatory logic learned by Cell2Net models and provides a valuable resource for understanding immune cell gene regulation.

# The index only repeats each gene's per-file row numbers, so leave it out of the CSV
df.to_csv(f"{out_dir}/tf_to_gene.csv", index=False)
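
As a first pass at the master-regulator question listed above, one simple proxy is the number of distinct target genes for which a TF appears in the top-ranked list. The sketch below uses four rows taken from the displayed output; counting top-list appearances is an assumed heuristic, not a method defined by Cell2Net.

```python
import pandas as pd

# Four rows from the displayed network table (columns as in tf_to_gene.csv).
net = pd.DataFrame(
    {
        "tf": ["KLF2", "JUNB", "ARID3A", "JUNB"],
        "gene": ["ISG15", "ISG15", "CLIC2", "CLIC2"],
        "cell_type_v2": ["B cell", "B cell", "pDC", "pDC"],
        "mean_attr": [1.415844, 0.691353, 0.167850, 0.141468],
    }
)

# Breadth of influence: distinct target genes per TF.
n_targets = net.groupby("tf")["gene"].nunique().sort_values(ascending=False)

# Context breadth: distinct cell types in which each TF appears.
n_cell_types = net.groupby("tf")["cell_type_v2"].nunique()

print(n_targets)
print(n_cell_types)
```

On the full table the same two lines apply unchanged after `pd.read_csv`. Because only the top TFs per gene and cell type are stored, these counts reflect presence in the top-ranked lists and not every nonzero attribution.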

Analysis Complete#

The Cell2Net transcription factor interpretation analysis is now complete! The interpreted regulatory networks provide mechanistic insights into how Cell2Net learned immune cell gene regulation and offer valuable hypotheses for experimental validation and therapeutic development.