Peak-to-gene analysis with Cell2Net#
This tutorial demonstrates how to perform peak-to-gene analysis using trained Cell2Net models from the previous tutorial. We’ll compute attribution scores to identify which chromatin accessibility peaks contribute most to gene expression predictions.
Overview#
Peak-to-gene analysis is a crucial step in understanding regulatory relationships in single-cell multiome data. Cell2Net provides interpretable predictions by:
Computing attribution scores for each peak-gene pair using integrated gradients
Quantifying regulatory importance of individual accessibility peaks
Identifying distal regulatory elements that influence gene expression
Creating interpretable regulatory networks from attention mechanisms
import os
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import mudata as md
import pandas as pd
from tqdm import tqdm
import cell2net as cn
md.set_options(pull_on_update=False)
<mudata._core.config.set_options at 0x7fbaf9e46f60>
Setup Directories and Paths#
Configure paths to access the preprocessed data and trained models from previous tutorials.
data_dir = "./02_prepare_data/mdata.h5mu"
in_dir = "./03_train"
out_dir = "./04_peak_to_gene"
os.makedirs(out_dir, exist_ok=True)
Setup Directories and Paths#
Configure paths to access the preprocessed data and trained models from previous tutorials.
mdata_bulk = md.read_h5mu(data_dir)
Extract Target Genes#
Extract the list of genes for which we have trained Cell2Net models in the previous tutorial.
genes = mdata_bulk.uns['peak_to_gene']['gene'].unique().tolist()
Check how many genes we’ll analyze:
len(genes)
1608
Compute Peak Attribution Scores#
This is the main analysis loop where we:
Load trained models for each gene from the previous tutorial
Compute peak attribution scores using integrated gradients method
Convert attributions to peak-to-gene links with statistical measures
Save both raw attributions and processed results for each gene
Technical Details:#
Integrated Gradients: Computes attribution by integrating gradients along a path from baseline to input
n_steps=50: Number of interpolation steps for gradient integration (higher = more accurate)
multiply_by_inputs=True: Multiplies gradients by input values for better interpretability
batch_size=2: Process 2 samples at a time to balance memory usage and speed
Output Files:#
{gene}.npy: Raw attribution scores for all peaks
{gene}.csv: Processed peak-to-gene links with statistical significance
This step typically takes 30-60 seconds per gene depending on the number of associated peaks.
for gene in tqdm(genes):
if os.path.exists(f"{out_dir}/{gene}.npy"):
continue
model = cn.pd.model.Cell2Net(mdata=mdata_bulk,
gene=gene,
covariates=['total_counts_rna_log', 'total_counts_atac_log'])
model.load(dir_path=f"{in_dir}/model")
model.to_device('cuda:0')
peak_attr = cn.ip.peak_attr(model, batch_size=2, n_steps=50, multiply_by_inputs=True)
df = cn.ip.peak_to_gene(model.mdata, peak_attr)
np.save(f"{out_dir}/{gene}.npy", peak_attr)
df.to_csv(f"{out_dir}/{gene}.csv", index=False)
Aggregate Peak-to-Gene Results#
Now we combine the individual gene results into a comprehensive peak-to-gene network for the entire dataset.
df_list = []
for gene in genes:
df = pd.read_csv(f"{out_dir}/{gene}.csv")
df_list.append(df)
df_p2g = pd.concat(df_list).reset_index(drop=True)
Let’s examine the structure of our peak-to-gene network:
df_p2g
| peak | gene | mean_attr | std_attr | z_score | p_value | |
|---|---|---|---|---|---|---|
| 0 | chr1-966191-966691 | PLEKHN1 | 7.649530e-03 | 0.006065 | 2.423978 | 0.015352 |
| 1 | chr1-966760-967260 | PLEKHN1 | 6.669266e-03 | 0.005933 | 2.053130 | 0.040060 |
| 2 | chr1-965473-965973 | PLEKHN1 | 3.220085e-04 | 0.000438 | -0.348127 | 0.727745 |
| 3 | chr1-964791-965291 | PLEKHN1 | 8.068260e-07 | 0.000117 | -0.469642 | 0.638610 |
| 4 | chr1-967816-968316 | PLEKHN1 | 3.206868e-04 | 0.000391 | -0.348627 | 0.727369 |
| ... | ... | ... | ... | ... | ... | ... |
| 128319 | chrX-154308282-154308782 | GAB3 | 1.091376e-03 | 0.001618 | -0.218409 | 0.827110 |
| 128320 | chrX-155215382-155215882 | GAB3 | -2.899780e-05 | 0.000866 | -0.358034 | 0.720318 |
| 128321 | chrX-155216128-155216628 | GAB3 | 2.668623e-03 | 0.005609 | -0.021848 | 0.982569 |
| 128322 | chrX-155216639-155217139 | GAB3 | -5.110913e-04 | 0.004339 | -0.418114 | 0.675864 |
| 128323 | chrX-155218743-155219243 | GAB3 | 2.417117e-05 | 0.000726 | -0.351408 | 0.725283 |
128324 rows × 6 columns
Save Final Results#
Save the comprehensive peak-to-gene network for downstream analysis and visualization.
df_p2g.to_csv(f'{out_dir}/peak_to_gene.csv')
Understanding the Results#
The peak-to-gene analysis produces several key outputs:
Peak Attribution Scores (.npy files):#
Raw attribution values for each peak-gene pair
Positive values: Peaks that increase gene expression predictions
Negative values: Peaks that decrease gene expression predictions
Magnitude: Indicates the strength of the regulatory relationship
Peak-to-Gene Links (.csv files):#
Statistical significance: p-values and confidence intervals for each link
Effect sizes: Quantified regulatory impact of each peak
Genomic coordinates: Precise locations of regulatory elements
Distance information: Peak-to-TSS distances for regulatory context
Comprehensive Network (peak_to_gene.csv):#
Genome-wide regulatory map combining all gene-specific results
Filterable by significance to focus on high-confidence links
Ready for network analysis and visualization tools
Compatible with pathway analysis and functional enrichment
Next Steps#
With these results, you can:
Visualize regulatory networks using network analysis tools
Identify key regulatory hubs and super-enhancer regions
Compare with experimental data (ChIP-seq, Hi-C, etc.)
Perform functional enrichment of target genes
Investigate cell-type-specific regulation across conditions
Build predictive models of gene expression from chromatin state
The interpretable peak-to-gene links provide a foundation for understanding regulatory mechanisms and designing targeted perturbation experiments.