Peak-to-gene analysis with Cell2Net#

This tutorial demonstrates how to perform peak-to-gene analysis using trained Cell2Net models from the previous tutorial. We’ll compute attribution scores to identify which chromatin accessibility peaks contribute most to gene expression predictions.

Overview#

Peak-to-gene analysis is a crucial step in understanding regulatory relationships in single-cell multiome data. Cell2Net provides interpretable predictions by:

  1. Computing attribution scores for each peak-gene pair using integrated gradients

  2. Quantifying regulatory importance of individual accessibility peaks

  3. Identifying distal regulatory elements that influence gene expression

  4. Creating interpretable regulatory networks from attention mechanisms

import os
import warnings

warnings.filterwarnings("ignore")

import numpy as np
import mudata as md
import pandas as pd
from tqdm import tqdm
import cell2net as cn
md.set_options(pull_on_update=False)
<mudata._core.config.set_options at 0x7fbaf9e46f60>

Setup Directories and Paths#

Configure paths to access the preprocessed data and trained models from previous tutorials.

data_dir = "./02_prepare_data/mdata.h5mu"
in_dir = "./03_train"
out_dir = "./04_peak_to_gene"

os.makedirs(out_dir, exist_ok=True)

Setup Directories and Paths#

Configure paths to access the preprocessed data and trained models from previous tutorials.

mdata_bulk = md.read_h5mu(data_dir)

Extract Target Genes#

Extract the list of genes for which we have trained Cell2Net models in the previous tutorial.

genes = mdata_bulk.uns['peak_to_gene']['gene'].unique().tolist()

Check how many genes we’ll analyze:

len(genes)
1608

Compute Peak Attribution Scores#

This is the main analysis loop where we:

  1. Load trained models for each gene from the previous tutorial

  2. Compute peak attribution scores using integrated gradients method

  3. Convert attributions to peak-to-gene links with statistical measures

  4. Save both raw attributions and processed results for each gene

Technical Details:#

  • Integrated Gradients: Computes attribution by integrating gradients along a path from baseline to input

  • n_steps=50: Number of interpolation steps for gradient integration (higher = more accurate)

  • multiply_by_inputs=True: Multiplies gradients by input values for better interpretability

  • batch_size=2: Process 2 samples at a time to balance memory usage and speed

Output Files:#

  • {gene}.npy: Raw attribution scores for all peaks

  • {gene}.csv: Processed peak-to-gene links with statistical significance

This step typically takes 30-60 seconds per gene depending on the number of associated peaks.

for gene in tqdm(genes):
    if os.path.exists(f"{out_dir}/{gene}.npy"):
        continue
    
    model = cn.pd.model.Cell2Net(mdata=mdata_bulk, 
                                 gene=gene, 
                                 covariates=['total_counts_rna_log', 'total_counts_atac_log'])

    model.load(dir_path=f"{in_dir}/model")
    model.to_device('cuda:0')
    
    peak_attr = cn.ip.peak_attr(model, batch_size=2, n_steps=50, multiply_by_inputs=True)
    df = cn.ip.peak_to_gene(model.mdata, peak_attr)

    np.save(f"{out_dir}/{gene}.npy", peak_attr)
    df.to_csv(f"{out_dir}/{gene}.csv", index=False)

Aggregate Peak-to-Gene Results#

Now we combine the individual gene results into a comprehensive peak-to-gene network for the entire dataset.

df_list = []
for gene in genes:
    df = pd.read_csv(f"{out_dir}/{gene}.csv")

    df_list.append(df)
df_p2g = pd.concat(df_list).reset_index(drop=True)

Let’s examine the structure of our peak-to-gene network:

df_p2g
peak gene mean_attr std_attr z_score p_value
0 chr1-966191-966691 PLEKHN1 7.649530e-03 0.006065 2.423978 0.015352
1 chr1-966760-967260 PLEKHN1 6.669266e-03 0.005933 2.053130 0.040060
2 chr1-965473-965973 PLEKHN1 3.220085e-04 0.000438 -0.348127 0.727745
3 chr1-964791-965291 PLEKHN1 8.068260e-07 0.000117 -0.469642 0.638610
4 chr1-967816-968316 PLEKHN1 3.206868e-04 0.000391 -0.348627 0.727369
... ... ... ... ... ... ...
128319 chrX-154308282-154308782 GAB3 1.091376e-03 0.001618 -0.218409 0.827110
128320 chrX-155215382-155215882 GAB3 -2.899780e-05 0.000866 -0.358034 0.720318
128321 chrX-155216128-155216628 GAB3 2.668623e-03 0.005609 -0.021848 0.982569
128322 chrX-155216639-155217139 GAB3 -5.110913e-04 0.004339 -0.418114 0.675864
128323 chrX-155218743-155219243 GAB3 2.417117e-05 0.000726 -0.351408 0.725283

128324 rows × 6 columns

Save Final Results#

Save the comprehensive peak-to-gene network for downstream analysis and visualization.

df_p2g.to_csv(f'{out_dir}/peak_to_gene.csv')

Understanding the Results#

The peak-to-gene analysis produces several key outputs:

Peak Attribution Scores (.npy files):#

  • Raw attribution values for each peak-gene pair

  • Positive values: Peaks that increase gene expression predictions

  • Negative values: Peaks that decrease gene expression predictions

  • Magnitude: Indicates the strength of the regulatory relationship

Comprehensive Network (peak_to_gene.csv):#

  • Genome-wide regulatory map combining all gene-specific results

  • Filterable by significance to focus on high-confidence links

  • Ready for network analysis and visualization tools

  • Compatible with pathway analysis and functional enrichment

Next Steps#

With these results, you can:

  1. Visualize regulatory networks using network analysis tools

  2. Identify key regulatory hubs and super-enhancer regions

  3. Compare with experimental data (ChIP-seq, Hi-C, etc.)

  4. Perform functional enrichment of target genes

  5. Investigate cell-type-specific regulation across conditions

  6. Build predictive models of gene expression from chromatin state

The interpretable peak-to-gene links provide a foundation for understanding regulatory mechanisms and designing targeted perturbation experiments.