Peak-to-gene analysis with Cell2Net#

This tutorial demonstrates how to perform peak-to-gene analysis using trained Cell2Net models from the previous tutorial. We’ll compute attribution scores to identify which chromatin accessibility peaks contribute most to gene expression predictions.

Overview#

Peak-to-gene analysis is a crucial step in understanding regulatory relationships in single-cell multiome data. Cell2Net provides interpretable predictions by:

Computing attribution scores for each peak-gene pair using integrated gradients
Quantifying regulatory importance of individual accessibility peaks
Identifying distal regulatory elements that influence gene expression
Creating interpretable regulatory networks from attention mechanisms

import os
import warnings

warnings.filterwarnings("ignore")

import numpy as np
import mudata as md
import pandas as pd
from tqdm import tqdm
import cell2net as cn
md.set_options(pull_on_update=False)

<mudata._core.config.set_options at 0x7fbaf9e46f60>

Setup Directories and Paths#

Configure paths to access the preprocessed data and trained models from previous tutorials.

data_dir = "./02_prepare_data/mdata.h5mu"
in_dir = "./03_train"
out_dir = "./04_peak_to_gene"

os.makedirs(out_dir, exist_ok=True)

Setup Directories and Paths#

Configure paths to access the preprocessed data and trained models from previous tutorials.

mdata_bulk = md.read_h5mu(data_dir)

Extract Target Genes#

Extract the list of genes for which we have trained Cell2Net models in the previous tutorial.

genes = mdata_bulk.uns['peak_to_gene']['gene'].unique().tolist()

Check how many genes we’ll analyze:

len(genes)

Compute Peak Attribution Scores#

This is the main analysis loop where we:

Load trained models for each gene from the previous tutorial
Compute peak attribution scores using integrated gradients method
Convert attributions to peak-to-gene links with statistical measures
Save both raw attributions and processed results for each gene

Technical Details:#

Integrated Gradients: Computes attribution by integrating gradients along a path from baseline to input
n_steps=50: Number of interpolation steps for gradient integration (higher = more accurate)
multiply_by_inputs=True: Multiplies gradients by input values for better interpretability
batch_size=2: Process 2 samples at a time to balance memory usage and speed

Output Files:#

{gene}.npy: Raw attribution scores for all peaks
{gene}.csv: Processed peak-to-gene links with statistical significance

This step typically takes 30-60 seconds per gene depending on the number of associated peaks.

for gene in tqdm(genes):
    if os.path.exists(f"{out_dir}/{gene}.npy"):
        continue
    
    model = cn.pd.model.Cell2Net(mdata=mdata_bulk, 
                                 gene=gene, 
                                 covariates=['total_counts_rna_log', 'total_counts_atac_log'])

    model.load(dir_path=f"{in_dir}/model")
    model.to_device('cuda:0')
    
    peak_attr = cn.ip.peak_attr(model, batch_size=2, n_steps=50, multiply_by_inputs=True)
    df = cn.ip.peak_to_gene(model.mdata, peak_attr)

    np.save(f"{out_dir}/{gene}.npy", peak_attr)
    df.to_csv(f"{out_dir}/{gene}.csv", index=False)

Aggregate Peak-to-Gene Results#

Now we combine the individual gene results into a comprehensive peak-to-gene network for the entire dataset.

df_list = []
for gene in genes:
    df = pd.read_csv(f"{out_dir}/{gene}.csv")

    df_list.append(df)
df_p2g = pd.concat(df_list).reset_index(drop=True)

Let’s examine the structure of our peak-to-gene network:

df_p2g

	peak	gene	mean_attr	std_attr	z_score	p_value
0	chr1-966191-966691	PLEKHN1	7.649530e-03	0.006065	2.423978	0.015352
1	chr1-966760-967260	PLEKHN1	6.669266e-03	0.005933	2.053130	0.040060
2	chr1-965473-965973	PLEKHN1	3.220085e-04	0.000438	-0.348127	0.727745
3	chr1-964791-965291	PLEKHN1	8.068260e-07	0.000117	-0.469642	0.638610
4	chr1-967816-968316	PLEKHN1	3.206868e-04	0.000391	-0.348627	0.727369
...	...	...	...	...	...	...
128319	chrX-154308282-154308782	GAB3	1.091376e-03	0.001618	-0.218409	0.827110
128320	chrX-155215382-155215882	GAB3	-2.899780e-05	0.000866	-0.358034	0.720318
128321	chrX-155216128-155216628	GAB3	2.668623e-03	0.005609	-0.021848	0.982569
128322	chrX-155216639-155217139	GAB3	-5.110913e-04	0.004339	-0.418114	0.675864
128323	chrX-155218743-155219243	GAB3	2.417117e-05	0.000726	-0.351408	0.725283

128324 rows × 6 columns

Save Final Results#

Save the comprehensive peak-to-gene network for downstream analysis and visualization.

df_p2g.to_csv(f'{out_dir}/peak_to_gene.csv')

Understanding the Results#

The peak-to-gene analysis produces several key outputs:

Peak Attribution Scores (`.npy` files):#

Raw attribution values for each peak-gene pair
Positive values: Peaks that increase gene expression predictions
Negative values: Peaks that decrease gene expression predictions
Magnitude: Indicates the strength of the regulatory relationship

Peak-to-Gene Links (`.csv` files):#

Statistical significance: p-values and confidence intervals for each link
Effect sizes: Quantified regulatory impact of each peak
Genomic coordinates: Precise locations of regulatory elements
Distance information: Peak-to-TSS distances for regulatory context

Comprehensive Network (`peak_to_gene.csv`):#

Genome-wide regulatory map combining all gene-specific results
Filterable by significance to focus on high-confidence links
Ready for network analysis and visualization tools
Compatible with pathway analysis and functional enrichment

Next Steps#

With these results, you can:

Visualize regulatory networks using network analysis tools
Identify key regulatory hubs and super-enhancer regions
Compare with experimental data (ChIP-seq, Hi-C, etc.)
Perform functional enrichment of target genes
Investigate cell-type-specific regulation across conditions
Build predictive models of gene expression from chromatin state

The interpretable peak-to-gene links provide a foundation for understanding regulatory mechanisms and designing targeted perturbation experiments.

Peak-to-gene analysis with Cell2Net#

Overview#

Setup Directories and Paths#

Setup Directories and Paths#

Extract Target Genes#

Compute Peak Attribution Scores#

Technical Details:#

Output Files:#

Aggregate Peak-to-Gene Results#

Save Final Results#

Understanding the Results#

Peak Attribution Scores (.npy files):#

Peak-to-Gene Links (.csv files):#

Comprehensive Network (peak_to_gene.csv):#

Next Steps#

Peak Attribution Scores (`.npy` files):#

Peak-to-Gene Links (`.csv` files):#

Comprehensive Network (`peak_to_gene.csv`):#