Skip to main content
Ctrl+K

Cell2net 0.13 documentation

  • Installation
  • Tutorials
  • cell2net
  • Changelog
  • References
  • GitHub
  • Installation
  • Tutorials
  • cell2net
  • Changelog
  • References
  • GitHub

Section Navigation

  • K562
    • Pretrain Sequence Encoder for K562
    • Prepare data for training Cell2net model
    • Training Cell2Net models for K562
    • Peak-to-gene analysis with Cell2Net
    • Processing peak-to-gene links for method comparison
    • Evaluating Peak-to-Gene Predictions with Experimental Validation
  • PBMC
    • Pretrain Sequence Encoder for PBMC
    • Prepare Data for Cell2Net Training on PBMC
    • Training Cell2net for PBMC Dataset
    • Cell2Net Model Evaluation with Correlation Analysis
    • Cell2Net Model Interpretation: Transcription Factor Attribution
    • Cell2Net Transcription Factor Activity Analysis
  • Tutorials
  • K562
  • Processing peak-to-gene links for method comparison

Processing peak-to-gene links for method comparison#

This tutorial demonstrates how to process and standardize peak-to-gene predictions from different computational methods for comparative evaluation. We’ll harmonize outputs from multiple approaches including Signac, SCENT, SCARlink, and Cell2Net to enable fair benchmarking.

import os
import pandas as pd
import mudata as md
import numpy as np

md.set_options(pull_on_update=False)
<mudata._core.config.set_options at 0x7efe48eef770>

Setup Output Directory#

Create directory to store processed and standardized results for evaluation.

out_dir = "./05_process_p2g"

os.makedirs(out_dir, exist_ok=True)

Load Multiome Data and Gene Annotations#

Import the preprocessed multiome data to extract gene TSS coordinates for evaluation.

Create standardized gene regions centered around transcription start sites (TSS) for consistent evaluation across methods. We use ±250bp windows around each TSS to define gene regions.

mdata = md.read_h5mu("./02_prepare_data/mdata.h5mu")
df_genes = mdata["rna"].uns["gene_tss_coord"]
df_genes["start"] = df_genes["tss"] - 250
df_genes["end"] = df_genes["tss"] + 250
df_genes.rename(columns={'gene_name': 'gene'}, inplace=True)
df_genes = df_genes[['gene', 'chrom', 'start', 'end']]

Let’s examine the gene annotation structure:

df_genes.head()
gene chrom start end
0 LINC01409 chr1 778497 778997
1 LINC01128 chr1 824888 825388
2 LINC00115 chr1 827272 827772
3 SAMD11 chr1 923678 924178
4 NOC2L chr1 959059 959559

Load Peak-to-Gene Predictions from Different Methods#

Import the results from four different computational approaches for peak-to-gene prediction. Each method uses different algorithms and may have different output formats.

# read peak-to-gene links from Signac, SCENT, SCARlink, and cell2net
df_signac = pd.read_csv(f"./signac.csv")
df_scent = pd.read_csv(f"./scent.csv")
df_scarlink = pd.read_csv(f"./scarlink.csv")
df_cell2net = pd.read_csv(f"./cell2net.csv", index_col=0)
df_signac.head(n=2)
seqnames start end width strand score gene peak zscore pvalue
0 chr1 812689 825138 12450 * 0.014735 LINC01128 chr1-812439-812939 0.453888 0.324955
1 chr1 812689 827522 14834 * 0.013118 LINC00115 chr1-812439-812939 0.527889 0.298788
df_scent.head(n=2)
gene peak beta se z p
0 PLEKHN1 chr1-966191-966691 0.521073 0.248032 2.100832 0.035656
1 PLEKHN1 chr1-966760-967260 0.198338 0.233941 0.847813 0.396542
df_scarlink.head(n=2)
chr start end test_acc test_acc_sparsity regression_coef cell_type z-score p-value gene Spearman corr peak
0 chr1 466000 466500 0.0 0.0 0.004332 K562 0.0 1.0 PLEKHN1 -0.0455 chr1-466000-466000
1 chr1 466500 467000 0.0 0.0 0.015312 K562 0.0 1.0 PLEKHN1 -0.0455 chr1-466500-466500
df_cell2net.head(n=5)
peak gene mean_attr std_attr z_score p_value
0 chr1-966191-966691 PLEKHN1 7.649530e-03 0.006065 2.423978 0.015352
1 chr1-966760-967260 PLEKHN1 6.669266e-03 0.005933 2.053130 0.040060
2 chr1-965473-965973 PLEKHN1 3.220085e-04 0.000438 -0.348127 0.727745
3 chr1-964791-965291 PLEKHN1 8.068260e-07 0.000117 -0.469642 0.638610
4 chr1-967816-968316 PLEKHN1 3.206868e-04 0.000391 -0.348627 0.727369

Standardize Column Names and Extract Relevant Fields#

Each method produces different column names and additional metadata. We extract the essential information (peak, gene, p-value/score) and standardize the naming convention.

df_signac = df_signac[["peak", "gene", "pvalue"]]
df_scent = df_scent[["peak", "gene", "p"]]
df_scarlink = df_scarlink[["peak", "gene", "p-value"]]
df_cell2net = df_cell2net[["peak", "gene", "p_value"]]

df_signac.rename(columns={'pvalue': 'score'}, inplace=True)
df_scent.rename(columns={'p': 'score'}, inplace=True)
df_scarlink.rename(columns={'p-value': 'score'}, inplace=True)
df_cell2net.rename(columns={'p_value': 'score'}, inplace=True)

Convert P-values to Comparable Scores#

Transform p-values to -log10 scale for better interpretability and comparison across methods. Higher scores indicate more significant peak-gene associations. We add a small constant (1e-300) to avoid numerical issues with log(0).

df_signac["score"] = -np.log10(df_signac["score"] + 1e-300)  # add small constant to avoid log(0)
df_scent["score"] = -np.log10(df_scent["score"] + 1e-300)  # add small constant to avoid log(0)
df_scarlink["score"] = -np.log10(df_scarlink["score"] + 1e-300)  # add small constant to avoid log(0)
df_cell2net["score"] = -np.log10(df_cell2net["score"] + 1e-300)  # add small constant to avoid log(0)

Identify Common Genes Across All Methods#

Find genes that are present in all four methods to ensure fair comparison. This filtering step is crucial for benchmarking since each method may analyze different subsets of genes.

common_genes = list(set(df_signac["gene"]) & 
                    set(df_scent["gene"]) &
                    set(df_scarlink["gene"]) &
                    set(df_cell2net["gene"]) &
                    set(df_genes["gene"]))
len(common_genes)
1458

Filter to Common Gene Set#

Restrict all datasets to only include the genes that are present across all methods for fair comparison.

df_signac = df_signac[df_signac["gene"].isin(common_genes)].reset_index(drop=True)
df_scent = df_scent[df_scent["gene"].isin(common_genes)].reset_index(drop=True)
df_scarlink = df_scarlink[df_scarlink["gene"].isin(common_genes)].reset_index(drop=True)
df_cell2net = df_cell2net[df_cell2net["gene"].isin(common_genes)].reset_index(drop=True)

Add Gene TSS Coordinates#

Merge gene TSS coordinates with each dataset to enable consistent evaluation based on genomic distances and overlaps.

# merge gene tss
df_signac = pd.merge(df_signac, df_genes, on="gene").reset_index(drop=True)
df_scent = pd.merge(df_scent, df_genes, on="gene").reset_index(drop=True)
df_scarlink = pd.merge(df_scarlink, df_genes, on="gene").reset_index(drop=True)
df_cell2net = pd.merge(df_cell2net, df_genes, on="gene").reset_index(drop=True)

Standardize Peak and Gene Coordinate Formats#

Create two standardized coordinate formats:

  • Peak1: Peak coordinates in format chr_start_end (replacing hyphens with underscores)

  • Peak2: Gene TSS regions in format chr_start_end for consistent evaluation

This standardization ensures all methods use identical coordinate systems for fair comparison.

df_signac["Peak1"] = df_signac["peak"].str.replace("-", "_")
df_signac["Peak2"] = df_signac["chrom"] + "_" + df_signac["start"].astype(str) + "_" + df_signac["end"].astype(str)

df_scent["Peak1"] = df_scent["peak"].str.replace("-", "_")
df_scent["Peak2"] = df_scent["chrom"] + "_" + df_scent["start"].astype(str) + "_" + df_scent["end"].astype(str)

df_scarlink["Peak1"] = df_scarlink["peak"].str.replace("-", "_")
df_scarlink["Peak2"] = df_scarlink["chrom"] + "_" + df_scarlink["start"].astype(str) + "_" + df_scarlink["end"].astype(str)

df_cell2net["Peak1"] = df_cell2net["peak"].str.replace("-", "_")
df_cell2net["Peak2"] = df_cell2net["chrom"] + "_" + df_cell2net["start"].astype(str) + "_" + df_cell2net["end"].astype(str)

Extract Final Columns for Evaluation#

Keep only the essential columns needed for benchmarking: standardized peak coordinates (Peak1), gene coordinates (Peak2), and significance scores.

df_signac = df_signac[['Peak1', 'Peak2', 'score']]
df_scent = df_scent[['Peak1', 'Peak2', 'score']]
df_scarlink = df_scarlink[['Peak1', 'Peak2', 'score']]
df_cell2net = df_cell2net[['Peak1', 'Peak2', 'score']]

Save Processed Results#

Export the standardized datasets in evaluation-ready format. Each file contains peak-to-gene predictions from one method with:

  • Peak1: Standardized peak coordinates

  • Peak2: Standardized gene TSS coordinates

  • score: -log10 transformed p-values for significance ranking

df_signac.to_csv(f"{out_dir}/signac.csv")
df_scent.to_csv(f"{out_dir}/scent.csv")
df_scarlink.to_csv(f"{out_dir}/scarlink.csv")
df_cell2net.to_csv(f"{out_dir}/cell2net.csv")

previous

Peak-to-gene analysis with Cell2Net

next

Evaluating Peak-to-Gene Predictions with Experimental Validation

On this page
  • Setup Output Directory
  • Load Multiome Data and Gene Annotations
  • Load Peak-to-Gene Predictions from Different Methods
  • Standardize Column Names and Extract Relevant Fields
  • Convert P-values to Comparable Scores
  • Identify Common Genes Across All Methods
  • Filter to Common Gene Set
  • Add Gene TSS Coordinates
  • Standardize Peak and Gene Coordinate Formats
  • Extract Final Columns for Evaluation
  • Save Processed Results
Edit on GitHub

© Copyright 2025, Zhijian Li..

Built with the PyData Sphinx Theme 0.16.1.