Processing peak-to-gene links for method comparison#

This tutorial demonstrates how to process and standardize peak-to-gene predictions from different computational methods for comparative evaluation. We’ll harmonize outputs from multiple approaches including Signac, SCENT, SCARlink, and Cell2Net to enable fair benchmarking.

import os
import pandas as pd
import mudata as md
import numpy as np

md.set_options(pull_on_update=False)

<mudata._core.config.set_options at 0x7efe48eef770>

Setup Output Directory#

Create directory to store processed and standardized results for evaluation.

out_dir = "./05_process_p2g"

os.makedirs(out_dir, exist_ok=True)

Load Multiome Data and Gene Annotations#

Import the preprocessed multiome data to extract gene TSS coordinates for evaluation.

Create standardized gene regions centered around transcription start sites (TSS) for consistent evaluation across methods. We use ±250bp windows around each TSS to define gene regions.

mdata = md.read_h5mu("./02_prepare_data/mdata.h5mu")
df_genes = mdata["rna"].uns["gene_tss_coord"]
df_genes["start"] = df_genes["tss"] - 250
df_genes["end"] = df_genes["tss"] + 250
df_genes.rename(columns={'gene_name': 'gene'}, inplace=True)
df_genes = df_genes[['gene', 'chrom', 'start', 'end']]

Let’s examine the gene annotation structure:

df_genes.head()

	gene	chrom	start	end
0	LINC01409	chr1	778497	778997
1	LINC01128	chr1	824888	825388
2	LINC00115	chr1	827272	827772
3	SAMD11	chr1	923678	924178
4	NOC2L	chr1	959059	959559

Load Peak-to-Gene Predictions from Different Methods#

Import the results from four different computational approaches for peak-to-gene prediction. Each method uses different algorithms and may have different output formats.

# read peak-to-gene links from Signac, SCENT, SCARlink, and cell2net
df_signac = pd.read_csv(f"./signac.csv")
df_scent = pd.read_csv(f"./scent.csv")
df_scarlink = pd.read_csv(f"./scarlink.csv")
df_cell2net = pd.read_csv(f"./cell2net.csv", index_col=0)

df_signac.head(n=2)

	seqnames	start	end	width	strand	score	gene	peak	zscore	pvalue
0	chr1	812689	825138	12450	*	0.014735	LINC01128	chr1-812439-812939	0.453888	0.324955
1	chr1	812689	827522	14834	*	0.013118	LINC00115	chr1-812439-812939	0.527889	0.298788

df_scent.head(n=2)

	gene	peak	beta	se	z	p
0	PLEKHN1	chr1-966191-966691	0.521073	0.248032	2.100832	0.035656
1	PLEKHN1	chr1-966760-967260	0.198338	0.233941	0.847813	0.396542

df_scarlink.head(n=2)

	chr	start	end	test_acc	test_acc_sparsity	regression_coef	cell_type	z-score	p-value	gene	Spearman corr	peak
0	chr1	466000	466500	0.0	0.0	0.004332	K562	0.0	1.0	PLEKHN1	-0.0455	chr1-466000-466000
1	chr1	466500	467000	0.0	0.0	0.015312	K562	0.0	1.0	PLEKHN1	-0.0455	chr1-466500-466500

df_cell2net.head(n=5)

	peak	gene	mean_attr	std_attr	z_score	p_value
0	chr1-966191-966691	PLEKHN1	7.649530e-03	0.006065	2.423978	0.015352
1	chr1-966760-967260	PLEKHN1	6.669266e-03	0.005933	2.053130	0.040060
2	chr1-965473-965973	PLEKHN1	3.220085e-04	0.000438	-0.348127	0.727745
3	chr1-964791-965291	PLEKHN1	8.068260e-07	0.000117	-0.469642	0.638610
4	chr1-967816-968316	PLEKHN1	3.206868e-04	0.000391	-0.348627	0.727369

Standardize Column Names and Extract Relevant Fields#

Each method produces different column names and additional metadata. We extract the essential information (peak, gene, p-value/score) and standardize the naming convention.

df_signac = df_signac[["peak", "gene", "pvalue"]]
df_scent = df_scent[["peak", "gene", "p"]]
df_scarlink = df_scarlink[["peak", "gene", "p-value"]]
df_cell2net = df_cell2net[["peak", "gene", "p_value"]]

df_signac.rename(columns={'pvalue': 'score'}, inplace=True)
df_scent.rename(columns={'p': 'score'}, inplace=True)
df_scarlink.rename(columns={'p-value': 'score'}, inplace=True)
df_cell2net.rename(columns={'p_value': 'score'}, inplace=True)

Convert P-values to Comparable Scores#

Transform p-values to -log10 scale for better interpretability and comparison across methods. Higher scores indicate more significant peak-gene associations. We add a small constant (1e-300) to avoid numerical issues with log(0).

df_signac["score"] = -np.log10(df_signac["score"] + 1e-300)  # add small constant to avoid log(0)
df_scent["score"] = -np.log10(df_scent["score"] + 1e-300)  # add small constant to avoid log(0)
df_scarlink["score"] = -np.log10(df_scarlink["score"] + 1e-300)  # add small constant to avoid log(0)
df_cell2net["score"] = -np.log10(df_cell2net["score"] + 1e-300)  # add small constant to avoid log(0)

Identify Common Genes Across All Methods#

Find genes that are present in all four methods to ensure fair comparison. This filtering step is crucial for benchmarking since each method may analyze different subsets of genes.

common_genes = list(set(df_signac["gene"]) & 
                    set(df_scent["gene"]) &
                    set(df_scarlink["gene"]) &
                    set(df_cell2net["gene"]) &
                    set(df_genes["gene"]))
len(common_genes)

Filter to Common Gene Set#

Restrict all datasets to only include the genes that are present across all methods for fair comparison.

df_signac = df_signac[df_signac["gene"].isin(common_genes)].reset_index(drop=True)
df_scent = df_scent[df_scent["gene"].isin(common_genes)].reset_index(drop=True)
df_scarlink = df_scarlink[df_scarlink["gene"].isin(common_genes)].reset_index(drop=True)
df_cell2net = df_cell2net[df_cell2net["gene"].isin(common_genes)].reset_index(drop=True)

Add Gene TSS Coordinates#

Merge gene TSS coordinates with each dataset to enable consistent evaluation based on genomic distances and overlaps.

# merge gene tss
df_signac = pd.merge(df_signac, df_genes, on="gene").reset_index(drop=True)
df_scent = pd.merge(df_scent, df_genes, on="gene").reset_index(drop=True)
df_scarlink = pd.merge(df_scarlink, df_genes, on="gene").reset_index(drop=True)
df_cell2net = pd.merge(df_cell2net, df_genes, on="gene").reset_index(drop=True)

Standardize Peak and Gene Coordinate Formats#

Create two standardized coordinate formats:

Peak1: Peak coordinates in format chr_start_end (replacing hyphens with underscores)
Peak2: Gene TSS regions in format chr_start_end for consistent evaluation

This standardization ensures all methods use identical coordinate systems for fair comparison.

df_signac["Peak1"] = df_signac["peak"].str.replace("-", "_")
df_signac["Peak2"] = df_signac["chrom"] + "_" + df_signac["start"].astype(str) + "_" + df_signac["end"].astype(str)

df_scent["Peak1"] = df_scent["peak"].str.replace("-", "_")
df_scent["Peak2"] = df_scent["chrom"] + "_" + df_scent["start"].astype(str) + "_" + df_scent["end"].astype(str)

df_scarlink["Peak1"] = df_scarlink["peak"].str.replace("-", "_")
df_scarlink["Peak2"] = df_scarlink["chrom"] + "_" + df_scarlink["start"].astype(str) + "_" + df_scarlink["end"].astype(str)

df_cell2net["Peak1"] = df_cell2net["peak"].str.replace("-", "_")
df_cell2net["Peak2"] = df_cell2net["chrom"] + "_" + df_cell2net["start"].astype(str) + "_" + df_cell2net["end"].astype(str)

Extract Final Columns for Evaluation#

Keep only the essential columns needed for benchmarking: standardized peak coordinates (Peak1), gene coordinates (Peak2), and significance scores.

df_signac = df_signac[['Peak1', 'Peak2', 'score']]
df_scent = df_scent[['Peak1', 'Peak2', 'score']]
df_scarlink = df_scarlink[['Peak1', 'Peak2', 'score']]
df_cell2net = df_cell2net[['Peak1', 'Peak2', 'score']]

Save Processed Results#

Export the standardized datasets in evaluation-ready format. Each file contains peak-to-gene predictions from one method with:

Peak1: Standardized peak coordinates
Peak2: Standardized gene TSS coordinates
score: -log10 transformed p-values for significance ranking

df_signac.to_csv(f"{out_dir}/signac.csv")
df_scent.to_csv(f"{out_dir}/scent.csv")
df_scarlink.to_csv(f"{out_dir}/scarlink.csv")
df_cell2net.to_csv(f"{out_dir}/cell2net.csv")