Processing peak-to-gene links for method comparison#
This tutorial demonstrates how to process and standardize peak-to-gene predictions from different computational methods for comparative evaluation. We’ll harmonize outputs from multiple approaches including Signac, SCENT, SCARlink, and Cell2Net to enable fair benchmarking.
import os
import pandas as pd
import mudata as md
import numpy as np
md.set_options(pull_on_update=False)
<mudata._core.config.set_options at 0x7efe48eef770>
Setup Output Directory#
Create directory to store processed and standardized results for evaluation.
out_dir = "./05_process_p2g"
os.makedirs(out_dir, exist_ok=True)
Load Multiome Data and Gene Annotations#
Import the preprocessed multiome data to extract gene TSS coordinates for evaluation.
Create standardized gene regions centered around transcription start sites (TSS) for consistent evaluation across methods. We use ±250bp windows around each TSS to define gene regions.
mdata = md.read_h5mu("./02_prepare_data/mdata.h5mu")
df_genes = mdata["rna"].uns["gene_tss_coord"]
df_genes["start"] = df_genes["tss"] - 250
df_genes["end"] = df_genes["tss"] + 250
df_genes.rename(columns={'gene_name': 'gene'}, inplace=True)
df_genes = df_genes[['gene', 'chrom', 'start', 'end']]
Let’s examine the gene annotation structure:
df_genes.head()
| gene | chrom | start | end | |
|---|---|---|---|---|
| 0 | LINC01409 | chr1 | 778497 | 778997 |
| 1 | LINC01128 | chr1 | 824888 | 825388 |
| 2 | LINC00115 | chr1 | 827272 | 827772 |
| 3 | SAMD11 | chr1 | 923678 | 924178 |
| 4 | NOC2L | chr1 | 959059 | 959559 |
Load Peak-to-Gene Predictions from Different Methods#
Import the results from four different computational approaches for peak-to-gene prediction. Each method uses different algorithms and may have different output formats.
# read peak-to-gene links from Signac, SCENT, SCARlink, and cell2net
df_signac = pd.read_csv(f"./signac.csv")
df_scent = pd.read_csv(f"./scent.csv")
df_scarlink = pd.read_csv(f"./scarlink.csv")
df_cell2net = pd.read_csv(f"./cell2net.csv", index_col=0)
df_signac.head(n=2)
| seqnames | start | end | width | strand | score | gene | peak | zscore | pvalue | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | chr1 | 812689 | 825138 | 12450 | * | 0.014735 | LINC01128 | chr1-812439-812939 | 0.453888 | 0.324955 |
| 1 | chr1 | 812689 | 827522 | 14834 | * | 0.013118 | LINC00115 | chr1-812439-812939 | 0.527889 | 0.298788 |
df_scent.head(n=2)
| gene | peak | beta | se | z | p | |
|---|---|---|---|---|---|---|
| 0 | PLEKHN1 | chr1-966191-966691 | 0.521073 | 0.248032 | 2.100832 | 0.035656 |
| 1 | PLEKHN1 | chr1-966760-967260 | 0.198338 | 0.233941 | 0.847813 | 0.396542 |
df_scarlink.head(n=2)
| chr | start | end | test_acc | test_acc_sparsity | regression_coef | cell_type | z-score | p-value | gene | Spearman corr | peak | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | chr1 | 466000 | 466500 | 0.0 | 0.0 | 0.004332 | K562 | 0.0 | 1.0 | PLEKHN1 | -0.0455 | chr1-466000-466000 |
| 1 | chr1 | 466500 | 467000 | 0.0 | 0.0 | 0.015312 | K562 | 0.0 | 1.0 | PLEKHN1 | -0.0455 | chr1-466500-466500 |
df_cell2net.head(n=5)
| peak | gene | mean_attr | std_attr | z_score | p_value | |
|---|---|---|---|---|---|---|
| 0 | chr1-966191-966691 | PLEKHN1 | 7.649530e-03 | 0.006065 | 2.423978 | 0.015352 |
| 1 | chr1-966760-967260 | PLEKHN1 | 6.669266e-03 | 0.005933 | 2.053130 | 0.040060 |
| 2 | chr1-965473-965973 | PLEKHN1 | 3.220085e-04 | 0.000438 | -0.348127 | 0.727745 |
| 3 | chr1-964791-965291 | PLEKHN1 | 8.068260e-07 | 0.000117 | -0.469642 | 0.638610 |
| 4 | chr1-967816-968316 | PLEKHN1 | 3.206868e-04 | 0.000391 | -0.348627 | 0.727369 |
Standardize Column Names and Extract Relevant Fields#
Each method produces different column names and additional metadata. We extract the essential information (peak, gene, p-value/score) and standardize the naming convention.
df_signac = df_signac[["peak", "gene", "pvalue"]]
df_scent = df_scent[["peak", "gene", "p"]]
df_scarlink = df_scarlink[["peak", "gene", "p-value"]]
df_cell2net = df_cell2net[["peak", "gene", "p_value"]]
df_signac.rename(columns={'pvalue': 'score'}, inplace=True)
df_scent.rename(columns={'p': 'score'}, inplace=True)
df_scarlink.rename(columns={'p-value': 'score'}, inplace=True)
df_cell2net.rename(columns={'p_value': 'score'}, inplace=True)
Convert P-values to Comparable Scores#
Transform p-values to -log10 scale for better interpretability and comparison across methods. Higher scores indicate more significant peak-gene associations. We add a small constant (1e-300) to avoid numerical issues with log(0).
df_signac["score"] = -np.log10(df_signac["score"] + 1e-300) # add small constant to avoid log(0)
df_scent["score"] = -np.log10(df_scent["score"] + 1e-300) # add small constant to avoid log(0)
df_scarlink["score"] = -np.log10(df_scarlink["score"] + 1e-300) # add small constant to avoid log(0)
df_cell2net["score"] = -np.log10(df_cell2net["score"] + 1e-300) # add small constant to avoid log(0)
Identify Common Genes Across All Methods#
Find genes that are present in all four methods to ensure fair comparison. This filtering step is crucial for benchmarking since each method may analyze different subsets of genes.
common_genes = list(set(df_signac["gene"]) &
set(df_scent["gene"]) &
set(df_scarlink["gene"]) &
set(df_cell2net["gene"]) &
set(df_genes["gene"]))
len(common_genes)
1458
Filter to Common Gene Set#
Restrict all datasets to only include the genes that are present across all methods for fair comparison.
df_signac = df_signac[df_signac["gene"].isin(common_genes)].reset_index(drop=True)
df_scent = df_scent[df_scent["gene"].isin(common_genes)].reset_index(drop=True)
df_scarlink = df_scarlink[df_scarlink["gene"].isin(common_genes)].reset_index(drop=True)
df_cell2net = df_cell2net[df_cell2net["gene"].isin(common_genes)].reset_index(drop=True)
Add Gene TSS Coordinates#
Merge gene TSS coordinates with each dataset to enable consistent evaluation based on genomic distances and overlaps.
# merge gene tss
df_signac = pd.merge(df_signac, df_genes, on="gene").reset_index(drop=True)
df_scent = pd.merge(df_scent, df_genes, on="gene").reset_index(drop=True)
df_scarlink = pd.merge(df_scarlink, df_genes, on="gene").reset_index(drop=True)
df_cell2net = pd.merge(df_cell2net, df_genes, on="gene").reset_index(drop=True)
Standardize Peak and Gene Coordinate Formats#
Create two standardized coordinate formats:
Peak1: Peak coordinates in format
chr_start_end(replacing hyphens with underscores)Peak2: Gene TSS regions in format
chr_start_endfor consistent evaluation
This standardization ensures all methods use identical coordinate systems for fair comparison.
df_signac["Peak1"] = df_signac["peak"].str.replace("-", "_")
df_signac["Peak2"] = df_signac["chrom"] + "_" + df_signac["start"].astype(str) + "_" + df_signac["end"].astype(str)
df_scent["Peak1"] = df_scent["peak"].str.replace("-", "_")
df_scent["Peak2"] = df_scent["chrom"] + "_" + df_scent["start"].astype(str) + "_" + df_scent["end"].astype(str)
df_scarlink["Peak1"] = df_scarlink["peak"].str.replace("-", "_")
df_scarlink["Peak2"] = df_scarlink["chrom"] + "_" + df_scarlink["start"].astype(str) + "_" + df_scarlink["end"].astype(str)
df_cell2net["Peak1"] = df_cell2net["peak"].str.replace("-", "_")
df_cell2net["Peak2"] = df_cell2net["chrom"] + "_" + df_cell2net["start"].astype(str) + "_" + df_cell2net["end"].astype(str)
Extract Final Columns for Evaluation#
Keep only the essential columns needed for benchmarking: standardized peak coordinates (Peak1), gene coordinates (Peak2), and significance scores.
df_signac = df_signac[['Peak1', 'Peak2', 'score']]
df_scent = df_scent[['Peak1', 'Peak2', 'score']]
df_scarlink = df_scarlink[['Peak1', 'Peak2', 'score']]
df_cell2net = df_cell2net[['Peak1', 'Peak2', 'score']]
Save Processed Results#
Export the standardized datasets in evaluation-ready format. Each file contains peak-to-gene predictions from one method with:
Peak1: Standardized peak coordinates
Peak2: Standardized gene TSS coordinates
score: -log10 transformed p-values for significance ranking
df_signac.to_csv(f"{out_dir}/signac.csv")
df_scent.to_csv(f"{out_dir}/scent.csv")
df_scarlink.to_csv(f"{out_dir}/scarlink.csv")
df_cell2net.to_csv(f"{out_dir}/cell2net.csv")