cell2net.evaluation.causal_var_enrichment_in_peaks

cell2net.evaluation.causal_var_enrichment_in_peaks(df_p2g, causal_var, common_var, gene_gtf, ref_fasta, up_stream=500000, down_stream=500000)

Compute the enrichment of causal variants in peak regions linked to genes.

For each gene, this function measures how strongly causal variants are enriched in the peaks linked to that gene. It compares the proportion of causal variants that overlap the peaks with the proportion of common variants that overlap the peaks. Both proportions are taken relative to a background region, defined as a genomic window around the gene's transcription start site (TSS).

It can be used to evaluate the peak-to-gene links predicted by different models, such as Cell2net, SCARLink and SCENT.

Parameters:

df_p2g (DataFrame) – DataFrame containing peak-to-gene (P2G) link information. Expected columns: [‘peak’, ‘gene’]. The peak column should be formatted as ‘chrom-start-end’.
causal_var (DataFrame) – DataFrame containing causal variant information with columns [‘Chromosome’, ‘Start’, ‘End’, ‘Gene’]. Each row represents a link between a causal variant and its target gene.
common_var (DataFrame) – DataFrame containing common variant information with columns [‘Chromosome’, ‘Start’, ‘End’].
gene_gtf (str) – File path to the gene annotation GTF file, which is used to extract the TSS of each gene and build the background regions.
ref_fasta (str) – File path to the reference genome FASTA file.
up_stream (int (default: 500000)) – Number of base pairs upstream of the TSS to include in the background region.
down_stream (int (default: 500000)) – Number of base pairs downstream of the TSS to include in the background region.
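The input conventions above can be illustrated with a small sketch in plain Python. It parses a ‘chrom-start-end’ peak string and derives a background window around a TSS from up_stream and down_stream. The helper names parse_peak and background_window are hypothetical and not part of cell2net; the strand handling and the clipping at coordinate 0 are assumptions, since the page does not state how the library treats either.

```python
from typing import Tuple


def parse_peak(peak: str) -> Tuple[str, int, int]:
    """Split a 'chrom-start-end' string into (chrom, start, end).

    rsplit is used so that a chromosome name containing '-' stays intact.
    """
    chrom, start, end = peak.rsplit("-", 2)
    return chrom, int(start), int(end)


def background_window(
    tss: int,
    up_stream: int = 500_000,
    down_stream: int = 500_000,
    strand: str = "+",
) -> Tuple[int, int]:
    """Return (start, end) of a window around the TSS.

    Assumption: for '-' strand genes, "upstream" points toward higher
    coordinates. Assumption: the window is clipped at coordinate 0.
    """
    if strand == "-":
        start, end = tss - down_stream, tss + up_stream
    else:
        start, end = tss - up_stream, tss + down_stream
    return max(start, 0), end


peak = parse_peak("chr1-1000-2000")
window_default = background_window(600_000)
window_clipped = background_window(1_000)
window_minus = background_window(600_000, 100_000, 200_000, strand="-")
```

With the default 500000 bp on each side, a TSS at position 600000 gives a window of 100000 to 1100000, and a TSS close to the chromosome start is clipped at 0.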

Return type:

DataFrame

Returns:

A DataFrame summarizing the enrichment results for each gene with the following columns:

gene: Gene name.
n_causal_var_in_peak: Number of causal variants overlapping peaks.
n_causal_var_in_background: Number of causal variants within the background regions.
n_common_var_in_peak: Number of common variants overlapping peaks.
n_common_var_in_background: Number of common variants within the background regions.
enrichment: Enrichment score (ratio of causal-to-common variants in peaks vs. background regions).
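To make the four count columns concrete, here is a minimal sketch of how such counts could be obtained for a single gene. It uses plain Python instead of PyRanges and is not the library's implementation. The function names, the toy coordinates, and the half-open interval convention (start inclusive, end exclusive) are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# (chromosome, start, end); assumed half-open: start inclusive, end exclusive
Interval = Tuple[str, int, int]


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two intervals share a chromosome and at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def count_overlapping(variants: List[Interval], regions: List[Interval]) -> int:
    """Count variants that overlap at least one region."""
    return sum(any(overlaps(v, r) for r in regions) for v in variants)


# Toy inputs for one hypothetical gene
peaks = [("chr1", 1000, 2000)]        # peaks linked to the gene
background = [("chr1", 0, 501_000)]   # window around the gene's TSS
causal = [("chr1", 1500, 1501), ("chr1", 3500, 3501)]
common = [("chr1", 1600, 1601), ("chr1", 3600, 3601), ("chr2", 1600, 1601)]

counts: Dict[str, int] = {
    "n_causal_var_in_peak": count_overlapping(causal, peaks),
    "n_causal_var_in_background": count_overlapping(causal, background),
    "n_common_var_in_peak": count_overlapping(common, peaks),
    "n_common_var_in_background": count_overlapping(common, background),
}
```

In this toy case the chr2 variant falls outside both the peak and the background window, so it does not contribute to any count.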

Notes

The function utilizes PyRanges for efficient genomic range operations.
Variants not overlapping the background regions or peaks are excluded from the enrichment calculation.
If no common variants are found within a gene’s background regions, that gene is skipped.
Enrichment is computed as follows:

\(enrichment = \frac{n\_causal\_var\_in\_peak}{n\_causal\_var\_in\_background} / \frac{n\_common\_var\_in\_peak}{n\_common\_var\_in\_background}\)
When no common variants overlap peaks, or no causal variants overlap the background regions, a denominator in this expression is zero and the ratio is undefined.
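The enrichment formula can be written out as a short function of the four counts. This is a sketch, not the library's code: the function name is hypothetical, and returning NaN when a denominator is zero is an assumption, since this page does not say what value the library reports in that case.

```python
import math


def enrichment_score(
    n_causal_var_in_peak: int,
    n_causal_var_in_background: int,
    n_common_var_in_peak: int,
    n_common_var_in_background: int,
) -> float:
    """(causal in peak / causal in background) / (common in peak / common in background)."""
    # Assumption: any zero denominator yields NaN rather than raising an error.
    if (
        n_causal_var_in_background == 0
        or n_common_var_in_background == 0
        or n_common_var_in_peak == 0
    ):
        return math.nan
    causal_fraction = n_causal_var_in_peak / n_causal_var_in_background
    common_fraction = n_common_var_in_peak / n_common_var_in_background
    return causal_fraction / common_fraction


score_equal = enrichment_score(1, 1, 1, 1)      # both fractions are 1
score_enriched = enrichment_score(3, 4, 10, 40)  # 0.75 / 0.25
score_undefined = enrichment_score(1, 2, 0, 5)   # no common variants in peaks
```

A score above 1 means causal variants fall in the linked peaks more often than common variants do, relative to the same background window.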

Example

>>> import pandas as pd
>>> from cell2net.evaluation import causal_var_enrichment_in_peaks
>>> df_p2g = pd.DataFrame({
...     'peak': ['chr1-1000-2000', 'chr1-3000-4000'],
...     'gene': ['GeneA', 'GeneB']
... })
>>> causal_var = pd.DataFrame({
...     'Chromosome': ['chr1', 'chr1'],
...     'Start': [1500, 3500],
...     'End': [1501, 3501],
...     'Gene': ['GeneA', 'GeneB']
... })
>>> common_var = pd.DataFrame({
...     'Chromosome': ['chr1', 'chr1'],
...     'Start': [1600, 3600],
...     'End': [1601, 3601]
... })
>>> gene_gtf = "path/to/annotation.gtf"
>>> ref_fasta = "path/to/reference.fasta"
>>> result = causal_var_enrichment_in_peaks(
...     df_p2g, causal_var, common_var, gene_gtf, ref_fasta
... )
>>> print(result)
    gene  n_causal_var_in_peak  n_causal_var_in_background  n_common_var_in_peak  n_common_var_in_background  enrichment
0  GeneA                     1                           1                     1                           1    1.000000
1  GeneB                     1                           1                     1                           1    1.000000