cell2net.evaluation.causal_var_enrichment_in_peaks

cell2net.evaluation.causal_var_enrichment_in_peaks(df_p2g, causal_var, common_var, gene_gtf, ref_fasta, up_stream=500000, down_stream=500000)

Compute the enrichment of causal variants in peak regions linked to genes.

For each gene, this function compares the proportion of causal variants that overlap the gene's linked peaks with the proportion of common variants that overlap those peaks. Both proportions are computed relative to a background region, defined as a genomic window around the gene's transcription start site (TSS).
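The comparison described above can be illustrated with a minimal pure-Python sketch. It is not the library's implementation: `background_window` and `count_overlaps` are hypothetical helpers, the strand handling and half-open intervals are assumptions, and the coordinates are made up for illustration.

```python
def background_window(tss, up_stream=500_000, down_stream=500_000, strand="+"):
    """Return the half-open (start, end) window around a TSS, clipped at 0."""
    if strand == "-":
        # Assumption: on the minus strand, upstream lies at higher coordinates.
        start, end = tss - down_stream, tss + up_stream
    else:
        start, end = tss - up_stream, tss + down_stream
    return max(start, 0), end


def count_overlaps(variants, intervals):
    """Count variants (start, end) that overlap at least one interval."""
    return sum(
        any(v_start < i_end and i_start < v_end for i_start, i_end in intervals)
        for v_start, v_end in variants
    )


# Hypothetical coordinates for one gene on a single chromosome.
window = background_window(tss=600_000)
peaks = [(590_000, 591_000)]
causal = [(590_500, 590_501), (700_000, 700_001)]
common = [(590_100, 590_101), (650_000, 650_001), (2_000_000, 2_000_001)]

counts = {
    "n_causal_var_in_peak": count_overlaps(causal, peaks),
    "n_causal_var_in_background": count_overlaps(causal, [window]),
    "n_common_var_in_peak": count_overlaps(common, peaks),
    "n_common_var_in_background": count_overlaps(common, [window]),
}
print(counts)
```

In this toy case the common variant at position 2,000,000 lies outside the window, so it contributes to neither count.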

It can be used to evaluate the peak-to-gene links predicted by different models, such as Cell2net, SCARLink and SCENT.

Parameters:
  • df_p2g (DataFrame) – DataFrame containing peak-to-gene (P2G) link information. Expected columns: [‘peak’, ‘gene’]. The peak column should be formatted as ‘chrom-start-end’.

  • causal_var (DataFrame) – DataFrame containing causal variant information with columns [‘Chromosome’, ‘Start’, ‘End’, ‘Gene’]. Each row represents a link between a causal variant and its target gene.

  • common_var (DataFrame) – DataFrame containing common variant information with columns [‘Chromosome’, ‘Start’, ‘End’].

  • gene_gtf (str) – File path to the gene annotation GTF file, used to extract the TSS of each gene when building background regions.

  • ref_fasta (str) – File path to the reference genome FASTA file.

  • up_stream (int (default: 500000)) – Number of base pairs upstream of the TSS to include in the background region.

  • down_stream (int (default: 500000)) – Number of base pairs downstream of the TSS to include in the background region.
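As an illustration of the ‘chrom-start-end’ peak format expected in df_p2g, the following sketch splits a peak string into its components. `parse_peak` is a hypothetical helper, not part of Cell2net. Splitting from the right is a design choice made here so that a chromosome name containing a hyphen would still parse.

```python
def parse_peak(peak):
    """Split a 'chrom-start-end' string into (chrom, start, end)."""
    # rsplit keeps any hyphen inside the chromosome name intact.
    chrom, start, end = peak.rsplit("-", 2)
    return chrom, int(start), int(end)


parsed = [parse_peak(p) for p in ["chr1-1000-2000", "chr1-3000-4000"]]
print(parsed)
```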

Return type:

DataFrame

Returns:

A DataFrame summarizing the enrichment results for each gene with the following columns:

  • gene: Gene name.

  • n_causal_var_in_peak: Number of causal variants overlapping peaks.

  • n_causal_var_in_background: Number of causal variants within the background regions.

  • n_common_var_in_peak: Number of common variants overlapping peaks.

  • n_common_var_in_background: Number of common variants within the background regions.

  • enrichment: Enrichment score (ratio of causal-to-common variants in peaks vs. background regions).

Notes

  • The function utilizes PyRanges for efficient genomic range operations.

  • Variants not overlapping the background regions or peaks are excluded from the enrichment calculation.

  • If no common variants are found within a gene’s background regions, that gene is skipped.

  • Enrichment is computed as follows:

    \(enrichment = \frac{n\_causal\_var\_in\_peak}{n\_causal\_var\_in\_background} / \frac{n\_common\_var\_in\_peak}{n\_common\_var\_in\_background}\)

  • The ratio is undefined when no common variants overlap peaks or no causal variants fall within the background regions, because a denominator in the formula above is zero.
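The enrichment formula in the notes above can be written out as a short sketch. `enrichment_score` is a hypothetical helper. How the library itself handles a zero denominator is not stated here, so returning NaN in that case is an assumption made for illustration only.

```python
import math


def enrichment_score(n_causal_peak, n_causal_bg, n_common_peak, n_common_bg):
    """(causal in peak / causal in background) / (common in peak / common in background)."""
    if n_causal_bg == 0 or n_common_bg == 0 or n_common_peak == 0:
        # Assumption: report an undefined ratio as NaN.
        return math.nan
    return (n_causal_peak / n_causal_bg) / (n_common_peak / n_common_bg)


# 3 of 4 causal variants fall in peaks, versus 1 of 8 common variants.
score = enrichment_score(3, 4, 1, 8)
print(score)
```

With these made-up counts, 0.75 / 0.125 gives an enrichment of 6, meaning causal variants are six times more concentrated in the linked peaks than common variants are.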

Example

>>> import pandas as pd
>>> from cell2net.evaluation import causal_var_enrichment_in_peaks
>>> df_p2g = pd.DataFrame({
...     'peak': ['chr1-1000-2000', 'chr1-3000-4000'],
...     'gene': ['GeneA', 'GeneB']
... })
>>> causal_var = pd.DataFrame({
...     'Chromosome': ['chr1', 'chr1'],
...     'Start': [1500, 3500],
...     'End': [1501, 3501],
...     'Gene': ['GeneA', 'GeneB']
... })
>>> common_var = pd.DataFrame({
...     'Chromosome': ['chr1', 'chr1'],
...     'Start': [1600, 3600],
...     'End': [1601, 3601]
... })
>>> gene_gtf = "path/to/annotation.gtf"
>>> ref_fasta = "path/to/reference.fasta"
>>> result = causal_var_enrichment_in_peaks(
...     df_p2g, causal_var, common_var, gene_gtf, ref_fasta
... )
>>> print(result)
    gene  n_causal_var_in_peak  n_causal_var_in_background  n_common_var_in_peak  n_common_var_in_background  enrichment
0  GeneA                     1                           1                     1                           1    1.000000
1  GeneB                     1                           1                     1                           1    1.000000