cell2net.preprocessing.peak_to_gene#

cell2net.preprocessing.peak_to_gene(mdata, rna_mod='rna', atac_mod='atac', gene_name_col='gene_names', up_stream=500000, down_stream=500000, ref_fasta='', chr_var_key='chr', start_var_key='start', end_var_key='end', highly_variable=False, genes=None, min_n_peaks=1, max_pct_dropout_by_counts=None, inplace=True)#

Link peaks to genes based on proximity to transcription start sites (TSS).

This function assigns ATAC-seq peaks to genes based on their proximity to transcription start sites (TSS) within specified upstream and downstream distances. The resulting mapping can be added to the uns attribute of the provided MuData object or returned as a DataFrame.

Parameters:
  • mdata (MuData) – A MuData object containing RNA and ATAC modalities.

  • rna_mod (str (default: 'rna')) – Name of the RNA modality in the MuData object.

  • atac_mod (str (default: 'atac')) – Name of the ATAC modality in the MuData object.

  • gene_name_col (str (default: 'gene_names')) – Column name in the RNA .var attribute that contains gene names.

  • up_stream (int (default: 500000)) – Distance upstream of the TSS to consider for assigning peaks.

  • down_stream (int (default: 500000)) – Distance downstream of the TSS to consider for assigning peaks.

  • ref_fasta (str (default: '')) – Path to the reference FASTA file for determining genome bounds. The file must be indexed.

  • chr_var_key (str (default: 'chr')) – Key in ATAC .var that contains chromosome names.

  • start_var_key (str (default: 'start')) – Key in ATAC .var that contains peak start positions.

  • end_var_key (str (default: 'end')) – Key in ATAC .var that contains peak end positions.

  • highly_variable (bool (default: False)) – If True, only consider highly variable genes.

  • genes (list[str] | None (default: None)) – Specific genes to include in the mapping. If None, all genes are considered.

  • min_n_peaks (int (default: 1)) – Minimum number of associated peaks required for a gene to be included.

  • max_pct_dropout_by_counts (float | None (default: None)) – Maximum percentage of dropout by counts for filtering genes. If None, no filtering is applied and all genes are used.

  • inplace (bool (default: True)) – If True, the resulting mapping is added to the uns attribute of the MuData object under the key “peak_to_gene”. If False, the mapping is returned as a DataFrame.

Return type:

DataFrame | None

Returns:

If inplace is False, returns a DataFrame with columns:

  • gene: Gene name.

  • peak: Peak identifier.

  • distance: Distance from the TSS to the peak summit.

Otherwise, modifies the MuData object in place.

Raises:

AssertionError – If gene TSS coordinates are not found in the RNA modality (adata_rna.uns[“gene_tss_coord”]).

Notes

  • Peaks are assigned to genes based on overlap with genomic regions defined by the upstream and downstream distances from the TSS.

  • Genes without any associated peaks are excluded from the output.

  • Peak summits are calculated as the midpoint of their start and end positions.

Examples

>>> from mudata import MuData
>>> import anndata as ad
>>> import pandas as pd
>>> import cell2net as cn
>>> mdata = MuData({
...     "rna": ad.AnnData(var=pd.DataFrame({"gene_names": ["gene1", "gene2"]})),
...     "atac": ad.AnnData(var=pd.DataFrame({
...         "chr": ["chr1", "chr1"],
...         "start": [100, 200],
...         "end": [150, 250]
...     }))
... })
>>> mdata["rna"].uns["gene_tss_coord"] = pd.DataFrame({
...     "gene_name": ["gene1", "gene2"],
...     "tss": [125, 225],
...     "strand": ["+", "-"],
...     "chrom": ["chr1", "chr1"]
... })
>>> df = peak_to_gene(mdata, ref_fasta="genome.fa", inplace=False)
>>> print(df.head())
    gene  peak  distance
0   gene1     0        25
1   gene2     1        25