cell2net.preprocessing.get_gene_tss_coord
- cell2net.preprocessing.get_gene_tss_coord(gene_gtf, feature_type='gene')
Extract transcription start site (TSS) coordinates for genes from a GTF file.
This function parses a GTF (Gene Transfer Format) file to extract the transcription start site (TSS) of genes or other specified features. It returns a pandas DataFrame with the chromosome, gene name, strand, and TSS for each gene.
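- Parameters:
gene_gtf: Path to the GTF file to parse.
feature_type: Feature type that a line must match to be kept; defaults to 'gene'.
- Return type:
pandas DataFrame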
- Returns:
A DataFrame containing the following columns:
chrom: Chromosome name (str).
gene_name: Gene name extracted from the “gene_name” attribute (str).
strand: Strand of the gene (‘+’ or ‘-’) (str).
tss: Transcription start site position (int).
Notes
This function assumes that the “gene_name” attribute is present in the GTF file’s attributes field and is enclosed in double quotes.
The TSS is calculated as the start position for ‘+’ strand genes and the end position for ‘-’ strand genes.
Lines in the GTF file that do not have 9 columns or do not match the specified feature type are skipped.
Duplicate gene names are removed, keeping only the first occurrence.
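The rules in the Notes above can be sketched in plain Python. This is a minimal illustration and not the cell2net implementation: the helper name sketch_tss_rows and the regular expression are hypothetical, it works on an iterable of already-decoded lines instead of a file path, and it returns a list of tuples instead of a DataFrame. Skipping lines that lack a gene_name attribute, and treating any strand other than '+' like '-', are assumptions of the sketch.

```python
import re

# Hypothetical pattern: gene_name followed by a double-quoted value.
GENE_NAME_RE = re.compile(r'gene_name "([^"]+)"')


def sketch_tss_rows(lines, feature_type="gene"):
    """Return (chrom, gene_name, strand, tss) tuples from GTF lines."""
    rows = []
    seen = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Skip lines without exactly 9 columns or with another feature type.
        if len(fields) != 9 or fields[2] != feature_type:
            continue
        chrom, strand, attrs = fields[0], fields[6], fields[8]
        start, end = int(fields[3]), int(fields[4])
        match = GENE_NAME_RE.search(attrs)
        if match is None:
            # Assumption of this sketch: lines without gene_name are skipped.
            continue
        name = match.group(1)
        # Keep only the first occurrence of each gene name.
        if name in seen:
            continue
        seen.add(name)
        # '+' strand: TSS is the start; otherwise (assumed '-'): the end.
        tss = start if strand == "+" else end
        rows.append((chrom, name, strand, tss))
    return rows


example_lines = [
    "#!header line without nine columns",
    'chr1\tsrc\tgene\t1000\t1500\t.\t+\t.\tgene_id "G1"; gene_name "GeneA";',
    'chr1\tsrc\tgene\t1800\t2000\t.\t-\t.\tgene_id "G2"; gene_name "GeneB";',
    'chr1\tsrc\texon\t1000\t1200\t.\t+\t.\tgene_id "G1"; gene_name "GeneA";',
    'chr1\tsrc\tgene\t5000\t6000\t.\t+\t.\tgene_id "G3"; gene_name "GeneA";',
]
rows = sketch_tss_rows(example_lines)
```

Here rows holds one tuple for GeneA (TSS 1000, the start of the '+' strand gene) and one for GeneB (TSS 2000, the end of the '-' strand gene). The exon line and the second GeneA entry are dropped. A table with the documented columns could then be built with pd.DataFrame(rows, columns=["chrom", "gene_name", "strand", "tss"]).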
Examples
Extract TSS information for genes from a GTF file:

>>> from cell2net.preprocessing import get_gene_tss_coord
>>> gene_gtf = "path/to/genes.gtf.gz"
>>> df = get_gene_tss_coord(gene_gtf)
>>> print(df.head())
  chrom gene_name strand   tss
0  chr1     GeneA      +  1000
1  chr1     GeneB      -  2000
2  chr2     GeneC      +  3000
3  chr2     GeneD      -  4000