cell2net.preprocessing.fragments_to_coverage

cell2net.preprocessing.fragments_to_coverage(df_fragments, chrom_sizes, normalize=True, scaling_factor=1.0, cut_sites=False, extend_cut_sites=0)

Convert fragment data to genome coverage signal.

This function processes fragment data and generates a genome coverage or cut-site signal, which can be used to create BigWig files or similar outputs.

Parameters:
  • df_fragments (DataFrame) – A Polars DataFrame containing fragment data. Must include the columns: ‘Chromosome’, ‘Start’, and ‘End’.

  • chrom_sizes (dict[str, int]) – Dictionary mapping chromosome names to their respective sizes.

  • normalize (bool (default: True)) – If True, normalize the coverage values to Reads Per Million (RPM). Default is True.

  • scaling_factor (float (default: 1.0)) – A scaling factor to apply to the signal values. Only used if normalize is True. Default is 1.0.

  • cut_sites (bool (default: False)) – If True, use 1 bp Tn5 cut sites (the start and end of each fragment) instead of the whole fragment when computing coverage.

  • extend_cut_sites (int (default: 0)) – If cut_sites is True, extend each cut site by this many base pairs both upstream and downstream. Default is 0.
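To make the interaction between cut_sites and extend_cut_sites concrete, the following is a minimal sketch of how one fragment could map to two cut-site intervals. It is not taken from cell2net: the helper name `cut_site_intervals`, the half-open BED-style coordinates, the placement of the end cut site at `end - 1`, and the clipping to chromosome bounds are all assumptions made for illustration.

```python
def cut_site_intervals(start, end, extend=0, chrom_size=None):
    """Return two half-open (lo, hi) cut-site intervals for one fragment.

    Hypothetical helper: assumes BED-style half-open coordinates, so the
    1 bp cut sites sit at `start` and at `end - 1`.
    """
    intervals = []
    for pos in (start, end - 1):
        # Extend symmetrically, then clip to the chromosome bounds.
        lo = max(pos - extend, 0)
        hi = pos + 1 + extend
        if chrom_size is not None:
            hi = min(hi, chrom_size)
        intervals.append((lo, hi))
    return intervals


# With no extension each cut site covers exactly 1 bp.
print(cut_site_intervals(100, 150))
# With extend=2 each cut site covers 5 bp (2 upstream + 1 + 2 downstream).
print(cut_site_intervals(100, 150, extend=2))
```

Under these assumptions, a non-zero extend_cut_sites widens each 1 bp site to `2 * extend_cut_sites + 1` positions, except where the site is clipped at a chromosome boundary.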

Yields:

A tuple containing:

  • chroms (numpy.ndarray): Chromosome names for each coverage interval.

  • starts (numpy.ndarray): Start positions of coverage intervals.

  • ends (numpy.ndarray): End positions of coverage intervals.

  • values (numpy.ndarray): Signal values for each coverage interval.
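Because the four arrays in each yielded tuple are parallel, they can be zipped together and written out line by line. The sketch below writes bedGraph-style text, which is a common intermediate step before BigWig conversion. The helper name `write_bedgraph` is hypothetical, and the stand-in `results` list only mimics the documented tuple layout; it is not real output from fragments_to_coverage.

```python
import io


def write_bedgraph(results, handle):
    """Write (chroms, starts, ends, values) tuples as bedGraph lines.

    Hypothetical helper: accepts any iterable of tuples whose four
    members are parallel sequences (lists or NumPy arrays).
    """
    for chroms, starts, ends, values in results:
        for chrom, start, end, value in zip(chroms, starts, ends, values):
            handle.write(f"{chrom}\t{int(start)}\t{int(end)}\t{float(value):g}\n")


# Stand-in data shaped like the documented yield; values are made up.
results = [(["chr1", "chr1"], [100, 120], [120, 150], [1.0, 2.0])]
buffer = io.StringIO()
write_bedgraph(results, buffer)
print(buffer.getvalue(), end="")
```

Passing an open file handle instead of the `io.StringIO` buffer would write the same lines to disk.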

Notes

  • The df_fragments DataFrame is partitioned by chromosome for efficient processing.

  • The chrom_sizes dictionary defines the size of each chromosome and is used to initialize arrays.

  • If cut_sites is True, the coverage is computed at the fragment boundaries rather than the entire fragment range.

  • Normalization scales the signal to RPM, and an additional scaling factor can further adjust the signal values.
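The steps in these notes can be sketched in plain Python for a single chromosome. This is not cell2net's implementation: the helper name `coverage_intervals`, the difference-array approach, the choice to drop zero-coverage runs, and the RPM denominator (the number of fragments passed in, or the hypothetical `total_fragments` argument if given) are assumptions made for illustration.

```python
from itertools import accumulate


def coverage_intervals(fragments, chrom_size, normalize=True,
                       scaling_factor=1.0, total_fragments=None):
    """Collapse (start, end) fragments into runs of constant coverage."""
    # Difference array sized from the chromosome length: +1 where a
    # fragment begins, -1 just past where it ends (half-open intervals).
    diff = [0] * (chrom_size + 1)
    for start, end in fragments:
        diff[max(start, 0)] += 1
        diff[min(end, chrom_size)] -= 1
    depth = list(accumulate(diff[:chrom_size]))

    # Assumed RPM scaling; scaling_factor only applies when normalizing.
    scale = 1.0
    if normalize:
        n = total_fragments if total_fragments is not None else len(fragments)
        scale = scaling_factor * 1e6 / n

    # Run-length encode the per-base depth, skipping zero-coverage runs.
    starts, ends, values = [], [], []
    run_start = 0
    for pos in range(1, chrom_size + 1):
        if pos == chrom_size or depth[pos] != depth[run_start]:
            if depth[run_start] != 0:
                starts.append(run_start)
                ends.append(pos)
                values.append(depth[run_start] * scale)
            run_start = pos
    return starts, ends, values


fragments = [(100, 150), (120, 200)]
print(coverage_intervals(fragments, 1000, normalize=False))
print(coverage_intervals(fragments, 1000, normalize=True))
```

A difference array keeps the cost proportional to the number of fragments plus the chromosome length, rather than to the total number of covered bases.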

Examples

>>> import polars as pl
>>> import cell2net as cn
>>> df_fragments = pl.DataFrame(
...     {"Chromosome": ["chr1", "chr1", "chr2"], "Start": [100, 200, 300], "End": [150, 250, 350]}
... )
>>> chrom_sizes = {"chr1": 1000, "chr2": 500}
>>> results = cn.pp.fragments_to_coverage(df_fragments, chrom_sizes, normalize=False)
>>> for chroms, starts, ends, values in results:
...     print(chroms, starts, ends, values)