cell2net.preprocessing.add_dna_sequence

cell2net.preprocessing.add_dna_sequence(mdata, ref_fasta, mod_name='atac', chr_var_key='chr', start_var_key='start', end_var_key='end', sequence_var_key='dna_sequence')

Add DNA sequences to the peak metadata of a MuData object.

This function retrieves the DNA sequence of each genomic region listed in the .var attribute of the AnnData object for the selected modality (mod_name) of a MuData object. The sequences are fetched from a reference FASTA file and stored in .var under the key given by sequence_var_key.
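
The lookup described above can be pictured with a minimal sketch. It is not the cell2net implementation: fetch_sequences is a hypothetical helper, a plain dict stands in for the indexed FASTA file, a dict of columns stands in for .var, and 0-based, half-open coordinates are assumed because this page does not state the convention.

```python
def fetch_sequences(var, genome, chr_key="chr", start_key="start", end_key="end"):
    """Return one sequence per region (hypothetical helper, for illustration only).

    `var` is any column-indexable table (a dict of lists here, a DataFrame
    would work the same way). `genome` maps chromosome names to sequences
    and stands in for the indexed FASTA file.
    """
    return [
        # Assumption: coordinates are 0-based and half-open, like Python slices.
        genome[chrom][start:end]
        for chrom, start, end in zip(var[chr_key], var[start_key], var[end_key])
    ]


# Toy reference and peak table.
genome = {"chr1": "ACGTACGTAC", "chr2": "TTTTGGGGCC"}
var = {"chr": ["chr1", "chr2"], "start": [2, 0], "end": [6, 4]}

# Store the result under a new key, mirroring sequence_var_key.
var["dna_sequence"] = fetch_sequences(var, genome)
print(var["dna_sequence"])
```

Each row yields exactly one sequence, so the new column always has the same length as the peak table.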

Parameters:

mdata (MuData) – A MuData object containing the modality with peak metadata.
ref_fasta (str) – Path to the reference FASTA file. This file must be indexed (e.g., with samtools faidx).
mod_name (str (default: 'atac')) – The name of the modality containing peak data.
chr_var_key (str (default: 'chr')) – The key in .var that contains chromosome names.
start_var_key (str (default: 'start')) – The key in .var that contains the start positions of peaks.
end_var_key (str (default: 'end')) – The key in .var that contains the end positions of peaks.
sequence_var_key (str (default: 'dna_sequence')) – The key under which the retrieved DNA sequences will be stored in .var.
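
Because the three coordinate keys must all be present in .var, it can help to check the peak table before calling the function. The sketch below uses a hypothetical helper, validate_peak_columns, that is not part of cell2net. The rule that a valid peak satisfies 0 <= start < end is an assumption, not something this page specifies.

```python
def validate_peak_columns(var, chr_key="chr", start_key="start", end_key="end"):
    """Check that the coordinate columns exist and list suspicious rows.

    Hypothetical helper. Works on any table that supports `key in table`
    and `table[key]`, such as a dict of lists or a DataFrame.
    """
    missing = [key for key in (chr_key, start_key, end_key) if key not in var]
    if missing:
        raise KeyError(f"Missing coordinate columns: {missing}")

    # Assumption: a valid peak has a non-negative start that is below its end.
    return [
        position
        for position, (start, end) in enumerate(zip(var[start_key], var[end_key]))
        if not 0 <= start < end
    ]


peaks = {
    "chr": ["chr1", "chr2", "chr3"],
    "start": [100, 250, -5],
    "end": [150, 200, 10],
}
bad_rows = validate_peak_columns(peaks)
print(bad_rows)
```

The helper returns row positions rather than raising, so the caller can decide whether to drop or repair the affected peaks.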

Return type:

None

Returns:

None. The function modifies the MuData object in place, storing the retrieved sequences in the .var attribute of the selected modality under sequence_var_key.

Raises:

AssertionError – If the specified modality (mod_name) is not found in the MuData object.
FileNotFoundError – If the ref_fasta file does not exist or is not properly indexed.
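
To catch the FileNotFoundError case early, a pre-flight check along these lines may be useful. It is a sketch with a hypothetical helper, check_indexed_fasta, and it assumes the index sits next to the FASTA file with a .fai suffix, which is where samtools faidx writes it. Whether cell2net looks for the index in exactly that location is an assumption.

```python
from pathlib import Path


def check_indexed_fasta(ref_fasta):
    """Raise FileNotFoundError unless the FASTA and its .fai index both exist.

    Hypothetical helper, not part of cell2net.
    """
    fasta = Path(ref_fasta)
    # samtools faidx writes the index as "<fasta path>.fai".
    index = Path(str(fasta) + ".fai")

    missing = [str(path) for path in (fasta, index) if not path.is_file()]
    if missing:
        raise FileNotFoundError("Missing file(s): " + ", ".join(missing))
    return fasta
```

Calling check_indexed_fasta("reference.fasta") before add_dna_sequence reports which of the two files is absent.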

Examples

>>> from mudata import MuData
>>> import anndata as ad
>>> import pandas as pd
>>> import cell2net as cn
>>> data = ad.AnnData(var=pd.DataFrame({
...     "chr": ["chr1", "chr2"],
...     "start": [100, 200],
...     "end": [150, 250]
... }))
>>> mdata = MuData({"atac": data})
>>> cn.pp.add_dna_sequence(mdata, ref_fasta="reference.fasta")
>>> print(mdata["atac"].var["dna_sequence"])
0    ATCGTTGAC...
1    TGGCCAATA...