cell2net.preprocessing.add_variants_to_sequence

cell2net.preprocessing.add_variants_to_sequence(mdata, atac_mod='atac', n_cpus=1, sample_col_key='bestSample', sequence_var_key='dna_sequence', variants_key='variants', seq_with_variants_key='seq_with_variants', inplace=True)

Add genomic variants to DNA sequences from peak regions to generate personalized haplotype sequences.

This function takes a MuData object containing ATAC-seq data, reference DNA sequences for peak regions, and variant information. It applies each sample's variants to the reference sequences, producing two haplotype-specific sequences per peak and sample (seq_1 and seq_2) that reflect that sample's genotypes. Processing can run on a single core or on multiple cores.

Parameters:

mdata (MuData) – A MuData object containing the modality with ATAC-seq data and variant information.
atac_mod (str (default: 'atac')) – Name of the modality in mdata that contains the ATAC-seq data.
n_cpus (int (default: 1)) – Number of CPU cores to use. If >1, uses multiprocessing to parallelize across samples.
sample_col_key (str (default: 'bestSample')) – The name of the column in adata.obs that identifies sample IDs.
sequence_var_key (str (default: 'dna_sequence')) – The name of the column in adata.var that contains the reference DNA sequence for each peak.
variants_key (str (default: 'variants')) – The key in adata.uns that stores the variant information (as a DataFrame), including columns:
    - ‘peak’: peak ID
    - ‘sample’: sample ID
    - ‘pos’: variant position
    - ‘ref’: reference allele
    - ‘alt’: alternate allele
    - ‘genotype’: genotype (1 for het, 2 for hom-alt)
seq_with_variants_key (str (default: 'seq_with_variants')) – The key under which to store the resulting DataFrame in adata.uns, containing haplotype-aware sequences.
inplace (bool (default: True)) – If True, the resulting DataFrame is stored in adata.uns[seq_with_variants_key]. If False, the function returns the DataFrame directly.
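
As an illustration of the variant table layout described under variants_key, the sketch below checks a few rows for the documented columns and genotype codes. It is not part of cell2net: check_variant_rows, the peak and sample names, and the use of plain dictionaries in place of DataFrame rows are all assumptions made to keep the example dependency-free.

```python
REQUIRED_COLUMNS = {"peak", "sample", "pos", "ref", "alt", "genotype"}
VALID_GENOTYPES = {1, 2}  # 1 = heterozygous, 2 = homozygous alternate


def check_variant_rows(rows):
    """Hypothetical helper: return a list of problems; empty means the rows look usable."""
    problems = []
    for i, row in enumerate(rows):
        missing = REQUIRED_COLUMNS - set(row)
        if missing:
            problems.append(f"row {i}: missing columns {sorted(missing)}")
            continue
        if row["genotype"] not in VALID_GENOTYPES:
            problems.append(f"row {i}: unexpected genotype {row['genotype']!r}")
        if row["pos"] < 1:
            problems.append(f"row {i}: pos must be 1-based, got {row['pos']}")
    return problems


# Illustrative rows; the identifiers are made up.
rows = [
    {"peak": "peak_a", "sample": "s1", "pos": 2, "ref": "C", "alt": "T", "genotype": 1},
    {"peak": "peak_a", "sample": "s2", "pos": 5, "ref": "A", "alt": "G", "genotype": 0},
    {"peak": "peak_b", "sample": "s1", "pos": 3, "ref": "G", "genotype": 2},
]
problems = check_variant_rows(rows)
```

Running a check of this kind before calling the function can surface malformed rows early, although the function's own validation may differ.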

Return type:

None | DataFrame

Returns:

Returns None if inplace=True. Otherwise, returns a DataFrame with updated haplotype sequences:
    - ‘peak’: peak ID
    - ‘sample’: sample ID
    - ‘seq_1’: haplotype 1 sequence, with homozygous alternate (genotype 2) variants applied
    - ‘seq_2’: haplotype 2 sequence, with heterozygous (genotype 1) and homozygous alternate (genotype 2) variants applied

Notes

Each variant is applied to its corresponding peak and sample-specific sequence.
Assumes DNA sequences are 0-based Python strings and variant positions are 1-based.
For heterozygous (1) genotypes, only seq_2 is updated.
For homozygous alternate (2) genotypes, both seq_1 and seq_2 are updated.
Peak and sample combinations are expanded into a full grid for processing.
Setting n_cpus to a value greater than 1 parallelizes processing across samples, which can reduce runtime.
Requires the helper function update_sequence_with_variants().
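
The rules in these notes can be sketched in a few lines of plain Python. This is a minimal illustration, not the library's implementation: apply_variants and personalize are hypothetical names, plain dictionaries stand in for the DataFrames, and the sketch assumes that pos is 1-based relative to the start of the peak sequence and that every variant is a single-base substitution.

```python
from itertools import product


def apply_variants(ref_seq, variants):
    """Return (seq_1, seq_2) for one peak/sample pair.

    Assumes `pos` is 1-based relative to the start of `ref_seq` and that
    each variant replaces exactly one base (both are assumptions).
    """
    hap1 = list(ref_seq)
    hap2 = list(ref_seq)
    for v in variants:
        idx = v["pos"] - 1  # 1-based position -> 0-based string index
        if v["genotype"] in (1, 2):
            hap2[idx] = v["alt"]  # het and hom-alt both change haplotype 2
        if v["genotype"] == 2:
            hap1[idx] = v["alt"]  # only hom-alt changes haplotype 1
    return "".join(hap1), "".join(hap2)


def personalize(peak_seqs, samples, variant_rows):
    """Expand peaks x samples into a full grid and apply each pair's variants."""
    by_pair = {}
    for row in variant_rows:
        by_pair.setdefault((row["peak"], row["sample"]), []).append(row)
    out = []
    for peak, sample in product(peak_seqs, samples):
        pair_variants = by_pair.get((peak, sample), [])
        seq_1, seq_2 = apply_variants(peak_seqs[peak], pair_variants)
        out.append({"peak": peak, "sample": sample, "seq_1": seq_1, "seq_2": seq_2})
    return out


# Illustrative inputs; the identifiers and sequences are made up.
peak_seqs = {"peak_a": "ACGTAC", "peak_b": "GGGCCC"}
samples = ["s1", "s2"]
variant_rows = [
    {"peak": "peak_a", "sample": "s1", "pos": 2, "ref": "C", "alt": "T", "genotype": 1},
    {"peak": "peak_a", "sample": "s1", "pos": 5, "ref": "A", "alt": "G", "genotype": 2},
]
result = personalize(peak_seqs, samples, variant_rows)
# result[0] is the (peak_a, s1) row: seq_1 == "ACGTGC", seq_2 == "ATGTGC"
```

Pairs without any variants keep the reference sequence on both haplotypes, which is why the full grid contains one row per peak and sample even when the variant table is sparse.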

Raises:

Logs errors if:

- The specified modality or keys are not found in the MuData object.
- The reference allele in the sequence does not match the variant’s reference base.
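
The reference allele check can be pictured as follows. This is a sketch under stated assumptions rather than the function's actual code: ref_matches and the logger name are hypothetical, pos is taken to be 1-based relative to the start of the sequence, and the out-of-range guard is an addition that the documentation does not mention.

```python
import logging

logger = logging.getLogger("variant_check")  # hypothetical logger name


def ref_matches(seq, pos, ref):
    """Return True if `seq` carries `ref` at 1-based position `pos`; log an error otherwise."""
    idx = pos - 1  # 1-based position -> 0-based string index
    if idx < 0 or idx + len(ref) > len(seq):
        # Extra guard (an assumption): the position must fall inside the sequence.
        logger.error("Position %d is outside a sequence of length %d", pos, len(seq))
        return False
    observed = seq[idx:idx + len(ref)]
    if observed != ref:
        logger.error(
            "Reference mismatch at position %d: expected %s, found %s",
            pos, ref, observed,
        )
        return False
    return True


ok = ref_matches("ACGTAC", 2, "C")            # base at position 2 is C
bad = ref_matches("ACGTAC", 2, "G")           # mismatch, logged as an error
out_of_range = ref_matches("ACGTAC", 7, "A")  # past the end, logged as an error
```

Logging instead of raising lets a run continue past a single inconsistent variant, which matches the "logs errors" wording above; whether the affected variant is then skipped or still applied is not stated in this documentation.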