cell2net.preprocessing.add_variants_to_sequence
- cell2net.preprocessing.add_variants_to_sequence(mdata, atac_mod='atac', n_cpus=1, sample_col_key='bestSample', sequence_var_key='dna_sequence', variants_key='variants', seq_with_variants_key='seq_with_variants', inplace=True)
Add genomic variants to DNA sequences from peak regions to generate personalized haplotype sequences.
This function takes a MuData object containing ATAC-seq data, reference DNA sequences for peak regions, and variant information. For each sample, it applies that sample's variants to the reference sequences and produces two haplotype-specific sequences (seq_1 and seq_2) that reflect the sample's genotypes. Processing runs on a single core by default, or across multiple cores when n_cpus is greater than 1.
- Parameters:
mdata (MuData) – A MuData object containing the modality with ATAC-seq data and variant information.
atac_mod (str, default: 'atac') – Name of the modality in mdata that contains the ATAC-seq data.
n_cpus (int, default: 1) – Number of CPU cores to use. If >1, uses multiprocessing to parallelize across samples.
sample_col_key (str, default: 'bestSample') – The name of the column in adata.obs that identifies sample IDs.
sequence_var_key (str, default: 'dna_sequence') – The name of the column in adata.var that contains the reference DNA sequence for each peak.
variants_key (str, default: 'variants') – The key in adata.uns that stores the variant information (as a DataFrame), including columns:
  - 'peak': peak ID
  - 'sample': sample ID
  - 'pos': variant position
  - 'ref': reference allele
  - 'alt': alternate allele
  - 'genotype': genotype (1 for het, 2 for hom-alt)
seq_with_variants_key (str, default: 'seq_with_variants') – The key under which to store the resulting DataFrame in adata.uns, containing haplotype-aware sequences.
inplace (bool, default: True) – If True, the resulting DataFrame is stored in adata.uns[seq_with_variants_key]. If False, the function returns the DataFrame directly.
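The variant table described under variants_key can be pictured as a list of records, one per variant. The following is a minimal sketch using plain Python dictionaries rather than a DataFrame; the peak IDs, sample IDs, and the helper `missing_columns` are hypothetical and only illustrate the documented column layout.

```python
# Columns listed in the variants_key description.
REQUIRED_COLUMNS = ("peak", "sample", "pos", "ref", "alt", "genotype")

# Hypothetical example records (peak and sample IDs are made up).
variants = [
    {"peak": "peak_a", "sample": "S1", "pos": 3, "ref": "G", "alt": "T", "genotype": 1},
    {"peak": "peak_a", "sample": "S2", "pos": 1, "ref": "A", "alt": "C", "genotype": 2},
]


def missing_columns(records):
    """Return the set of required columns absent from at least one record."""
    missing = set()
    for rec in records:
        missing |= set(REQUIRED_COLUMNS) - set(rec)
    return missing


print(missing_columns(variants))
```

Records in this shape could be turned into a table with `pandas.DataFrame(variants)` before being stored under the variants key; that step is left out here to keep the sketch dependency-free.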
- Return type:
None or DataFrame
- Returns:
Returns None if inplace=True. Otherwise, returns a DataFrame with updated haplotype sequences, with columns:
- 'peak': peak ID
- 'sample': sample ID
- 'seq_1': haplotype 1 sequence; per the Notes, only homozygous-alternate (2) variants are applied
- 'seq_2': haplotype 2 sequence; per the Notes, both heterozygous (1) and homozygous-alternate (2) variants are applied
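A call that returns the result instead of storing it might look like the sketch below. It uses only the parameters from the signature above; the wrapper name `personalized_sequences` is hypothetical, and `mdata` is assumed to be a MuData object that already holds the reference sequences and the variant table.

```python
def personalized_sequences(mdata, n_cpus=4):
    """Hypothetical wrapper: return the haplotype sequences as a DataFrame."""
    # Imported lazily so the wrapper can be defined without cell2net installed.
    from cell2net.preprocessing import add_variants_to_sequence

    return add_variants_to_sequence(
        mdata,
        atac_mod="atac",
        n_cpus=n_cpus,  # values above 1 parallelize across samples
        sample_col_key="bestSample",
        sequence_var_key="dna_sequence",
        variants_key="variants",
        inplace=False,  # return the DataFrame instead of writing to adata.uns
    )
```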
Notes
Each variant is applied to its corresponding peak and sample-specific sequence.
Assumes DNA sequences are 0-based Python strings and variant positions are 1-based.
For heterozygous (1) genotypes, only seq_2 is updated.
For homozygous alternate (2) genotypes, both seq_1 and seq_2 are updated.
Peaks and sample combinations are expanded into a full grid for processing.
With n_cpus greater than 1, processing is parallelized across samples using multiprocessing, which can reduce run time.
Requires the helper function update_sequence_with_variants().
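The genotype rules and the peak-by-sample grid in these notes can be sketched in a few lines of plain Python. This is not the library's update_sequence_with_variants(); the function `apply_variants_sketch`, the peak IDs, and the sample IDs are hypothetical. The sketch assumes single-base substitutions only and assumes that pos is counted from the start of the peak sequence.

```python
from itertools import product


def apply_variants_sketch(ref_seq, variants):
    """Return (seq_1, seq_2) after applying single-base variants.

    variants: iterable of (pos, ref, alt, genotype) tuples; pos is 1-based.
    """
    hap1, hap2 = list(ref_seq), list(ref_seq)
    for pos, ref, alt, genotype in variants:
        idx = pos - 1  # 1-based variant position -> 0-based string index
        if not 0 <= idx < len(ref_seq) or ref_seq[idx] != ref:
            continue  # out of range or reference mismatch: skipped in this sketch
        if genotype == 2:  # homozygous alternate: haplotype 1 is updated too
            hap1[idx] = alt
        if genotype in (1, 2):  # heterozygous or homozygous alternate
            hap2[idx] = alt
    return "".join(hap1), "".join(hap2)


# Hypothetical inputs: reference sequence per peak, variants per (peak, sample).
peaks = {"peak_a": "ACGTACGT", "peak_b": "TTTTGGGG"}
samples = ["S1", "S2"]
variants_by_key = {
    ("peak_a", "S1"): [(3, "G", "T", 1)],
    ("peak_a", "S2"): [(1, "A", "C", 2)],
}

# Expand every peak/sample combination, including those without variants.
rows = []
for peak, sample in product(peaks, samples):
    seq_1, seq_2 = apply_variants_sketch(
        peaks[peak], variants_by_key.get((peak, sample), [])
    )
    rows.append({"peak": peak, "sample": sample, "seq_1": seq_1, "seq_2": seq_2})

for row in rows:
    print(row)
```

Combinations with no variants keep the reference sequence in both seq_1 and seq_2, which is why the full grid still yields one row per peak and sample in this sketch.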
- Raises:
Logs errors if:
- The specified modality or keys are not found in the MuData object.
- The reference allele in the sequence does not match the variant's reference base.
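The two error conditions could be checked along the following lines. This is a sketch using Python's standard logging module, not the library's own code; the helper names `has_key` and `ref_allele_matches` and the logger name are hypothetical, and pos is assumed to be 1-based and counted from the start of the sequence.

```python
import logging

logger = logging.getLogger("variant_checks_sketch")  # hypothetical logger name


def has_key(container, key, what):
    """Return True if key is in the mapping; otherwise log an error."""
    if key not in container:
        logger.error("%s %r not found", what, key)
        return False
    return True


def ref_allele_matches(sequence, pos, ref):
    """Return True if ref matches the sequence at 1-based pos; otherwise log an error."""
    idx = pos - 1  # 1-based position -> 0-based string index
    observed = sequence[idx:idx + len(ref)] if idx >= 0 else ""
    if observed != ref:
        logger.error(
            "Reference mismatch at position %d: expected %r, found %r",
            pos, ref, observed,
        )
        return False
    return True


print(has_key({"atac": object()}, "atac", "Modality"))
print(ref_allele_matches("ACGT", 2, "C"))
```

Returning a boolean alongside the log message lets a caller skip the affected variant or sample without raising an exception, which matches the "logs errors" wording above.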