cell2net.preprocessing.add_variants_to_sequence

cell2net.preprocessing.add_variants_to_sequence(mdata, atac_mod='atac', n_cpus=1, sample_col_key='bestSample', sequence_var_key='dna_sequence', variants_key='variants', seq_with_variants_key='seq_with_variants', inplace=True)

Add genomic variants to DNA sequences from peak regions to generate personalized haplotype sequences.

This function takes a MuData object containing ATAC-seq data, reference DNA sequences for peak regions, and variant information. For each sample, it applies that sample's variants to the reference sequences and produces two haplotype-specific sequences (seq_1 and seq_2) that reflect the sample's genotype. Processing can run on a single core or in parallel on multiple cores.

Parameters:
  • mdata (MuData) – A MuData object containing the modality with ATAC-seq data and variant information.

  • atac_mod (str (default: 'atac')) – Name of the modality in mdata that contains the ATAC-seq data.

  • n_cpus (int (default: 1)) – Number of CPU cores to use. If >1, uses multiprocessing to parallelize across samples.

  • sample_col_key (str (default: 'bestSample')) – The name of the column in adata.obs (the .obs table of the ATAC modality) that identifies sample IDs.

  • sequence_var_key (str (default: 'dna_sequence')) – The name of the column in adata.var that contains the reference DNA sequence for each peak.

  • variants_key (str (default: 'variants')) – The key in adata.uns that stores the variant information as a DataFrame with the following columns:

      - ‘peak’: peak ID
      - ‘sample’: sample ID
      - ‘pos’: variant position
      - ‘ref’: reference allele
      - ‘alt’: alternate allele
      - ‘genotype’: genotype (1 for het, 2 for hom-alt)

  • seq_with_variants_key (str (default: 'seq_with_variants')) – The key under which to store the resulting DataFrame in adata.uns, containing haplotype-aware sequences.

  • inplace (bool (default: True)) – If True, the resulting DataFrame is stored in adata.uns[seq_with_variants_key]. If False, the function returns the DataFrame directly.

Return type:

None | DataFrame

Returns:

Returns None if inplace=True. Otherwise, returns a DataFrame of haplotype sequences with the following columns:

  - ‘peak’: peak ID
  - ‘sample’: sample ID
  - ‘seq_1’: haplotype 1 sequence, with homozygous-alternate (genotype 2) variants applied
  - ‘seq_2’: haplotype 2 sequence, with heterozygous (genotype 1) and homozygous-alternate (genotype 2) variants applied
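The layout of the result table can be illustrated with a small standard-library sketch. It pairs every peak with every sample, which mirrors the peak-by-sample grid described in the Notes. The function name expand_grid_sketch is hypothetical, and the choice to start both haplotypes from the reference sequence is an assumption made for illustration, not the library's documented behavior.

```python
from itertools import product


def expand_grid_sketch(peak_seqs, samples):
    """Build one row per (peak, sample) pair.

    peak_seqs: dict mapping peak ID -> reference sequence.
    samples: list of sample IDs.
    Assumption: both haplotypes start as the reference sequence.
    """
    return [
        {"peak": peak, "sample": sample, "seq_1": seq, "seq_2": seq}
        for (peak, seq), sample in product(peak_seqs.items(), samples)
    ]


rows = expand_grid_sketch({"peak_1": "ACGT", "peak_2": "GG"}, ["s1", "s2"])
```

Two peaks and two samples give four rows, one per combination. A DataFrame with the same columns could be built by passing rows to pandas.DataFrame.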

Notes

  • Each variant is applied to its corresponding peak and sample-specific sequence.

  • Assumes DNA sequences are 0-based Python strings and variant positions are 1-based.

  • For heterozygous (1) genotypes, only seq_2 is updated.

  • For homozygous alternate (2) genotypes, both seq_1 and seq_2 are updated.

  • Peak and sample combinations are expanded into a full grid for processing.

  • Setting n_cpus to a value greater than 1 parallelizes processing across samples, which can reduce runtime.

  • Requires the helper function update_sequence_with_variants().
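The genotype rules above can be sketched in plain Python for a single peak and sample. This is a minimal illustration, not the implementation of update_sequence_with_variants(): the function name apply_variants_sketch is hypothetical, ‘pos’ is assumed to be relative to the start of the peak sequence, variants are assumed not to overlap, and a reference mismatch raises an exception here rather than being logged.

```python
def apply_variants_sketch(ref_seq, variants):
    """Return (seq_1, seq_2) for one peak/sample pair.

    variants: iterable of dicts with keys 'pos' (1-based, assumed relative
    to the start of ref_seq), 'ref', 'alt' and 'genotype' (1 = het,
    2 = hom-alt).
    """
    seq_1 = seq_2 = ref_seq
    # Work right to left so an edit never shifts positions still to be processed.
    for var in sorted(variants, key=lambda v: v["pos"], reverse=True):
        start = var["pos"] - 1  # 1-based position -> 0-based string index
        end = start + len(var["ref"])
        if ref_seq[start:end] != var["ref"]:
            raise ValueError(
                f"reference mismatch at pos {var['pos']}: "
                f"expected {var['ref']!r}, found {ref_seq[start:end]!r}"
            )
        if var["genotype"] == 2:  # hom-alt: haplotype 1 is updated as well
            seq_1 = seq_1[:start] + var["alt"] + seq_1[end:]
        if var["genotype"] in (1, 2):  # het and hom-alt: haplotype 2 is updated
            seq_2 = seq_2[:start] + var["alt"] + seq_2[end:]
    return seq_1, seq_2


example_variants = [
    {"pos": 2, "ref": "C", "alt": "T", "genotype": 1},
    {"pos": 5, "ref": "A", "alt": "G", "genotype": 2},
]
seq_1, seq_2 = apply_variants_sketch("ACGTACGT", example_variants)
# seq_1 == "ACGTGCGT" (hom-alt variant only)
# seq_2 == "ATGTGCGT" (both variants)
```

Applying edits from the highest position downward keeps the indices of the remaining variants valid even when ‘ref’ and ‘alt’ differ in length.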

Raises:

Logs errors if:

  • The specified modality or keys are not found in the MuData object.

  • The reference allele in the sequence does not match the variant’s reference base.
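A call using the documented signature might look like the sketch below. The wrapper name personalized_sequences is hypothetical, and mdata is assumed to be a MuData object that already holds the peak sequences and the variants table under the default keys.

```python
def personalized_sequences(mdata, n_cpus=1):
    """Return the haplotype sequence table instead of storing it in place.

    Assumes `mdata` already contains the 'atac' modality with reference
    sequences and a variants table under the default keys.
    """
    # Imported lazily so this sketch can be defined without cell2net installed.
    from cell2net.preprocessing import add_variants_to_sequence

    return add_variants_to_sequence(
        mdata,
        atac_mod="atac",
        n_cpus=n_cpus,
        sample_col_key="bestSample",
        sequence_var_key="dna_sequence",
        variants_key="variants",
        seq_with_variants_key="seq_with_variants",
        inplace=False,  # return the DataFrame rather than writing to adata.uns
    )
```

With inplace=True (the default), the function returns None and the table is stored under adata.uns[seq_with_variants_key] instead.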