cell2net.preprocessing.update_sequence_with_variants

cell2net.preprocessing.update_sequence_with_variants(df_seq, df_variants)

Update reference DNA sequences with genomic variants based on genotype information.

This function modifies a pair of haplotype sequences (seq_1 and seq_2) stored in df_seq, using variant data from df_variants. For each variant, the reference allele is checked against the sequence at the given position, and if matched, the sequence is updated depending on the genotype:

  • Genotype 1 (heterozygous): update seq_2 with the ALT allele

  • Genotype 2 (homozygous alt): update both seq_1 and seq_2 with the ALT allele
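The genotype rule above can be sketched for a single variant as follows. This is a minimal illustration, not the library's implementation: the helper name `apply_variant` and its 0-based `offset` argument are hypothetical, and the sketch assumes that any other genotype code (for example 0) leaves both haplotypes unchanged, which the documentation does not state.

```python
def apply_variant(seq_1, seq_2, offset, ref, alt, genotype):
    """Apply one variant to a haplotype pair (illustrative sketch only).

    `offset` is the 0-based index of the variant within the sequences;
    deriving it from 'pos' and 'start' is left to the caller.
    """
    end = offset + len(ref)

    def substitute(seq):
        # The reference allele must match the sequence before it is replaced.
        assert seq[offset:end] == ref, "reference allele does not match sequence"
        return seq[:offset] + alt + seq[end:]

    if genotype == 1:
        # Heterozygous: only haplotype 2 receives the ALT allele.
        seq_2 = substitute(seq_2)
    elif genotype == 2:
        # Homozygous alt: both haplotypes receive the ALT allele.
        seq_1 = substitute(seq_1)
        seq_2 = substitute(seq_2)
    # Assumption: any other genotype code leaves both sequences unchanged.
    return seq_1, seq_2


het = apply_variant("ACGT", "ACGT", 1, "C", "T", 1)
hom = apply_variant("ACGT", "ACGT", 1, "C", "T", 2)
```

In this sketch, `het` keeps haplotype 1 as the reference sequence and changes only haplotype 2, while `hom` changes both.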

Parameters:
  • df_seq (DataFrame) –

    A DataFrame containing the reference sequences for each (peak, sample) pair.

    Must include the following columns:

      - ‘peak’: unique peak identifier
      - ‘sample’: sample identifier
      - ‘start’: start genomic coordinate of the peak
      - ‘seq_1’: reference haplotype 1 sequence
      - ‘seq_2’: reference haplotype 2 sequence

  • df_variants (DataFrame) –

    A DataFrame containing variant information to apply.

    Must include the following columns:

      - ‘peak’: peak ID the variant overlaps
      - ‘sample’: sample ID
      - ‘pos’: genomic position of the variant (1-based)
      - ‘ref’: reference allele
      - ‘alt’: alternate allele
      - ‘genotype’: genotype code (1 for het, 2 for hom-alt)
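Because ‘pos’ is 1-based, it has to be converted to a string index relative to ‘start’ before a sequence can be edited. The documentation does not say whether ‘start’ is 0-based (BED-style) or 1-based, so the sketch below handles both cases through an explicit flag. The helper `variant_offset` and its `start_is_zero_based` parameter are hypothetical and only illustrate the arithmetic; check which convention your data uses.

```python
def variant_offset(pos, start, start_is_zero_based):
    """Return the 0-based string index of a 1-based variant position."""
    # 1-based genomic position of the first base in the stored sequence.
    first_base = start + 1 if start_is_zero_based else start
    return pos - first_base


# A peak whose sequence begins at 1-based genomic position 101,
# described once with a 0-based start and once with a 1-based start.
offset_bed = variant_offset(pos=103, start=100, start_is_zero_based=True)
offset_one = variant_offset(pos=103, start=101, start_is_zero_based=False)
```

Both calls refer to the same base, so they yield the same index into the sequence.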

Return type:

DataFrame

Returns:

A new DataFrame with the same structure as df_seq, but with updated seq_1 and seq_2 sequences based on the input variants.

Raises:

AssertionError – If the reference base in the sequence does not match the provided ‘ref’ allele at the variant position.

Notes

  • Assumes df_seq is uniquely indexed by (‘peak’, ‘sample’).

  • Assumes all positions in df_variants fall within the corresponding peak interval.

  • Index is temporarily set during processing and restored at the end.
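The uniqueness assumption can be verified before calling the function. With pandas, `df_seq.duplicated(subset=["peak", "sample"]).any()` expresses the check directly. The sketch below shows the same idea on plain Python records so that it runs without dependencies; the helper `duplicated_keys` and the example rows are hypothetical.

```python
from collections import Counter


def duplicated_keys(rows):
    """Return the (peak, sample) pairs that occur more than once, sorted."""
    counts = Counter((row["peak"], row["sample"]) for row in rows)
    return sorted(key for key, n in counts.items() if n > 1)


rows = [
    {"peak": "peak_1", "sample": "s1"},
    {"peak": "peak_1", "sample": "s2"},
    {"peak": "peak_1", "sample": "s1"},  # repeats the first pair
]
dupes = duplicated_keys(rows)
```

A non-empty result means the input violates the assumption and should be deduplicated first.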