cell2net.preprocessing.update_sequence_with_variants
- cell2net.preprocessing.update_sequence_with_variants(df_seq, df_variants)
Update reference DNA sequences with genomic variants based on genotype information.
This function updates the pair of haplotype sequences (seq_1 and seq_2) stored in df_seq, using the variant data in df_variants. For each variant, the reference allele is compared with the sequence at the variant position. A mismatch raises an AssertionError; on a match, the sequences are updated according to the genotype:
Genotype 1 (heterozygous): update seq_2 with the ALT allele; seq_1 is left unchanged
Genotype 2 (homozygous alt): update both seq_1 and seq_2 with the ALT allele
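The genotype rule above can be sketched for a single variant on plain strings. This is an illustration only, not the cell2net implementation: `apply_variant` is a hypothetical helper, and the offset arithmetic assumes that ‘start’ is 0-based while ‘pos’ is 1-based, which this page does not state for ‘start’.

```python
def apply_variant(seq_1, seq_2, start, pos, ref, alt, genotype):
    """Apply one variant to a pair of haplotype strings (hypothetical helper)."""
    # ASSUMPTION: `start` is 0-based and `pos` is 1-based, so the string
    # offset of the variant is pos - 1 - start.
    offset = pos - 1 - start

    def substitute(seq):
        # Mirror the documented behaviour: a REF mismatch is an error.
        observed = seq[offset:offset + len(ref)]
        assert observed == ref, f"expected {ref!r} at pos {pos}, found {observed!r}"
        return seq[:offset] + alt + seq[offset + len(ref):]

    if genotype == 1:      # heterozygous: only haplotype 2 carries ALT
        seq_2 = substitute(seq_2)
    elif genotype == 2:    # homozygous alt: both haplotypes carry ALT
        seq_1 = substitute(seq_1)
        seq_2 = substitute(seq_2)
    return seq_1, seq_2


ref_seq = "ACGTACGT"
het = apply_variant(ref_seq, ref_seq, start=100, pos=103, ref="G", alt="T", genotype=1)
hom = apply_variant(ref_seq, ref_seq, start=100, pos=103, ref="G", alt="T", genotype=2)
print(het)  # ('ACGTACGT', 'ACTTACGT')
print(hom)  # ('ACTTACGT', 'ACTTACGT')
```

If the real coordinates use a different convention, only the `offset` line would change.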
- Parameters:
df_seq (DataFrame) – A DataFrame containing the reference sequences for each (peak, sample) pair. Must include the following columns:
- ‘peak’: unique peak identifier
- ‘sample’: sample identifier
- ‘start’: start genomic coordinate of the peak
- ‘seq_1’: reference haplotype 1 sequence
- ‘seq_2’: reference haplotype 2 sequence
df_variants (DataFrame) – A DataFrame containing the variants to apply. Must include the following columns:
- ‘peak’: ID of the peak the variant overlaps
- ‘sample’: sample ID
- ‘pos’: genomic position of the variant (1-based)
- ‘ref’: reference allele
- ‘alt’: alternate allele
- ‘genotype’: genotype code (1 for het, 2 for hom-alt)
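As a rough illustration of the expected input layout, the sketch below builds two minimal DataFrames with the required columns and checks that none are missing. The values (`peak_1`, `s1`, the sequences) are made up, and `missing_columns` is a hypothetical helper, not part of cell2net.

```python
import pandas as pd

# Column names taken from the parameter descriptions above.
SEQ_COLUMNS = {"peak", "sample", "start", "seq_1", "seq_2"}
VARIANT_COLUMNS = {"peak", "sample", "pos", "ref", "alt", "genotype"}

# One row per (peak, sample); both haplotypes start as the reference.
df_seq = pd.DataFrame({
    "peak": ["peak_1", "peak_1"],
    "sample": ["s1", "s2"],
    "start": [100, 100],
    "seq_1": ["ACGTACGT", "ACGTACGT"],
    "seq_2": ["ACGTACGT", "ACGTACGT"],
})

# One row per variant to apply; genotype 1 = het, 2 = hom-alt.
df_variants = pd.DataFrame({
    "peak": ["peak_1"],
    "sample": ["s1"],
    "pos": [103],
    "ref": ["G"],
    "alt": ["T"],
    "genotype": [1],
})


def missing_columns(df, required):
    """Return the required column names absent from df (hypothetical helper)."""
    return sorted(required - set(df.columns))


print(missing_columns(df_seq, SEQ_COLUMNS))          # []
print(missing_columns(df_variants, VARIANT_COLUMNS)) # []
```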
- Return type:
DataFrame
- Returns:
A new DataFrame with the same structure as df_seq, but with updated seq_1 and seq_2 sequences based on the input variants.
- Raises:
AssertionError – If the reference base in the sequence does not match the provided ‘ref’ allele at the variant position.
Notes
Assumes df_seq is uniquely indexed by (‘peak’, ‘sample’).
Assumes all positions in df_variants fall within the corresponding peak interval.
Index is temporarily set during processing and restored at the end.
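Because the first two notes are assumptions rather than checks the function is documented to perform, it may be worth verifying them before calling it. The following is a minimal sketch with a hypothetical `check_assumptions` helper; it assumes a 0-based ‘start’ and derives the end of each peak from the length of seq_1, neither of which is confirmed by this page.

```python
import pandas as pd


def check_assumptions(df_seq, df_variants):
    """Raise AssertionError if the documented assumptions do not hold (hypothetical helper)."""
    # (peak, sample) must identify exactly one row of df_seq.
    assert not df_seq.duplicated(subset=["peak", "sample"]).any(), \
        "duplicate (peak, sample) in df_seq"

    # Attach the peak start and reference sequence to every variant.
    merged = df_variants.merge(
        df_seq[["peak", "sample", "start", "seq_1"]],
        on=["peak", "sample"],
        how="left",
    )
    assert merged["start"].notna().all(), "variant without a matching df_seq row"

    # ASSUMPTION: `start` is 0-based and `pos` is 1-based.
    offset = merged["pos"] - 1 - merged["start"]
    end = offset + merged["ref"].str.len()
    inside = (offset >= 0) & (end <= merged["seq_1"].str.len())
    assert inside.all(), "variant falls outside its peak interval"


df_seq = pd.DataFrame({
    "peak": ["peak_1"], "sample": ["s1"], "start": [100],
    "seq_1": ["ACGTACGT"], "seq_2": ["ACGTACGT"],
})
df_variants = pd.DataFrame({
    "peak": ["peak_1"], "sample": ["s1"], "pos": [103],
    "ref": ["G"], "alt": ["T"], "genotype": [1],
})

check_assumptions(df_seq, df_variants)  # passes silently for this toy input
```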