cell2net.preprocessing.dinucleotide_shuffle_one_hot

cell2net.preprocessing.dinucleotide_shuffle_one_hot(one_hot)

Shuffle a one-hot encoded DNA sequence while preserving its dinucleotide composition.

This function converts a one-hot encoded DNA sequence into its nucleotide representation, shuffles it while maintaining the same dinucleotide composition, and then converts the shuffled sequence back into one-hot encoding.
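The decode and re-encode steps around the shuffle can be sketched as follows. This is a minimal illustration, not the library's code: the helper names `one_hot_to_string` and `string_to_one_hot` are hypothetical, and the column order "ACGT" is assumed from the parameter description.

```python
import numpy as np

ALPHABET = "ACGT"  # assumed column order: A, C, G, T


def one_hot_to_string(one_hot):
    """Decode an (L, 4) one-hot array into a nucleotide string (hypothetical helper)."""
    indices = np.argmax(np.asarray(one_hot), axis=1)
    return "".join(ALPHABET[i] for i in indices)


def string_to_one_hot(seq):
    """Encode a nucleotide string as an (L, 4) float array (hypothetical helper)."""
    indices = [ALPHABET.index(base) for base in seq]
    # Row i of the identity matrix is the one-hot vector for ALPHABET[i].
    return np.eye(4)[indices]
```

Under these assumptions, `one_hot_to_string(string_to_one_hot(seq))` returns `seq` unchanged, so any change in the output comes from the shuffle step alone.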

Parameters:

one_hot (ndarray) – A 2D array of shape (L, 4), where L is the sequence length and each row one-hot encodes a single nucleotide. Each row should contain exactly one 1 and three 0s; the four columns correspond, in order, to the nucleotides “A”, “C”, “G”, and “T”.
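Because the function does not validate its input, callers may want to check it first. The following is a sketch of such a check; `is_valid_one_hot` is a hypothetical helper and not part of cell2net.

```python
import numpy as np


def is_valid_one_hot(one_hot):
    """Return True if `one_hot` has shape (L, 4) and each row has exactly one 1."""
    arr = np.asarray(one_hot)
    if arr.ndim != 2 or arr.shape[1] != 4:
        return False
    # Every entry must be 0 or 1, and every row must sum to 1.
    is_binary = np.isin(arr, (0, 1)).all()
    one_per_row = (arr.sum(axis=1) == 1).all()
    return bool(is_binary and one_per_row)
```

Rows such as `[0, 0, 0, 0]` (often used for "N") or `[0.25, 0.25, 0.25, 0.25]` fail this check and would need to be handled before shuffling.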

Return type:

ndarray

Returns:

A 2D array of shape (L, 4) representing the shuffled sequence in one-hot encoding. The dinucleotide composition of the original sequence is preserved.
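One way to verify that the dinucleotide composition is preserved is to count overlapping pairs in the input and in the output and compare the two. The sketch below assumes the column order "ACGT"; `dinucleotide_counts` is a hypothetical helper, not a cell2net function.

```python
from collections import Counter

import numpy as np

ALPHABET = "ACGT"  # assumed column order: A, C, G, T


def dinucleotide_counts(one_hot):
    """Count overlapping dinucleotides in an (L, 4) one-hot array."""
    indices = np.argmax(np.asarray(one_hot), axis=1)
    seq = "".join(ALPHABET[i] for i in indices)
    return Counter(seq[i:i + 2] for i in range(len(seq) - 1))
```

Two sequences have the same dinucleotide composition exactly when their counters compare equal, for example `dinucleotide_counts(original) == dinucleotide_counts(shuffled)`.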

Notes

  • The function assumes the input is a valid one-hot encoding. Behavior is undefined if any row does not contain exactly one 1 and three 0s.

  • Shuffling is performed on the nucleotide sequence derived from the one-hot input, and the shuffled sequence is converted back to one-hot encoding.

  • The function uses the dinucleotide_shuffle helper function to handle the shuffling of the nucleotide sequence.
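The internals of `dinucleotide_shuffle` are not described here. As an illustration only, the sketch below shows one common way to shuffle a string while keeping every dinucleotide count: treat each adjacent pair as an edge, randomize the order in which each character's followers are visited (keeping the last follower fixed so the walk cannot dead-end early), and then retrace the walk. The name `dinucleotide_shuffle_sketch` and its `rng` parameter are hypothetical, and this approach does not sample all valid shuffles uniformly.

```python
import random
from collections import Counter, defaultdict


def dinucleotide_shuffle_sketch(seq, rng=None):
    """Return a reordering of `seq` with the same dinucleotide counts."""
    if rng is None:
        rng = random.Random()
    if len(seq) < 2:
        return seq

    # For each character, record the characters that follow it, in order.
    followers = defaultdict(list)
    for current, following in zip(seq, seq[1:]):
        followers[current].append(following)

    # Shuffle each follower list except its final entry. Keeping the final
    # entry in place ensures the walk below uses every pair exactly once.
    shuffled = {}
    for current, following in followers.items():
        head = following[:-1]
        rng.shuffle(head)
        shuffled[current] = head + following[-1:]

    # Walk from the first character, consuming followers in their new order.
    used = Counter()
    out = [seq[0]]
    for _ in range(len(seq) - 1):
        current = out[-1]
        out.append(shuffled[current][used[current]])
        used[current] += 1
    return "".join(out)
```

With this approach the first and last characters of the sequence stay in place, and a sequence in which every dinucleotide occurs once and no character repeats (such as "ACGT") has no alternative ordering, so it is returned unchanged.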

Examples

>>> import numpy as np
>>> import cell2net as cn
>>> import random
>>> random.seed(42)
>>> one_hot_sequence = np.array([
...     [1, 0, 0, 0],  # A
...     [0, 1, 0, 0],  # C
...     [0, 0, 1, 0],  # G
...     [0, 0, 0, 1]   # T
... ])
>>> shuffled_one_hot = cn.pp.dinucleotide_shuffle_one_hot(one_hot_sequence)
>>> shuffled_one_hot  # "ACGT" is the only ordering with dinucleotides AC, CG, GT
array([[1., 0., 0., 0.],  # "A"
       [0., 1., 0., 0.],  # "C"
       [0., 0., 1., 0.],  # "G"
       [0., 0., 0., 1.]]) # "T"