RNA Splicing Python



我有一个基因序列–

"acguccgcaagagaagccuuaauauauucaaaaagcuacgccucagauuucgcgcucgagcccaaaacaacugguguacggguugaucacaucaaaugaagucgcuaaagucggugaucucacuauccuugucuucggcuuuugcucucucggcuaucaucuaagcaggcgaguuccauggugaccggaacgacggcuacuggaguccaugaucgcaagcgucgggcugggguaaaagaggcucagcucauaauaguccgccccaccaguacgggacucgauaggccccgucguugccguagaaacgcaauuuuccucagacccacuauacgcaccucgauuuagcaugguuccgggguugcgcuuugagaaucauacguaaggaucggaaccuaggaaugcaccacagaacuuugaaauacuagaacaaguugauugacaacggaguaucggcgccccacauuuaacgaauaauugcaggcgccagacgaugcuaggugcguccguaucaagauucgaggucgcuacuggcuucgcuugccgaucgagcucagaguuugugagaguuguuacuaauugcguggucgccuaauauccuugauacuacguggguguacuagacaucccggacagaaaaucucuuaaacgcuagaguucucuuggaagcgccugcacuucuugugaacauacgaugauagccacucuaagcccaacgcacuucgcuuggcccacauugcccccagagcuuauucaucgacaggcguuccacucuuggauucaucaguaaacuuuauuauacgugguaagcgugcuuauagcugucggaaucucacuuaggcggauugaagugagacagccugaaaguaaccguguacaggcgccgucaauguguuuugagugugcaccuacaaaaaguguuauuuaggcaggggagcuuuguaguuucuuuagaagagccgcgaaugaaccaacgguagacugcgagcgcguucaaccuaau"

我想剪接RNA并提取两个列表(外显子和内含子(。关键是RNA的内含子部分以gu开始,以ag结束。然而,如果ag出现在gu之前,则它是外显子的一部分,而不是内含子。

def splice(sequence):
introns = list()
exons = list()
while(sequence.count("gu")):
if "gu" not in sequence:
break
else:    
exons.append(sequence[:sequence.find("gu")])
sequence = sequence[sequence.find("gu"):]
if "ag" not in sequence:
break
else:
introns.append(sequence[:sequence.find("ag")+2])
sequence = sequence[sequence.find("ag")+2:]
return introns, exons

这就是我到目前为止所拥有的。它进展得很顺利,但问题开始于gu出现时的末尾,而剩余字符串中没有ag

输出:

Exons:
['ac',
'agaagccuuaauauauucaaaaagcuacgccucagauuucgcgcucgagcccaaaacaacug',
'ucgcuaaa',
'caggcga',
'uccaugaucgcaagc',
'aggcucagcucauaaua',
'uacgggacucgauaggcccc',
'aaacgcaauuuuccucagacccacuauacgcaccucgauuuagcaug',
'aaucauac',
'gaucggaaccuaggaaugcaccacagaacuuugaaauacuagaacaa',
'uaucggcgccccacauuuaacgaauaauugcaggcgccagacgaugcuag',
'auucgag',
'cucaga',
'a',
'acaucccggacagaaaaucucuuaaacgcuaga',
'cgccugcacuucuu',
'ccacucuaagcccaacgcacuucgcuuggcccacauugcccccagagcuuauucaucgacaggc',
'uaaacuuuauuauac',
'c',
'cu',
'gcggauugaa',
'acagccugaaa',
'gcgcc',
'u',
'u',
'gcaggggagcuuu',
'uuucuuuagaagagccgcgaaugaaccaacg',
'acugcgagcgc']
Introns:
['guccgcaag',
'guguacggguugaucacaucaaaugaag',
'gucggugaucucacuauccuugucuucggcuuuugcucucucggcuaucaucuaag',
'guuccauggugaccggaacgacggcuacuggag',
'gucgggcugggguaaaag',
'guccgccccaccag',
'gucguugccguag',
'guuccgggguugcgcuuugag',
'guaag',
'guugauugacaacggag',
'gugcguccguaucaag',
'gucgcuacuggcuucgcuugccgaucgag',
'guuugugag',
'guuguuacuaauugcguggucgccuaauauccuugauacuacguggguguacuag',
'guucucuuggaag',
'gugaacauacgaugauag',
'guuccacucuuggauucaucag',
'gugguaag',
'gugcuuauag',
'gucggaaucucacuuag',
'gugag',
'guaaccguguacag',
'gucaauguguuuugag',
'gugcaccuacaaaaag',
'guuauuuag',
'guag',
'guag']

我使用正则表达式修复了查询。

def splice(gene_Sequence): 
regex = r"gu(?:w{0,}?)ag" 
introns = re.findall(regex, gene_Sequence) 
for intron in introns: 
exon = gene_Sequence.replace(intron, "") 
return introns, exon

最新更新