我正在尝试以下代码从DNA序列中确定最长的ORF
from Bio import Seq
import regex as re
startP = re.compile('ATG')
nuc = input_seq.replace('n','')
longest = (0,)
for m in startP.finditer(nuc, overlapped=True):
if len(Seq.Seq(nuc)[m.start():].translate(to_stop=True)) > longest[0]:
pro = Seq.Seq(nuc)[m.start():].translate(to_stop=True)
longest = (len(pro),
m.start(),
str(pro),
nuc[m.start():m.start()+len(pro)*3+3])
但问题是它只在3个正向链中找到ORF,而不迎合反向DNA链。要获取给定序列的反向补码,我有这个脚本
# Swap characters
ReverseCompDNA = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'}
def reverse_complement(seq):
"""
reverse_complement function takes the single argument as sequence
return the reverse+complement ordered.
"""
# Swapping A with T and G with C vice versa. Reversing newly generated string
# https://www.geeksforgeeks.org/python-docstrings/
return ''.join([ReverseCompDNA[nuc] for nuc in seq])[::-1]
我希望我的代码在过滤功能后同时向前"常规输入和反向输入,并在比较6个orf的DNA序列中的所有orf后给我确切的输出。
您正在使用bioppython,因此不需要实现自己的reverse_complement
函数,请使用Seq
模块中的方法。
在bioppython烹饪书中有一个关于识别开放阅读框架的示例,我们可以根据它来找到最长的ORF:
from Bio import SeqIO
record = SeqIO.read("NC_005816.fna", "fasta")
table = 11 # This is a bacterial sequence, so using NCBI codon table 11. Adapt to your own needs
longest = 0
orf = None
strand = None
frame = None
for strand, nuc in [(+1, record.seq), (-1, record.seq.reverse_complement())]:
for frame in range(3):
length = 3 * ((len(record)-frame) // 3) #Multiple of three
for pro in nuc[frame:frame+length].translate(table).split("*"):
if len(pro) > longest:
longest = len(pro)
orf = pro
strand = strand
frame = frame
print(f"Longest ORF is {orf[:30]}...{orf[-3:]} with length {longest} at strand {strand}, frame {frame}.")
在NC_005816的情况下。如果打印:
Longest ORF is RGIFMSDTMVVNGSGGVPAFLFSGSTLSSY...LLK with length 361 at strand -1, frame 2.