从数据库中检索RNA UTR序列



我正在尝试从数据库中检索特定的RNA UTR序列。从数据库中,我得到了一个.dat文件,其中每个RNA UTR由这样的条目表示:

ID 5MMUR018955;SV 1;线性mrna;std;mus;54 bp。xxAC BR058092;xxDT 2009年7月1日(创建,创建)DT 2009年7月1日(Rel。9,最后更新,版本1)xxMus肌肉中性粒细胞颗粒蛋白(NGP)中的5'UTR,mRNA。xxRefseq博士;NM_008694;Utref博士;CR062409;Geneid博士;18054;xxOS Mus Musculus(鼠标鼠标)oc eukaryota;后生Chordata;craniata;椎骨;Euteleostomi;哺乳动物;oc eutheria;Euarchontoglires;毛刺;Rodentia;Sciulognathi;Muroidea;OC Muridae;穆里纳;mus;mus。xxut 5'utr;1外显子xxFH关键位置/预选赛FHft来源1..54ft/有机体=" mus musculus"ft/mol_type =" mRNA"ft/strain =" C57BL/6"ft/db_xref ="分类量:10090"ft 5'utr 1..54ft/source="refseq:: nm_008694:1..54"ft/gene =" ngp"ft/product ="嗜中性颗粒蛋白"ft/gene_synonynym =" bectenecin"ft/genome =" chr9:110322312-110322365: "xxSQ序列54 bp;19 a;9 C;14 g;12 t;0其他;     agtctcaata tcatctacat aaaaggggcc aagagtggta gtgtgtgtcaga gaca 54//

我有一个基因名称列表(存储在行FT /gene="Ngp"中),我想用来访问存储在行SQ Sequence 54 BP; 19 A; 9 C; 14 G; 12 T; 0 other; agtctcaata tcatctacat aaaaggggcc aagagtggta gtgtgtcaga gaca 54

中的序列

检索后,我想以fasta格式写入一个新文件,即

>Ngp
agtctcaatatcatctacataaaaggggccaagagtggtagtgtgtcagagaca"

有什么简单的方法可以在Python中做到这一点?我整天都在与之奋斗,还没有真正到处,并真的很感谢您的帮助。

这是embl格式:

ftp://ftp.ebi.ac.uk/pub/databases/embl/doc/usrman.txt

您可以使用Biopython seqio从此提取信息:

http://biopython.org/wiki/seqio

这是一个可靠的解决方案,搜索数据库文件以获取基因列表,以fasta格式打印结果,并列出找不到的基因。

请注意,数据库中相同基因名称的序列可能有多个记录,因此您可能需要其他过滤才能获得您希望获得的那些序列。

from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
data = "embl.dat"  #Path to EMBL database file
search = "gene_names.txt" #Path to file with search terms
#Load the search terms from file and strip linefeed characters
search_genes = open(search, 'r').read().splitlines() 
found_genes = []
#Search the EMBL database file
for record in SeqIO.parse(open(data, 'r'), 'embl'):
    UTR5 = [feature for feature in record.features if feature.type=="5'UTR"]
    for utr5feature in UTR5:
        for s in search_genes:
            genes = utr5feature.qualifiers['gene']
            if s in genes:
                found_genes.append(s)
                #Gene found. Print a modified copy of the record in the desired format
                print SeqRecord(record.seq, id="_".join(genes), name=record.name,
                                description=record.description).format('fasta')
#List any search terms that were not found in the database
for s in search_genes:
    if s not in found_genes:
        print s+" NOT FOUND IN DATABASE!"

i用生物thon来解析embl文件并提取信息

from Bio import SeqIO
input = "test.embl"  #change your input, here
#next if you had one sequence in the input file
seq = SeqIO.parse(open(input), "embl").next()
UTR5 = [feature for feature in seq.features if feature.type=="5'UTR"]
#you have only one 5'utr
genes = UTR5[0].qualifiers['gene']
#you get ['Ngp']
#Create SeqRecord
from Bio.SeqRecord import SeqRecord
#you may remove description, if not required
new_record = SeqRecord(seq.seq, id= "_".join(genes), 
             name=seq.name, description=seq.description)
print new_record.format("fasta")

您得到:

> NGP 5'UTR中性粒细胞颗粒蛋白(NGP),mRNA。agtctcaatatcatcatcataaaaaaaaggggccaagtggtggtgtgtgtcagagagaca

最新更新