从数据库中检索RNA UTR序列

我正在尝试从数据库中检索特定的RNA UTR序列。从数据库中，我得到了一个.dat文件，其中每个RNA UTR由这样的条目表示：

ID 5MMUR018955;SV 1;线性mrna;std;mus;54 bp。xxAC BR058092;xxDT 2009年7月1日（创建，创建）DT 2009年7月1日（Rel。9，最后更新，版本1）xxMus肌肉中性粒细胞颗粒蛋白（NGP）中的5'UTR，mRNA。xxRefseq博士；NM_008694;Utref博士；CR062409;Geneid博士；18054;xxOS Mus Musculus（鼠标鼠标）oc eukaryota;后生Chordata;craniata;椎骨；Euteleostomi;哺乳动物；oc eutheria;Euarchontoglires;毛刺；Rodentia;Sciulognathi;Muroidea;OC Muridae;穆里纳；mus;mus。xxut 5'utr;1外显子xxFH关键位置/预选赛FHft来源1..54ft/有机体=" mus musculus"ft/mol_type =" mRNA"ft/strain =" C57BL/6"ft/db_xref ="分类量：10090"ft 5'utr 1..54ft/source="refseq:: nm_008694:1..54"ft/gene =" ngp"ft/product ="嗜中性颗粒蛋白"ft/gene_synonynym =" bectenecin"ft/genome =" chr9：110322312-110322365： "xxSQ序列54 bp;19 a;9 C;14 g;12 t;0其他;     agtctcaata tcatctacat aaaaggggcc aagagtggta gtgtgtgtcaga gaca 54//

我有一个基因名称列表（存储在行FT /gene="Ngp"中），我想用来访问存储在行SQ Sequence 54 BP; 19 A; 9 C; 14 G; 12 T; 0 other; agtctcaata tcatctacat aaaaggggcc aagagtggta gtgtgtcaga gaca 54

中的序列

检索后，我想以fasta格式写入一个新文件，即

>Ngp
agtctcaatatcatctacataaaaggggccaagagtggtagtgtgtcagagaca"

有什么简单的方法可以在Python中做到这一点？我整天都在与之奋斗，还没有真正到处，并真的很感谢您的帮助。

这是embl格式：

ftp://ftp.ebi.ac.uk/pub/databases/embl/doc/usrman.txt

您可以使用Biopython seqio从此提取信息：

http://biopython.org/wiki/seqio

这是一个可靠的解决方案，搜索数据库文件以获取基因列表，以fasta格式打印结果，并列出找不到的基因。

请注意，数据库中相同基因名称的序列可能有多个记录，因此您可能需要其他过滤才能获得您希望获得的那些序列。

from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
data = "embl.dat"  #Path to EMBL database file
search = "gene_names.txt" #Path to file with search terms
#Load the search terms from file and strip linefeed characters
search_genes = open(search, 'r').read().splitlines() 
found_genes = []
#Search the EMBL database file
for record in SeqIO.parse(open(data, 'r'), 'embl'):
    UTR5 = [feature for feature in record.features if feature.type=="5'UTR"]
    for utr5feature in UTR5:
        for s in search_genes:
            genes = utr5feature.qualifiers['gene']
            if s in genes:
                found_genes.append(s)
                #Gene found. Print a modified copy of the record in the desired format
                print SeqRecord(record.seq, id="_".join(genes), name=record.name,
                                description=record.description).format('fasta')
#List any search terms that were not found in the database
for s in search_genes:
    if s not in found_genes:
        print s+" NOT FOUND IN DATABASE!"

i用生物thon来解析embl文件并提取信息

from Bio import SeqIO
input = "test.embl"  #change your input, here
#next if you had one sequence in the input file
seq = SeqIO.parse(open(input), "embl").next()
UTR5 = [feature for feature in seq.features if feature.type=="5'UTR"]
#you have only one 5'utr
genes = UTR5[0].qualifiers['gene']
#you get ['Ngp']
#Create SeqRecord
from Bio.SeqRecord import SeqRecord
#you may remove description, if not required
new_record = SeqRecord(seq.seq, id= "_".join(genes), 
             name=seq.name, description=seq.description)
print new_record.format("fasta")

您得到：

> NGP 5'UTR中性粒细胞颗粒蛋白（NGP），mRNA。agtctcaatatcatcatcataaaaaaaaggggccaagtggtggtgtgtgtcagagagaca

相关内容

最新更新

热门标签：