My code:
GFF = raw_input("Please enter gff3 file: ")
GFF = open(GFF, "r")
GFF = GFF.read()
new_dict = {}
for i in GFF:
    element = i.split()
    if (element[2] == "five_prime_UTR"):
        if element[7] in new_dict:
            new_dict[element[2]] += 1
        if element[3] in new_dict:
            new_dict[element[3]] += 1
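The line
element[2] == "five_prime_UTR"
fails with an "index out of range" error.
How can I create a dictionary that maps a gene ID (such as Zm00001d027231) to its five prime UTR region number (such as 50887)? I have been trying to do this by first matching the five prime UTR regions and then going from there.
Expected output:
new_dict = {Zm00001d027231: 50887}
The gff3 file is a gene annotation file. It looks like this: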
1 gramene exon 55222 55682 . - . Parent=transcript:Zm00001d027231_T003;Name=Zm00001d027231_T003.exon1;constitutive=0;ensembl_end_phase=0;ensembl_phase=-1;exon_id=Zm00001d027231_T003.exon1;rank=1
1 gramene five_prime_UTR 55549 55682 . - . Parent=transcript:Zm00001d027231_T003
1 gramene mRNA 50887 55668 . - . ID=transcript:Zm00001d027231_T004;Parent=gene:Zm00001d027231;biotype=protein_coding;transcript_id=Zm00001d027231_T004
1 gramene three_prime_UTR 50887 51120 . - . Parent=transcript:Zm00001d027231_T004
1 gramene exon 50887 51239 . - . Parent=transcript:Zm00001d027231_T004;Name=Zm00001d027231_T004.exon9;constitutive=0;ensembl_e
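The variable GFF holds the contents of the gff3 file as a single string.
Right now you are looping over those contents character by character: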
>>> for i in GFF:
...     print(i)
1
g
r
a
m
e
n
e
e
x
o
n
[and so on]
Instead, use
for i in GFF.splitlines():
to loop over the contents of the file line by line.
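To see why the character-by-character loop leads to the index error, here is a small sketch. The `sample` string is a shortened, made-up stand-in for what `GFF.read()` returns, not the real file contents:

```python
# Shortened stand-in for the text returned by GFF.read() (illustrative only).
sample = "1 gramene exon 55222 55682\n1 gramene five_prime_UTR 55549 55682\n"

# Iterating over a string yields single characters.
chars = list(sample)
first_char_fields = chars[0].split()   # a one-element list, so index 2 does not exist

# Iterating over splitlines() yields whole lines.
lines = sample.splitlines()
first_line_fields = lines[0].split()   # five fields here, so index 2 is the feature type

print(first_char_fields)
print(first_line_fields[2])
```

Splitting a single character can never produce three items, so `element[2]` is out of range on the very first iteration.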
You can also make the code clearer by giving names to the fields you are parsing, for example:
new_dict = {}
# https://m.ensembl.org/info/website/upload/gff3.html
gff3_fields = ['seqid',      # name of the chromosome or scaffold; chromosome names can be given with or without the 'chr' prefix. Important note: the seq ID must be one used within Ensembl, i.e. a standard chromosome name or an Ensembl identifier such as a scaffold ID, without any additional content such as species or assembly. See the example GFF output below.
               'source',     # name of the program that generated this feature, or the data source (database or project name)
               'type',       # type of feature. Must be a term or accession from the SOFA sequence ontology
               'start',      # Start position of the feature, with sequence numbering starting at 1.
               'end',        # End position of the feature, with sequence numbering starting at 1.
               'score',      # A floating point value.
               'strand',     # defined as + (forward) or - (reverse).
               'phase',      # One of '0', '1' or '2'. '0' indicates that the first base of the feature is the first base of a codon, '1' that the second base is the first base of a codon, and so on.
               'attributes'  # A semicolon-separated list of tag-value pairs, providing additional information about each feature. Some of these tags are predefined, e.g. ID, Name, Alias, Parent
               ]
for line in GFF.splitlines():
    # Pair each field name with the matching column of the line.
    feature = dict(zip(gff3_fields, line.split()))
    # .get() avoids a KeyError on blank lines and short '#' header lines.
    if feature.get('type') == 'three_prime_UTR':
        attributes = feature['attributes']
        # "Parent=transcript:Zm00001d027231_T004" -> "Zm00001d027231"
        geneid = attributes.split(':')[-1].split('_')[0]
        new_dict[geneid] = feature['start']
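The same idea can be wrapped in a small function so that the feature type is a parameter. This is only a sketch under a few assumptions: the function name `utr_starts` is made up for illustration, columns are assumed to be tab-separated (as the GFF3 format specifies), transcript IDs are assumed to look like `<geneid>_T###` as in the sample above, and when a gene has several matching features the last one seen wins:

```python
def utr_starts(gff_text, feature_type="five_prime_UTR"):
    """Map gene ID -> start coordinate of the last matching feature seen."""
    result = {}
    for line in gff_text.splitlines():
        # Skip blank lines and '#' comment/header lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        # A feature line has exactly nine columns; ignore anything else.
        if len(fields) != 9:
            continue
        if fields[2] != feature_type:
            continue
        # Turn "key=value;key=value" into a dict.
        attributes = dict(
            pair.split("=", 1) for pair in fields[8].split(";") if "=" in pair
        )
        parent = attributes.get("Parent", "")
        # "transcript:Zm00001d027231_T003" -> "Zm00001d027231"
        gene_id = parent.split(":")[-1].split("_")[0]
        if gene_id:
            result[gene_id] = int(fields[3])
    return result


# Two lines taken from the sample above, with tabs between columns.
sample = (
    "1\tgramene\tfive_prime_UTR\t55549\t55682\t.\t-\t.\tParent=transcript:Zm00001d027231_T003\n"
    "1\tgramene\tthree_prime_UTR\t50887\t51120\t.\t-\t.\tParent=transcript:Zm00001d027231_T004\n"
)
print(utr_starts(sample))                     # {'Zm00001d027231': 55549}
print(utr_starts(sample, "three_prime_UTR"))  # {'Zm00001d027231': 50887}
```

Note that 50887 in the expected output is the start of the three prime UTR in the sample, which is why the snippet above matches on three_prime_UTR; pass whichever feature type you actually need.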