I have a file that contains information like this:
****ALIGNMENT****
Sequence: gi|86755972|gb|ABD15130.1| cold acclimation protein COR413-PM1 [Chimonanthus praecox]
Length: 201
E-value: 2.66576e-82
KYLAMKTDQLAVANMIDSDINELKMATMRLINDASMLGHYGFGTHFLKWLACLAAIYLLILDRTNWRTNMLTSLL...
+YLAMKTD+ + +I +D+ E+ A +L+ DA+ LG G GT LKW+A AAIYLLILDRTNW+TNMLT+LL...
EYLAMKTDEWSAQQLIQTDLKEMGKAAKKLVYDATKLGSLGVGTSILKWVASFAAIYLLILDRTNWKTNMLTALL...
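Now I want to extract some of this information and use it as variables. I think I should use a regular expression for this, but I don't know how to do that given how much information is packed into the second line.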
I need hitsid, protein, organism, and evalue.
The corresponding data:
hitsid = 86755972
protein = cold acclimation protein COR413-PM1
organism = Chimonanthus praecox
evalue = 2.66576e-82
So when I ask for hitsid, I want Python to print "86755972".
Can anyone help me with this? Thanks!
Use a regular expression such as
^Sequence:[^|]*\|(?P<hitsid>[^|]*)\|\S*\s*(?P<protein>[^][]*?)\s*\[(?P<organism>[^][]*)][\s\S]*?\nE-value:\s*(?P<evalue>.*)
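To make the pattern easier to follow, here is a sketch of the same regex written with re.VERBOSE so that each piece carries a comment. The names ALIGNMENT_RE, sample, and fields are only illustrative, and the sample text is the header from the question.

```python
import re

# Same pattern as above, spread over several lines with re.VERBOSE.
ALIGNMENT_RE = re.compile(r"""
    ^Sequence:[^|]*\|          # "Sequence: gi" up to and including the first pipe
    (?P<hitsid>[^|]*)\|        # GI number between the first and second pipe
    \S*\s*                     # rest of the identifier (gb|ABD15130.1|) and the spaces after it
    (?P<protein>[^][]*?)\s*    # protein description: everything before the opening bracket
    \[(?P<organism>[^][]*)]    # organism name inside the square brackets
    [\s\S]*?\nE-value:\s*      # skip lazily ahead to the next E-value line
    (?P<evalue>.*)             # the rest of that line
""", re.MULTILINE | re.VERBOSE)

sample = (
    "Sequence: gi|86755972|gb|ABD15130.1| cold acclimation protein "
    "COR413-PM1 [Chimonanthus praecox]\n"
    "Length: 201\n"
    "E-value: 2.66576e-82\n"
)

fields = ALIGNMENT_RE.search(sample).groupdict()
print(fields["hitsid"])  # 86755972
```

Note that the literal pipes and the opening bracket must be escaped; an unescaped | would be read as alternation.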
Example Python code that collects the values into a list of dictionaries:
import re

# Named groups capture the four fields; re.MULTILINE lets ^ match at the start of each line.
p = re.compile(r'^Sequence:[^|]*\|(?P<hitsid>[^|]*)\|\S*\s*(?P<protein>[^][]*?)\s*\[(?P<organism>[^][]*)][\s\S]*?\nE-value:\s*(?P<evalue>.*)', re.MULTILINE)
s = "****ALIGNMENT****\nSequence: gi|86755972|gb|ABD15130.1| cold acclimation protein COR413-PM1 [Chimonanthus praecox]\nLength: 201\nE-value: 2.66576e-82\nKYLAMKTDQLAVANMIDSDINELKMATMRLINDASMLGHYGFGTHFLKWLACLAAIYLLILDRTNWRTNMLTSLL...\n+YLAMKTD+ + +I +D+ E+ A +L+ DA+ LG G GT LKW+A AAIYLLILDRTNW+TNMLT+LL...\nEYLAMKTDEWSAQQLIQTDLKEMGKAAKKLVYDATKLGSLGVGTSILKWVASFAAIYLLILDRTNWKTNMLTALL..."
# Build one dictionary per alignment block found in the text.
res = [m.groupdict() for m in p.finditer(s)]
for x in res:
    print(x['hitsid'])
    print(x['protein'])
    print(x['organism'])
    print(x['evalue'])
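Since the question mentions a file and looking values up by hitsid, the following is a minimal sketch of one way to wrap the pattern in a helper that indexes the records by hitsid. The function name parse_alignments, the file name blast_output.txt, and the conversion of the E-value to float are assumptions made for illustration, not part of the original answer.

```python
import re

PATTERN = re.compile(
    r'^Sequence:[^|]*\|(?P<hitsid>[^|]*)\|\S*\s*(?P<protein>[^][]*?)\s*'
    r'\[(?P<organism>[^][]*)][\s\S]*?\nE-value:\s*(?P<evalue>.*)',
    re.MULTILINE,
)

def parse_alignments(text):
    """Return a dict mapping each hitsid to its record (hypothetical helper)."""
    hits = {}
    for m in PATTERN.finditer(text):
        record = m.groupdict()
        # Assumption: a numeric E-value is more useful than the raw string.
        record['evalue'] = float(record['evalue'])
        hits[record['hitsid']] = record
    return hits

# Hypothetical file usage; "blast_output.txt" is a placeholder name:
# with open("blast_output.txt") as handle:
#     hits = parse_alignments(handle.read())

text = (
    "****ALIGNMENT****\n"
    "Sequence: gi|86755972|gb|ABD15130.1| cold acclimation protein "
    "COR413-PM1 [Chimonanthus praecox]\n"
    "Length: 201\n"
    "E-value: 2.66576e-82\n"
)
hits = parse_alignments(text)
print(hits['86755972']['organism'])  # Chimonanthus praecox
```

Keying the dictionary by hitsid means a later record with the same ID would overwrite an earlier one; keep the list of dictionaries instead if duplicates can occur.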