Filtering information from a txt file with a regular expression



I have a file containing information that looks like this:

****ALIGNMENT****
Sequence:  gi|86755972|gb|ABD15130.1| cold acclimation protein COR413-PM1 [Chimonanthus praecox]
Length:  201
E-value:  2.66576e-82
KYLAMKTDQLAVANMIDSDINELKMATMRLINDASMLGHYGFGTHFLKWLACLAAIYLLILDRTNWRTNMLTSLL...
+YLAMKTD+ +   +I +D+ E+  A  +L+ DA+ LG  G GT  LKW+A  AAIYLLILDRTNW+TNMLT+LL...
EYLAMKTDEWSAQQLIQTDLKEMGKAAKKLVYDATKLGSLGVGTSILKWVASFAAIYLLILDRTNWKTNMLTALL...

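Now I want to filter out some of this information and use it as variables. I think I should use a regular expression for this, but I don't know how to do that given how much information is packed into the second line.

I need hitsid, protein, organism, and evalue.

The corresponding data: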

hitsid = 86755972
protein = cold acclimation protein COR413-PM1
organism = Chimonanthus praecox
evalue = 2.66576e-82

So when I ask for hitsid, I would like Python to print "86755972".

Can anyone help me with this? Thanks!

Use a regular expression such as

^Sequence:[^|]*\|(?P<hitsid>[^|]*)\|\S*\s*(?P<protein>[^][]*?)\s*\[(?P<organism>[^][]*)][\s\S]*?\nE-value:\s*(?P<evalue>.*)

See the regex demo.
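To make the pattern easier to follow, here is a sketch of the same expression written in verbose mode, with one comment per piece. The names HIT_RE and sample are only illustrative, and the sample text is the header of the record from the question.

```python
import re

# The same pattern, split up with re.VERBOSE: unescaped whitespace in the
# pattern is ignored and '#' starts a comment (outside character classes).
HIT_RE = re.compile(r"""
    ^Sequence:            # line that starts with Sequence:
    [^|]* \|              # skip up to and including the first pipe
    (?P<hitsid>[^|]*) \|  # hitsid: everything up to the next pipe
    \S* \s*               # skip the rest of the id block and the spaces after it
    (?P<protein>[^][]*?)  # protein: shortest run of non-bracket characters
    \s* \[                # optional spaces, then the opening bracket
    (?P<organism>[^][]*)  # organism: the text inside the brackets
    ]                     # closing bracket
    [\s\S]*? \n           # lazily skip any lines in between (here: Length)
    E-value: \s*          # the E-value label
    (?P<evalue>.*)        # evalue: the rest of that line
""", re.MULTILINE | re.VERBOSE)

sample = (
    "****ALIGNMENT****\n"
    "Sequence:  gi|86755972|gb|ABD15130.1| cold acclimation protein COR413-PM1 [Chimonanthus praecox]\n"
    "Length:  201\n"
    "E-value:  2.66576e-82\n"
)

m = HIT_RE.search(sample)
hitsid = m.group("hitsid")      # plain variables, as asked in the question
protein = m.group("protein")
organism = m.group("organism")
evalue = m.group("evalue")
print(hitsid)  # 86755972
```

If only one record is expected, search plus group gives the plain variables directly; finditer is the better fit when the file holds several records.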

Example Python code that collects the values of every match into a list of dictionaries:

import re

# Named groups capture each field; re.MULTILINE lets ^ match at the start of every line.
p = re.compile(r'^Sequence:[^|]*\|(?P<hitsid>[^|]*)\|\S*\s*(?P<protein>[^][]*?)\s*\[(?P<organism>[^][]*)][\s\S]*?\nE-value:\s*(?P<evalue>.*)', re.MULTILINE)
s = "****ALIGNMENT****\nSequence:  gi|86755972|gb|ABD15130.1| cold acclimation protein COR413-PM1 [Chimonanthus praecox]\nLength:  201\nE-value:  2.66576e-82\nKYLAMKTDQLAVANMIDSDINELKMATMRLINDASMLGHYGFGTHFLKWLACLAAIYLLILDRTNWRTNMLTSLL...\n+YLAMKTD+ +   +I +D+ E+  A  +L+ DA+ LG  G GT  LKW+A  AAIYLLILDRTNW+TNMLT+LL...\nEYLAMKTDEWSAQQLIQTDLKEMGKAAKKLVYDATKLGSLGVGTSILKWVASFAAIYLLILDRTNWKTNMLTALL..."
# One dictionary per alignment record found in the text
res = [m.groupdict() for m in p.finditer(s)]
for x in res:
    print(x['hitsid'])
    print(x['protein'])
    print(x['organism'])
    print(x['evalue'])
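Since the data lives in a txt file, the same idea can be wrapped in small helper functions. The following is only a sketch: parse_hits and parse_hits_file are hypothetical helper names, alignments.txt is a placeholder file name, and converting evalue to a float is an assumption about how the value will be used later.

```python
import re
from pathlib import Path

HIT_RE = re.compile(
    r'^Sequence:[^|]*\|(?P<hitsid>[^|]*)\|\S*\s*(?P<protein>[^][]*?)\s*'
    r'\[(?P<organism>[^][]*)][\s\S]*?\nE-value:\s*(?P<evalue>.*)',
    re.MULTILINE,
)

def parse_hits(text):
    """Return one dict per alignment record found in text."""
    hits = []
    for m in HIT_RE.finditer(text):
        hit = m.groupdict()
        # Assumption: the E-value is wanted as a number, not as a string
        hit['evalue'] = float(hit['evalue'])
        hits.append(hit)
    return hits

def parse_hits_file(path):
    """Read the whole file at path and parse every record in it."""
    return parse_hits(Path(path).read_text())

# Demonstration on the record from the question
sample = (
    "****ALIGNMENT****\n"
    "Sequence:  gi|86755972|gb|ABD15130.1| cold acclimation protein COR413-PM1 [Chimonanthus praecox]\n"
    "Length:  201\n"
    "E-value:  2.66576e-82\n"
)
hits = parse_hits(sample)
print(hits[0]['hitsid'])  # 86755972

# With a real file (placeholder name):
# hits = parse_hits_file('alignments.txt')
```

Reading the whole file at once keeps the multi-line pattern simple; for very large files a line-by-line parser would use less memory.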
