Retrieving split values in an XML file



I have an XML file like this:

data = '''<dbReference type="PDB" id="6LVN">
<property type="method" value="X-ray"/>
<property type="resolution" value="2.47 A"/>
<property type="chains" value="A/B/C/D=1168-1203"/>
</dbReference>
<dbReference type="PDB" id="6LXT">
<property type="method" value="X-ray"/>
<property type="resolution" value="2.90 A"/>
<property type="chains" value="A/B/C/D/E/F=910-988, A/B/C/D/E/F=1162-1206"/>
</dbReference>
<dbReference type="PDB" id="6LXV">
<property type="method" value="X-ray"/>
<property type="resolution" value="4.90 A"/>
<property type="chains" value="A/B/C/=210-488, A/B/C/=510-688, A/B/C=800-960"/>
</dbReference>'''

I want to retrieve all the length values. My code:

from bs4 import BeautifulSoup
xml_file = BeautifulSoup(data, 'lxml')
pdbs_xml = xml_file.find_all('dbreference', {'type': 'PDB'})
if len(pdbs_xml) != 0:
    for item in pdbs_xml:
        if item.find('property'):
            id_ = item['id']
            chains = item.find('property', {'type': 'chains'})
            chains2 = chains['value']
            if chains2.find(",") != -1:
                count = chains2.count(',')
                if count >= 2:
                    chains = chains['value'].split('=')[count]
                    chains = chains.split(',')[0]
                    first_aa = chains.split('-')[0]
                    last_aa = chains.split('-')[1]
                    size_pdb = int(last_aa) - int(first_aa)
                else:
                    chains = chains['value'].split('=')[2]
                    first_aa = chains.split('-')[0]
                    last_aa = chains.split('-')[1]
                    size_pdb = int(last_aa) - int(first_aa)
            else:
                chains = chains['value'].split('=')[1]
                first_aa = chains.split('-')[0]
                last_aa = chains.split('-')[1]
                size_pdb = int(last_aa) - int(first_aa)

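As you can see, some of the values are split into several ranges. In theory, I could write a statement that anticipates every possibility and handles all the cases (I know my code doesn't fully do that yet), but there must be a better way to do this. Any suggestions are welcome.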

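Find all the @id and @value attributes that match the defined conditions in a single XPath, then process the @value entries, which sit at the even positions of the resulting list (the @id entries sit at the odd positions). The math is done by calling eval directly on the range in @value (for example 1168-1203) and multiplying by -1, because the result is negative. There is no need to split the range and swap its ends.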

The XPath part that gets @id:
//dbReference[@type="PDB" and property[@type="chains" and string-length(@value)>0]]/@id

The XPath part that gets @value:
//dbReference[@type="PDB" and property[@type="chains" and string-length(@value)>0]]/property[@type="chains"]/@value

from lxml import etree
# test.xml must wrap the dbReference records in a single root element
tree = etree.parse('test.xml')
steps = tree.xpath('//dbReference[@type="PDB" and property[@type="chains" and string-length(@value)>0]]/@id | //dbReference[@type="PDB" and property[@type="chains" and string-length(@value)>0]]/property[@type="chains"]/@value')
for i in range(len(steps)):
    # @value appears at even positions (odd 0-based indices), right after its @id
    if (i%2) != 0:
        items = steps[i].split(',')
        s=0
        for item in items:
            values = item.split('=')
            # eval('1168-1203') is negative, so flip the sign
            s+=eval(values[1])*(-1)

        print(steps[i-1],s)

Result:

6LVN 35
6LXT 122
6LXV 616
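
The eval step above works on this data, but eval executes whatever text it is given, so it may be worth parsing the numbers explicitly when the XML comes from an untrusted source. The following is a minimal sketch of that idea, not part of the original answer; the helper name chain_span_total is hypothetical.

```python
def chain_span_total(value):
    """Sum (end - start) over every 'chains=start-end' segment in value."""
    total = 0
    for segment in value.split(','):
        # Keep only the text after '=', e.g. '910-988'
        _, _, residues = segment.partition('=')
        start, _, end = residues.strip().partition('-')
        total += int(end) - int(start)
    return total


print(chain_span_total("A/B/C/D/E/F=910-988, A/B/C/D/E/F=1162-1206"))  # 122
```

In the loop above, `s += eval(values[1])*(-1)` could then be replaced by a single call on `steps[i]`.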

Here you go [note that the code below does not require any external library]

import xml.etree.ElementTree as ET
from collections import defaultdict
data = '''<r><dbReference type="PDB" id="6LVN">
<property type="method" value="X-ray"/>
<property type="resolution" value="2.47 A"/>
<property type="chains" value="A/B/C/D=1168-1203"/>
</dbReference>
<dbReference type="PDB" id="6LXT">
<property type="method" value="X-ray"/>
<property type="resolution" value="2.90 A"/>
<property type="chains" value="A/B/C/D/E/F=910-988, A/B/C/D/E/F=1162-1206"/>
</dbReference>
<dbReference type="PDB" id="6LXV">
<property type="method" value="X-ray"/>
<property type="resolution" value="4.90 A"/>
<property type="chains" value="A/B/C/=210-488, A/B/C/=510-688, A/B/C=800-960"/>
</dbReference></r>'''
sizes = defaultdict(int)
root = ET.fromstring(data)
for ref in root.findall('.//dbReference'):
    pdb = ref.attrib['id']
    chains = ref.find('property[@type="chains"]')
    value = chains.attrib['value']
    parts = value.split(',')
    for part in parts:
        left,right = part.split('=')
        _left,_right = right.split('-')
        sizes[pdb] += int(_right)- int(_left)
print(sizes)

Output:
defaultdict(<class 'int'>, {'6LVN': 35, '6LXT': 122, '6LXV': 616})
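
The loop above assumes every dbReference carries a chains property. If the real file also contains references of other types, or PDB entries without chains, ref.find returns None and the attribute lookup fails. A possible guard, sketched under that assumption: the function name pdb_sizes and the sample ids X00001 and 0XXX are made up for illustration.

```python
import xml.etree.ElementTree as ET


def pdb_sizes(xml_text):
    """Map each PDB id to the summed (end - start) of its chain ranges."""
    sizes = {}
    root = ET.fromstring(xml_text)
    # Restrict the search to PDB references only
    for ref in root.findall(".//dbReference[@type='PDB']"):
        chains = ref.find("property[@type='chains']")
        # Skip entries with no chains property or an empty value
        if chains is None or not chains.get('value'):
            continue
        total = 0
        for part in chains.get('value').split(','):
            first, last = part.split('=')[1].split('-')
            total += int(last) - int(first)
        sizes[ref.get('id')] = total
    return sizes


# Hypothetical sample: one usable PDB entry, one non-PDB entry,
# and one PDB entry without a chains property
sample = '''<r>
<dbReference type="PDB" id="6LVN">
<property type="chains" value="A/B/C/D=1168-1203"/>
</dbReference>
<dbReference type="EMBL" id="X00001"/>
<dbReference type="PDB" id="0XXX">
<property type="method" value="X-ray"/>
</dbReference>
</r>'''
print(pdb_sizes(sample))  # {'6LVN': 35}
```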

Use bs4 to find the values, a regex to extract the ranges, and then built-in functions to sum the difference of each range.

data = '''<dbReference type="PDB" id="6LVN">
<property type="method" value="X-ray"/>
<property type="resolution" value="2.47 A"/>
<property type="chains" value="A/B/C/D=1168-1203"/>
</dbReference>
<dbReference type="PDB" id="6LXT">
<property type="method" value="X-ray"/>
<property type="resolution" value="2.90 A"/>
<property type="chains" value="A/B/C/D/E/F=910-988, A/B/C/D/E/F=1162-1206"/>
</dbReference>
<dbReference type="PDB" id="6LXV">
<property type="method" value="X-ray"/>
<property type="resolution" value="4.90 A"/>
<property type="chains" value="A/B/C/=210-488, A/B/C/=510-688, A/B/C=800-960"/>
</dbReference>'''
from bs4 import BeautifulSoup
import re
xml_file = BeautifulSoup(data, 'lxml')
output = {}
for tag in xml_file.find_all(type="chains", value=True):
    interval = re.findall(r'([0-9]+-[0-9]+)', tag['value'])
    output[tag.parent['id']] = sum(map(lambda p: abs(int(p[1])-int(p[0])), (map(lambda p: p.split('-'), interval))))
print(output)

Output:
{'6LVN': 35, '6LXT': 122, '6LXV': 616}
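
The regex step can also capture the two ends of each range directly, which avoids the second split on '-'. This is a small variation on the approach above rather than the answerer's code; the names RANGE and span_total are hypothetical, and abs() is dropped on the assumption that the start of a range never exceeds its end.

```python
import re

# Two capture groups: start and end of each 'start-end' range
RANGE = re.compile(r'(\d+)-(\d+)')


def span_total(value):
    """Sum (end - start) over all ranges found in a chains value."""
    # findall returns a list of (start, end) string tuples
    return sum(int(end) - int(start) for start, end in RANGE.findall(value))


print(span_total("A/B/C/=210-488, A/B/C/=510-688, A/B/C=800-960"))  # 616
```

Inside the bs4 loop, the lambda expression could then become `output[tag.parent['id']] = span_total(tag['value'])`.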
