Extract an XML attribute value based on an element value



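I have a few cells in OpenRefine that contain some XML (from Nominatim). For each node, I want to extract the value of an attribute only when the value of an element in the same node equals a specific string ("Paris"). I'm using Jython to loop over the elements and return the desired attribute if the element value equals Paris. Here is the code: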

from xml.etree import ElementTree as ET
element = ET.fromstring(value).encode('utf8')
root = element.getroot()
resultsList = root.findall(".//place")
for result in resultsList:
    typerecord = result.find("city")
    if typerecord.text == "Paris":
        return result.attrib["lat"]

However, it doesn't seem to work, even though the code looks fine to me. I get the following error:

Error: Traceback (most recent call last):
File "<string>", line 3, in __temp_242115945__
File "/opt/openrefine/webapp/extensions/jython/module/MOD-INF/lib/jython-standalone-2.7.2.jar/Lib/xml/etree/ElementTree.py", line 1313, in XML
File "/opt/openrefine/webapp/extensions/jython/module/MOD-INF/lib/jython-standalone-2.7.2.jar/Lib/xml/etree/ElementTree.py", line 1653, in feed
File "/opt/openrefine/webapp/extensions/jython/module/MOD-INF/lib/jython-standalone-2.7.2.jar/Lib/xml/etree/ElementTree.py", line 1653, in feed
File "/opt/openrefine/webapp/extensions/jython/module/MOD-INF/lib/jython-standalone-2.7.2.jar/Lib/xml/parsers/expat.py", line 193, in Parse
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 115: ordinal not in range(128)

This seems to be a character encoding problem. I added .encode('utf8') to the script, but nothing changed.

Here is a sample of the XML:

<?xml version="1.0" encoding="UTF-8" ?>
<searchresults attribution="Data © OpenStreetMap contributors, ODbL 1.0. http://www.openstreetmap.org/copyright" exclude_place_ids="18482590,103398643,118557459,109798886" more_url="https://nominatim.openstreetmap.org/search/?street=11+rue+Girardon&amp;city=Paris&amp;country=France&amp;addressdetails=1&amp;extratags=1&amp;polygon_geojson=1&amp;exclude_place_ids=18482590%2C103398643%2C118557459%2C109798886&amp;format=xml" querystring="11 rue Girardon, Paris, France" timestamp="Tue, 25 Oct 22 09:32:26 +0000">
<place address_rank="30" boundingbox="43.6242386,43.6243386,1.4264894,1.4265894" class="place" display_name="11, Rue François Girardon, Minimes - Barrière de Paris, Toulouse Nord, Toulouse, Haute-Garonne, Occitanie, France métropolitaine, 31200, France" geojson="{&quot;type&quot;:&quot;Point&quot;,&quot;coordinates&quot;:[1.4265394,43.6242886]}" importance="0.5201" lat="43.6242886" lon="1.4265394" osm_id="2084506137" osm_type="node" place_id="18482590" place_rank="30" type="house">
<extratags/>
<house_number>11</house_number>
<road>Rue François Girardon</road>
<neighbourhood>Minimes - Barrière de Paris</neighbourhood>
<suburb>Toulouse Nord</suburb>
<city>Toulouse</city>
<municipality>Toulouse</municipality>
<county>Haute-Garonne</county>
<ISO3166-2-lvl6>FR-31</ISO3166-2-lvl6>
<state>Occitanie</state>
<ISO3166-2-lvl4>FR-OCC</ISO3166-2-lvl4>
<region>France métropolitaine</region>
<postcode>31200</postcode>
<country>France</country>
<country_code>fr</country_code>
</place>
<place address_rank="26" boundingbox="48.8872626,48.8876471,2.3372233,2.3374922" class="highway" display_name="Rue Girardon, Quartier des Grandes-Carrières, Paris 18e Arrondissement, Paris, Île-de-France, France métropolitaine, 75018, France" geojson="{&quot;type&quot;:&quot;LineString&quot;,&quot;coordinates&quot;:[[2.3372233,48.8872626],[2.3372534,48.8873072],[2.337453,48.8875915],[2.3374922,48.8876471]]}" importance="0.52" lat="48.8875915" lon="2.337453" osm_id="10662867" osm_type="way" place_id="103398643" place_rank="26" type="residential">
<extratags>
<tag key="lit" value="yes"/>
<tag key="surface" value="sett"/>
<tag key="maxspeed" value="30"/>
<tag key="sidewalk" value="both"/>
<tag key="smoothness" value="intermediate"/>
<tag key="cycleway:both" value="no"/>
<tag key="zone:maxspeed" value="FR:30"/>
<tag key="motor_vehicle:conditional" value="no @ (Su,PH 11:00-18:00)"/>
</extratags>
<road>Rue Girardon</road>
<city_block>Quartier des Grandes-Carrières</city_block>
<suburb>Paris 18e Arrondissement</suburb>
<city_district>Paris</city_district>
<city>Paris</city>
<ISO3166-2-lvl6>FR-75</ISO3166-2-lvl6>
<state>Île-de-France</state>
<ISO3166-2-lvl4>FR-IDF</ISO3166-2-lvl4>
<region>France métropolitaine</region>
<postcode>75018</postcode>
<country>France</country>
<country_code>fr</country_code>
</place>
<place address_rank="26" boundingbox="48.8885135,48.8886689,2.3380551,2.3381062" class="highway" display_name="Rue Girardon, Quartier des Grandes-Carrières, Paris 18e Arrondissement, Paris, Île-de-France, France métropolitaine, 75018, France" geojson="{&quot;type&quot;:&quot;LineString&quot;,&quot;coordinates&quot;:[[2.3381062,48.8885135],[2.3380648,48.8886091],[2.3380551,48.8886689]]}" importance="0.52" lat="48.8886091" lon="2.3380648" osm_id="23371363" osm_type="way" place_id="109798886" place_rank="26" type="pedestrian">
<extratags>
<tag key="lit" value="yes"/>
<tag key="surface" value="paving_stones"/>
<tag key="smoothness" value="good"/>
</extratags>
<road>Rue Girardon</road>
<city_block>Quartier des Grandes-Carrières</city_block>
<suburb>Paris 18e Arrondissement</suburb>
<city_district>Paris</city_district>
<city>Paris</city>
<ISO3166-2-lvl6>FR-75</ISO3166-2-lvl6>
<state>Île-de-France</state>
<ISO3166-2-lvl4>FR-IDF</ISO3166-2-lvl4>
<region>France métropolitaine</region>
<postcode>75018</postcode>
<country>France</country>
<country_code>fr</country_code>
</place>
</searchresults>

Given the sample and the code, the result I expect is:

48.8875915
48.8886091

Can anyone help with this, or suggest a GREL alternative?

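Personally, I find debugging non-trivial Python in OpenRefine's Jython preview very painful, while GREL's fluent style is easier to build up incrementally, so here is a GREL equivalent of your Python: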

forEach(value.parseXml().select('place'),p,if(p.select('city')[0].htmlText()=='Paris',p.htmlAttr('lat'),None)).join('|')

It returns 48.8875915|48.8886091 (you can't store an array in a cell).

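That said, your Python has two problems:

  • You need to encode the input string, not the value returned by fromstring(), i.e. ET.fromstring(value.encode('utf8')) rather than ET.fromstring(value).encode('utf8')
  • ElementTree.fromstring() returns the root element directly, so getroot() is not needed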

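The patched code is below, but note that it returns only the first value. It needs an additional modification to return all matches joined together in a single string.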

from xml.etree import ElementTree as ET
root = ET.fromstring(value.encode('utf8'))
resultsList = root.findall(".//place")
for result in resultsList:
    typerecord = result.find("city")
    if typerecord.text == "Paris":
        return result.attrib["lat"]
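
The additional modification mentioned above might look like the following sketch. It collects every matching lat value and joins them with '|', mirroring the GREL output. The function name matching_lats and its city and sep parameters are hypothetical names chosen for illustration. The check for a missing city element is an extra guard that is not in the original code.

```python
from xml.etree import ElementTree as ET

def matching_lats(xml_text, city="Paris", sep="|"):
    # Encode the text first so the parser receives UTF-8 bytes
    # (avoids the 'ascii' codec error on characters such as the copyright sign).
    root = ET.fromstring(xml_text.encode("utf8"))
    lats = []
    for place in root.findall(".//place"):
        city_el = place.find("city")
        # Skip places with no <city> child instead of failing on None.text
        if city_el is not None and city_el.text == city:
            lats.append(place.attrib["lat"])
    # Cells hold a single value, so return one joined string
    return sep.join(lats)
```

In an OpenRefine Jython expression, the definition would be followed by a final line such as return matching_lats(value). When nothing matches, the function returns an empty string.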
