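How would I modify the code below so that it picks out the source of any images found in the description elements, which contain HTML? At the moment it just gets the full text from inside the element, and I'm not sure how to modify it to get the source of any img tags.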
>>> from lxml import etree
>>> tree = etree.parse('temp.xml')
>>> for guide in tree.xpath('guide'):
...     '---', guide.xpath('id')[0].text
...     for pages in guide.xpath('.//pages'):
...         for page in pages:
...             '------', page.xpath('id')[0].text
...             for description in page.xpath('.//asset/description'):
...                 '---------', description.text
I also tried this at the end:
print(description.xpath("//img/@src"))
That gave me "None".
The XML structure is:
<guides>
  <guide>
    <id>guide 1</id>
    <group>
      <id></id>
      <type></type>
      <name></name>
    </group>
    <pages>
      <page>
        <id>page 1</id>
        <name></name>
        <description><p>Some text. <br /><img width="81" src="http://www.example.com/img.jpg" alt="wave" height="63" style="float: right;" /></p></description>
        <boxes>
          <box>
            <id></id>
            <name></name>
            <type></type>
            <map_id></map_id>
            <column></column>
            <position></position>
            <hidden></hidden>
            <created></created>
            <updated></updated>
            <assets>
              <asset>
                <id></id>
                <name></name>
                <type></type>
                <description><img src="https://www.example.com/image.jpg" alt="image" height="42" width="42"></description>
                <url/>
                <owner>
                  <id></id>
                  <email></email>
                  <first_name></first_name>
                  <last_name></last_name>
                </owner>
              </asset>
            </assets>
          </box>
        </boxes>
      </page>
    </pages>
  </guide>
</guides>
The content of the description element is HTML. There are several ways to parse it; one is the html module from lxml:
>>> description.text
'<img src="https://www.example.com/image.jpg" alt="image" height="42" width="42">'
>>> from lxml import html
>>> img = html.fromstring(description.text)
>>> img.attrib['src']
'https://www.example.com/image.jpg'
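The attrib['src'] lookup above works when the img tag is the root of the parsed fragment. When the image is nested, as in the page-level description in the question whose HTML starts with a p tag, an XPath query on the parsed fragment may be more convenient. Here is a minimal sketch, assuming the description text is an HTML string; the helper name image_sources is hypothetical and not part of lxml:

```python
from lxml import html


def image_sources(fragment_text):
    """Return the src of every <img> found in an HTML fragment string."""
    if not fragment_text or not fragment_text.strip():
        # Nothing to parse (the element had no text content).
        return []
    root = html.fromstring(fragment_text)
    # ".//img" would miss an <img> that is itself the root of the fragment,
    # so "descendant-or-self::img" is used to cover both cases.
    return [str(src) for src in root.xpath('descendant-or-self::img/@src')]


nested = ('<p>Some text. <br /><img width="81" '
          'src="http://www.example.com/img.jpg" alt="wave" /></p>')
single = ('<img src="https://www.example.com/image.jpg" '
          'alt="image" height="42" width="42">')

print(image_sources(nested))
print(image_sources(single))
print(image_sources(None))
```

Because the helper always returns a list, a description with several images or with none at all can be handled by the same loop.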
Edit, in response to a comment:
>>> from lxml import etree, html
>>> tree = etree.parse('temp.xml')
>>> for guide in tree.xpath('guide'):
...     '---', guide.xpath('id')[0].text
...     for pages in guide.xpath('.//pages'):
...         for page in pages:
...             '------', page.xpath('id')[0].text
...             for description in page.xpath('.//asset/description'):
...                 '---------', html.fromstring(description.text).attrib['src']
...
('---', 'guide 1')
('------', 'page 1')
('---------', 'https://www.example.com/image.jpg')
Edit: handling exceptions.
Replace
'---------', html.fromstring(description.text).attrib['src']
with
try:
    '---------', html.fromstring(description.text).attrib['src']
except KeyError:
    '--------- No image URL present'
Edit, in reply to the comment of November 9:
from lxml import etree, html

tree = etree.parse('guides.xml')
for guide in tree.xpath('guide'):
    print('---', guide.xpath('id')[0].text)
    for pages in guide.xpath('.//pages'):
        for page in pages:
            print('------', page.xpath('id')[0].text)
            for description in page.xpath('.//asset/description'):
                try:
                    print('---------', html.fromstring(description.text).attrib['src'])
                except TypeError:
                    # description.text is None: the element contains no HTML
                    print('--------- no src identifiable')
                except KeyError:
                    # the parsed element has no src attribute
                    print('--------- no src identifiable')
Output for an XML file in which the second guide element contains no HTML at all and the third contains HTML without a src attribute:
--- guide 1
------ page 1
--------- https://www.example.com/image.jpg
--- guide 2
------ page 1
--------- no src identifiable
--- guide 3
------ page 1
--------- no src identifiable
--- guide 4
------ page 1
--------- https://www.example.com/image.jpg
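As an alternative to catching TypeError and KeyError, the empty and image-free cases can be checked up front. The following is only a sketch under stated assumptions: SAMPLE is a hypothetical, trimmed-down document that mirrors the question's structure, with the HTML entity-escaped inside description, and collect_sources is an illustrative helper name.

```python
from lxml import etree, html

# Hypothetical sample: the HTML inside <description> is entity-escaped,
# so the XML parser sees it as plain text rather than child elements.
SAMPLE = b"""<guides>
  <guide>
    <id>guide 1</id>
    <pages>
      <page>
        <id>page 1</id>
        <boxes><box><assets>
          <asset><description>&lt;img src="https://www.example.com/image.jpg" alt="image"&gt;</description></asset>
          <asset><description></description></asset>
          <asset><description>&lt;p&gt;no image here&lt;/p&gt;</description></asset>
        </assets></box></boxes>
      </page>
    </pages>
  </guide>
</guides>"""


def collect_sources(root):
    """Map (guide id, page id) to the image URLs found in that page's assets."""
    found = {}
    for guide in root.xpath('guide'):
        guide_id = guide.findtext('id')
        for page in guide.xpath('pages/page'):
            urls = found.setdefault((guide_id, page.findtext('id')), [])
            for description in page.xpath('.//asset/description'):
                text = description.text
                if not text or not text.strip():
                    continue  # empty description: nothing to parse
                fragment = html.fromstring(text)
                # Covers an <img> at the fragment root as well as nested ones.
                urls.extend(
                    str(src)
                    for src in fragment.xpath('descendant-or-self::img/@src')
                )
    return found


result = collect_sources(etree.fromstring(SAMPLE))
for (guide_id, page_id), urls in result.items():
    print(guide_id, page_id, urls)
```

Returning a dictionary instead of printing inside the loop keeps the traversal separate from the reporting, so the "no src identifiable" message can be produced wherever an empty list turns up.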
You can try the following solution:
description.xpath("//img/@src")
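Whether this expression returns anything depends on how the HTML is stored in the XML. If it is entity-escaped or wrapped in CDATA, the parser sees only text and there are no img elements to match; if it is embedded as real child elements, the query finds them. A small sketch with two hypothetical one-asset documents illustrates the difference:

```python
from lxml import etree

# Hypothetical documents: the same image stored in two different ways.
escaped = etree.fromstring(
    '<asset><description>'
    '&lt;img src="https://www.example.com/image.jpg"&gt;'
    '</description></asset>'
)
embedded = etree.fromstring(
    '<asset><description>'
    '<img src="https://www.example.com/image.jpg"/>'
    '</description></asset>'
)

for name, doc in (('escaped', escaped), ('embedded', embedded)):
    description = doc.find('description')
    # "//img" searches the whole document from its root;
    # ".//img" limits the search to this description element.
    print(name, description.xpath('//img/@src'), description.xpath('.//img/@src'))
```

With the escaped form the result is an empty list, which matches the behaviour reported in the question, so in that case the text has to be parsed as HTML first.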