XPath for img src within an element

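How would I modify the code below so that it picks out the source of any images found in the description elements, which contain HTML? At the moment it just gets the full text from inside the element, and I am not sure how to modify it to get the sources of any img tags.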

>>> from lxml import etree
>>> tree = etree.parse('temp.xml')
>>> for guide in tree.xpath('guide'):
...     '---', guide.xpath('id')[0].text
...     for pages in guide.xpath('.//pages'):
...         for page in pages:
...             '------', page.xpath('id')[0].text
...             for description in page.xpath('.//asset/description'):
...                 '---------', description.text

I also tried this at the end:

print(description.xpath("//img/@src"))

This gave me "None".

The XML structure is:

<guides>
<guide>
    <id>guide 1</id>
    <group>
    <id></id> 
    <type></type>
    <name></name>
    </group>
    <pages>
        <page>
            <id>page 1</id>
            <name></name>
            <description>&lt;p&gt;Some text. &lt;br /&gt;&lt;img 
            width=&quot;81&quot; 
            src=&quot;http://www.example.com/img.jpg&quot; 
             alt=&quot;wave&quot; height=&quot;63&quot; style=&quot;float: 
              right;&quot; /&gt;&lt;/p&gt;</description>
            <boxes>
                <box>
                    <id></id>
                    <name></name>
                    <type></type>
                    <map_id></map_id>
                    <column></column>
                    <position></position>
                    <hidden></hidden>
                    <created></created>
                    <updated></updated>
                    <assets>
                        <asset>
                            <id></id>
                            <name></name>
                            <type></type>
                       <description>&lt;img src=&quot;https://www.example.com/image.jpg&quot; alt=&quot;image&quot; height=&quot;42&quot; width=&quot;42&quot;&gt;</description>
                            <url/>
                            <owner>
                                <id></id>
                                <email></email>
                                <first_name></first_name>
                                <last_name></last_name>
                            </owner>
                        </asset>
                    </assets>
                </box>
            </boxes>
        </page>
    </pages>
</guide>
</guides>

The content of the description element is HTML. There are several ways to parse it; one is the html module from lxml.

>>> description.text
'<img src="https://www.example.com/image.jpg" alt="image" height="42" width="42">'
>>> from lxml import html
>>> img = html.fromstring(description.text)
>>> img.attrib['src']
'https://www.example.com/image.jpg'
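
The snippet above reads src from the root of the parsed fragment, which works when the description holds a single img element. The page-level description in the sample XML wraps its image in a p element, so a lookup on the root alone would miss it. Below is a minimal sketch, assuming the goal is every img source in a fragment; the helper name img_sources is made up for illustration and is not part of lxml.

```python
from lxml import html

def img_sources(fragment):
    """Return the src of every img element in an HTML fragment (hypothetical helper)."""
    root = html.fromstring(fragment)
    # descendant-or-self covers both cases: the root itself is the img,
    # or the img is nested somewhere below the root.
    return root.xpath('descendant-or-self::img/@src')

print(img_sources('<img src="https://www.example.com/image.jpg" alt="image">'))
print(img_sources('<p>Some text. <br /><img width="81" src="http://www.example.com/img.jpg" /></p>'))
```

The function returns a list, so a fragment without any image simply yields an empty list instead of raising KeyError.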

Edit, in response to a comment:

>>> from lxml import etree, html
>>> tree = etree.parse('temp.xml')
>>> for guide in tree.xpath('guide'):
...     '---', guide.xpath('id')[0].text
...     for pages in guide.xpath('.//pages'):
...         for page in pages:
...             '------', page.xpath('id')[0].text
...             for description in page.xpath('.//asset/description'):
...                 '---------', html.fromstring(description.text).attrib['src']
... 
('---', 'guide 1')
('------', 'page 1')
('---------', 'https://www.example.com/image.jpg')

Edit: handling exceptions.

Replace

'---------', html.fromstring(description.text).attrib['src']

with

try:
    '---------', html.fromstring(description.text).attrib['src']
except KeyError:
    '--------- No image URL present'

Edit, in reply to the comment of November 9:

from lxml import etree, html
tree = etree.parse('guides.xml')
for guide in tree.xpath('guide'):
    print('---', guide.xpath('id')[0].text)
    for pages in guide.xpath('.//pages'):
        for page in pages:
            print('------', page.xpath('id')[0].text)
            for description in page.xpath('.//asset/description'):
                try:
                    print('---------', html.fromstring(description.text).attrib['src'])
                except TypeError:
                    print('--------- no src identifiable')
                except KeyError:
                    print('--------- no src identifiable')

Output for an XML file in which the second guide element contains no HTML at all and the third contains HTML without a src attribute:

--- guide 1
------ page 1
--------- https://www.example.com/image.jpg
--- guide 2
------ page 1
--------- no src identifiable
--- guide 3
------ page 1
--------- no src identifiable
--- guide 4
------ page 1
--------- https://www.example.com/image.jpg
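
The try/except version handles a description with no text (TypeError) and a parsed root without a src attribute (KeyError). As a variation, the sketch below checks for missing or blank text before parsing and collects every src it finds, so no exception handling is needed for those two cases. The function name description_sources and the inline sample document are assumptions made for illustration; the sample only mirrors the structure shown in the question.

```python
from lxml import etree, html

def description_sources(description):
    """Return all img src values inside one description element (hypothetical helper)."""
    text = description.text
    # An empty description element has text None; also skip blank text
    if text is None or not text.strip():
        return []
    return html.fromstring(text).xpath('descendant-or-self::img/@src')

# Hypothetical sample mirroring the structure in the question
sample = b"""<guides>
  <guide><id>guide 1</id><pages><page><id>page 1</id>
    <description>&lt;p&gt;Text &lt;img src=&quot;http://www.example.com/img.jpg&quot; /&gt;&lt;/p&gt;</description>
    <boxes><box><assets>
      <asset><description>&lt;img src=&quot;https://www.example.com/image.jpg&quot;&gt;</description></asset>
      <asset><description></description></asset>
      <asset><description>plain text only</description></asset>
    </assets></box></boxes>
  </page></pages></guide>
</guides>"""

tree = etree.fromstring(sample)
results = []
for page in tree.xpath('guide/pages/page'):
    # './/description' matches both the page-level and the asset-level elements
    for description in page.xpath('.//description'):
        results.append(description_sources(description))
        print(page.xpath('id')[0].text, results[-1] or 'no src identifiable')
```

Using './/description' instead of './/asset/description' is a deliberate choice here so that the page-level description is searched too; narrow the expression again if only asset descriptions matter.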

You can try the following solution:

description.xpath("//img/@src")
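
One thing to keep in mind with this expression: in the sample XML the markup inside description is escaped, so the XML parser sees it as text rather than as an img element, and the XPath has nothing to match. A small sketch of the difference, using a cut-down document that is assumed for illustration:

```python
from lxml import etree, html

# Cut-down, hypothetical document with the same escaped markup as the question
xml = ('<page><description>&lt;img src=&quot;https://www.example.com/image.jpg&quot;'
       '&gt;</description></page>')
description = etree.fromstring(xml).find('description')

# The escaped markup is plain text, so the XML tree contains no img element
direct = description.xpath('//img/@src')
print(direct)  # prints []

# Parsing the text as HTML first exposes the img element to XPath
parsed = html.fromstring(description.text).xpath('descendant-or-self::img/@src')
print(parsed)
```

If the description held real child elements instead of escaped text, the direct XPath would match, although a leading '//' searches the whole document rather than only the current description.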
