Best XPath practice for extracting data from a field with varying formats

I had been using Python 3.8, XPath, and Scrapy, and everything just seemed to work. I took my XPath expressions for granted.

Now I have to use Python 3.8, XPath, and lxml.html, and things are not as forgiving. For example, using this URL and XPath:

//dt[text()='Services/Products']/following-sibling::dd[1]

I get back either a paragraph or a list, depending on the inner HTML. This is how I am currently trying to extract the text:

from lxml import html

# response is the HTTP response object for the page being scraped
data = response.text
tree = html.fromstring(data)
Services_Product = tree.xpath("//dt[text()='Services/Products']/following-sibling::dd[1]")

This returns Services_Product[], which in this case holds "li" elements, but at other times the field can be any of the following:

<dd>some text</dd>
or
<dd><p>some text</p></dd>
or
<dd>
<ul>
<li>some text</li>
<li>some text</li>
</ul>
</dd>
or
<dd>
<ul>
<li><p>some text</p></li>
<li><p>some text</p></li>
</ul>
</dd>

When the target field can be any of these different shapes, what is the best practice for extracting the text?

I used this test code to see what my options are:

file = open('html_01.txt', 'r')
data = file.read()
tree = html.fromstring(data)
Services_Product = tree.xpath("//dt[text()='Services/Products']/following-sibling::dd[1]")
stuff = Services_Product[0].xpath("//li")
for elem in stuff:
    print(elem[0][0].text)

The output is: Health Health Doctor Health Doctor

That is not right. Here is a screenshot from Google Chrome: [XPath tool in Google Chrome and the HTML in question]

What is the best way to collect this data using Python and XPath, or some other option? Many thanks.
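
One thing worth checking in the test code above: in lxml, an XPath that starts with "//" is evaluated from the document root even when it is called on a sub-element, so Services_Product[0].xpath("//li") can pick up li elements from anywhere on the page. A minimal sketch of the difference, assuming a made-up page (the markup and the variable names doc, dd, everywhere, and inside_dd are illustrative only, not taken from the real site):

```python
from lxml import html

# Hypothetical markup: one <li> outside the target <dd>, two inside it.
doc = html.fromstring("""
<div>
  <ul><li>navigation link</li></ul>
  <dl>
    <dt>Services/Products</dt>
    <dd><ul><li>first</li><li>second</li></ul></dd>
  </dl>
</div>
""")

dd = doc.xpath("//dt[text()='Services/Products']/following-sibling::dd[1]")[0]

# "//li" is absolute: it searches the whole document, even when called on dd.
everywhere = [li.text for li in dd.xpath("//li")]

# ".//li" is relative: it searches only the descendants of dd.
inside_dd = [li.text for li in dd.xpath(".//li")]

print(everywhere)
print(inside_dd)
```

If the real page has list items outside the target dd, the leading dot is probably what keeps the results limited to that field.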

After spending hours searching on Google and then writing up the post above, it came to me. The old code:

Services_Product = tree.xpath("//dt[text()='Services/Products']/following-sibling::dd[1]")
stuff = Services_Product[0].xpath("//li")

and the new code, which returns a nice list of text:

Services_Product = tree.xpath("//dt[text()='Services/Products']/following-sibling::dd[1]")
stuff = Services_Product[0].xpath("//li/text()")

Adding "/text()" finally fixed it.
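
Note that "li/text()" returns only the text nodes sitting directly under each li, so it would likely come back empty for the <li><p>some text</p></li> variant, and it does not cover the two variants without a list at all. A minimal sketch of one way to handle all four shapes from the question, assuming lxml.html elements (which provide text_content()); the helpers extract_field and parse_dd and the sample markup are hypothetical, written only for this illustration:

```python
from lxml import html

def extract_field(dd):
    """Return a list of cleaned text strings from a <dd> element (hypothetical helper)."""
    items = dd.xpath(".//li")  # relative path: only <li> elements inside this <dd>
    # If there is a list, produce one string per <li>; otherwise treat the <dd> as one item.
    nodes = items if items else [dd]
    # text_content() gathers all descendant text; split/join collapses whitespace.
    texts = [" ".join(node.text_content().split()) for node in nodes]
    return [t for t in texts if t]

def parse_dd(inner):
    """Wrap a sample <dd> in a minimal <dl> and select it with the question's XPath."""
    page = html.fromstring("<dl><dt>Services/Products</dt>" + inner + "</dl>")
    return page.xpath("//dt[text()='Services/Products']/following-sibling::dd[1]")[0]

# The four shapes listed in the question.
samples = [
    "<dd>some text</dd>",
    "<dd><p>some text</p></dd>",
    "<dd><ul><li>some text</li><li>some text</li></ul></dd>",
    "<dd><ul><li><p>some text</p></li><li><p>some text</p></li></ul></dd>",
]

results = [extract_field(parse_dd(s)) for s in samples]
for r in results:
    print(r)
```

This always returns a list, so the calling code does not need to care which shape the field had. It assumes lists are not nested; with nested ul elements, ".//li" would also match the inner items and their text would appear twice.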
