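I am trying to extract text from sibling elements, when they are present, and concatenate it with the text in the parent node. How can I do this in XPath? The HTML shown below has a few instances of <sup> and <sub>.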
My expected output:
['2','1/2']
That is, the second item should be built like this: ['<sup>' + '/' + '<sub>']
<li data-ingredient="dry+white+wine">
<span class="qty">2 </span>
<span class="food">
"cups"
<a href="https://www.test.com">dry white wine</a>
</span>
</li>
<li data-ingredient="salt">
<span class="qty">
<sup>1</sup>
"⁄"
<sub>2</sub>
</span>
<span class="food"> teaspoon <a href="https://www.test.com">salt</a>
</span>
</li>
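I tried the following commands and consulted several pages of the Scrapy documentation, but I could not extract the information I need.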
response.xpath('//span[@class="qty"][sup and sub]/text()').extract()
response.xpath('//span[@class="qty"]//sub/text()').extract()
My idea would be to iterate over span.qty, extract all the text nodes under each one, and join them. Like this:
>>> txt = """<li data-ingredient="dry+white+wine">
... <span class="qty">2 </span>
... <span class="food">
... "cups"
... <a href="https://www.test.com">dry white wine</a>
... </span>
... </li>
... <li data-ingredient="salt">
... <span class="qty">
... <sup>1</sup>
... "⁄"
... <sub>2</sub>
... </span>
... <span class="food"> teaspoon <a href="https://www.test.com">salt</a>
... </span>
... </li>"""
>>> from scrapy import Selector
>>> sel = Selector(text=txt)
>>> for qty in sel.css('span.qty'):
...     print(''.join(i.replace('"', '').strip() for i in qty.css('::text').extract()))
...
2
1⁄2
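Note that the session above prints the fraction with the "⁄" character from the HTML, while the expected output in the question uses a plain "/". Here is a minimal sketch of a helper that joins the text nodes and normalises the slash. The name join_qty_fragments is hypothetical, and it assumes the character in the sample is U+2044 (FRACTION SLASH).

```python
# Assumption: the slash in the sample HTML is U+2044 (FRACTION SLASH)
FRACTION_SLASH = "\u2044"


def join_qty_fragments(fragments):
    """Join the text nodes of one span.qty into a single quantity string.

    Hypothetical helper, not part of Scrapy.
    """
    # Remove the literal double quotes and surrounding whitespace from each node
    cleaned = (f.replace('"', "").strip() for f in fragments)
    # Skip nodes that were whitespace only, then swap the fraction slash for "/"
    return "".join(c for c in cleaned if c).replace(FRACTION_SLASH, "/")


# Text nodes shaped like those under the two sample spans
print(join_qty_fragments(["2 "]))
print(join_qty_fragments(["\n", "1", '\n"\u2044"\n', "2", "\n"]))
```

Inside the loop from the session, this would be called as join_qty_fragments(qty.css('::text').extract()).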
Try BS4 for this kind of task:
from bs4 import BeautifulSoup
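# extract() returns a list, but BeautifulSoup expects a single string
html = response.xpath("//li[@data-ingredient='salt']/span[@class='qty']").extract_first()
# get_text() concatenates every text node under the span
text = BeautifulSoup(html, "html.parser").get_text()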