Concatenating text from nested sibling elements (if present) with text in the parent node



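I am trying to extract the text from sibling elements, when they are present, and concatenate it with the text in the parent node. How can I do this in XPath? In my HTML, shown below, only a few elements contain <sup> and <sub>.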

My expected output:

['2','1/2']

The second value should be concatenated like this: ['<sup>' + '/' + '<sub>']
<li data-ingredient="dry+white+wine">
 <span class="qty">2 </span>
 <span class="food">
     "cups"  
     <a href="https://www.test.com">dry white wine</a>
 </span>
</li>
<li data-ingredient="salt">
 <span class="qty">
     <sup>1</sup>
     "⁄"
     <sub>2</sub>
 </span>
 <span class="food"> teaspoon  <a href="https://www.test.com">salt</a>
 </span>
</li>

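I tried the following commands and consulted several pages of the Scrapy documentation, but I could not extract the required information.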

response.xpath('//span[@class="qty"][sup and sub]/text()').extract()
response.xpath('//span[@class="qty"]//sub/text()').extract()

My idea is to iterate over each span.qty, extract all the text nodes inside it, and join them. Like this:

>>> txt = """<li data-ingredient="dry+white+wine">
...  <span class="qty">2 </span>
...  <span class="food">
...      "cups"  
...      <a href="https://www.test.com">dry white wine</a>
...  </span>
... </li>
... <li data-ingredient="salt">
...  <span class="qty">
...      <sup>1</sup>
...      "⁄"
...      <sub>2</sub>
...  </span>
...  <span class="food"> teaspoon  <a href="https://www.test.com">salt</a>
...  </span>
... </li>"""
>>> from scrapy import Selector
>>> sel = Selector(text=txt)
>>> for qty in sel.css('span.qty'):
...     print(''.join([i.replace('"', '').strip() for i in qty.css('::text').extract()]))
... 
2
1⁄2

Try BeautifulSoup (bs4) for this kind of task:

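from bs4 import BeautifulSoup

# extract_first() returns a single HTML string; extract() returns a list, but BeautifulSoup expects a string
html = response.xpath("//li[@data-ingredient='salt']/span[@class='qty']").extract_first()
# get_text() joins every text node inside the span, including those in <sup> and <sub>
text = BeautifulSoup(html, "html.parser").get_text()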
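
For reference, a self-contained sketch of the BeautifulSoup approach that runs without a Scrapy response might look like the following. The inline `html` string and the `quantities` list are illustrative, and replacing the fraction slash (U+2044) with "/" is again an assumption made to match the expected output.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the HTML from the question; "\u2044" is the fraction slash.
html = """<li data-ingredient="dry+white+wine">
 <span class="qty">2 </span>
</li>
<li data-ingredient="salt">
 <span class="qty">
     <sup>1</sup>
     "\u2044"
     <sub>2</sub>
 </span>
</li>"""

soup = BeautifulSoup(html, "html.parser")
quantities = []
for span in soup.find_all("span", class_="qty"):
    # strip=True trims each text node and skips whitespace-only ones before joining
    text = span.get_text(strip=True)
    # Remove the literal double quotes and normalise the slash (assumption)
    quantities.append(text.replace('"', '').replace('\u2044', '/'))

print(quantities)
```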
