I am learning web scraping with Python, XPath, and Scrapy, and I am stuck on the following problem. I would be grateful for any help.
Here is the HTML:
<div class="discussionpost">
“This is paragraph one.”
<br>
<br>
“This is paragraph two."'
<br>
<br>
"This is paragraph three.”
</div>
This is the output I want: "This is paragraph one. This is paragraph two. This is paragraph three." I want to join all the paragraphs, which are separated by <br> tags. There are no <p> tags.
However, what I get instead is a list of separate strings: "This is paragraph one.", "This is paragraph two.", "This is paragraph three.".
This is the code I am using:
sentences = response.xpath('//div[@class="discussionpost"]/text()').extract()
I understand why the code above behaves this way, but I cannot work out how to change it to do what I need. Any help would be greatly appreciated.
To get all the text node values, you have to call //text() instead of /text():
sentences = ' '.join(response.xpath('//div[@class="discussionpost"]//text()').extract()).strip()
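The same idea can be tried without Scrapy: in the standard library, Element.itertext() walks every descendant text node of an element, which is analogous to what //text() does in XPath. A minimal sketch (the <br> tags are self-closed here so the snippet parses as XML; the sample strings are shortened):

```python
import xml.etree.ElementTree as ET

html_doc = (
    '<div class="discussionpost">'
    'Paragraph one.'
    '<br/><br/>'
    'Paragraph two.'
    '<br/><br/>'
    'Paragraph three.'
    '</div>'
)

div = ET.fromstring(html_doc)

# itertext() yields every text node in the subtree, like XPath //text();
# strip each piece and drop the whitespace-only ones before joining
sentences = ' '.join(t.strip() for t in div.itertext() if t.strip())
print(sentences)  # Paragraph one. Paragraph two. Paragraph three.
```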
Verified in the scrapy shell:
>>> from scrapy import Selector
>>> html_doc = '''
... <html>
... <body>
... <div class="discussionpost">
... “This is paragraph one.”
... <br/>
... <br/>
... “This is paragraph two."'
... <br/>
... <br/>
... "This is paragraph three.”
... </div>
... </body>
... </html>
...
... '''
>>> res = Selector(text=html_doc)
>>> res
<Selector xpath=None data='<html>\n <body>\n <div class="discussi...'>
>>> sentences = ''.join(res.xpath('//div[@class="discussionpost"]//text()').extract())
>>> sentences
'\n “This is paragraph one.”\n \n \n “This is paragraph two."\'\n \n \n "This is paragraph three.”\n '
>>> txt = sentences
>>> txt
'\n “This is paragraph one.”\n \n \n “This is paragraph two."\'\n \n \n "This is paragraph three.”\n '
>>> txt = sentences.replace('\n','').replace("'",'').replace('  ','').replace("“",'').replace('”','').replace('"','').strip()
>>> txt
'This is paragraph one. This is paragraph two. This is paragraph three.'
>>>
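The chain of replace() calls above is tailored to this particular page. A simpler general-purpose cleanup (an alternative, not part of the original answer) is to let str.split() with no arguments collapse every run of whitespace, newlines included, in one step:

```python
# sample of the joined text, with the newlines and indentation still in it
raw = '\n “This is paragraph one.”\n \n \n "This is paragraph two."\n '

# split() with no arguments splits on any run of whitespace, so joining
# the pieces back with single spaces normalizes the string in one pass
clean = ' '.join(raw.split())
print(clean)  # “This is paragraph one.” "This is paragraph two."
```

Note that the quote characters still have to be stripped separately if you do not want them.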
Update:
import scrapy


class TestSpider(scrapy.Spider):
    name = 'test'
    start_urls = ['https://www.ibsgroup.org/threads/hemorrhoids-as-cause-of-pain.363290/']

    def parse(self, response):
        for p in response.xpath('//*[@class="bbWrapper"]'):
            yield {
                'comment': ''.join(p.xpath(".//text()").getall()).strip()
            }
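The spider's join-all-descendant-text step can be exercised without running Scrapy at all. A minimal stand-in using only the standard library's html.parser (the sample HTML is illustrative, not taken from the real page):

```python
from html.parser import HTMLParser

# Collect all text inside elements whose class is "bbWrapper",
# mirroring ''.join(p.xpath(".//text()").getall()).strip()
class CommentExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.depth = 0     # >0 while inside a bbWrapper element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1
        elif dict(attrs).get('class') == 'bbWrapper':
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)

parser = CommentExtractor()
parser.feed('<div class="bbWrapper">Hemorrhoids <b>can</b> cause pain.</div>')
comment = ''.join(parser.parts).strip()
print(comment)  # Hemorrhoids can cause pain.
```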