Scraping data from a div as shown on the page



I am trying to scrape data from this URL: https://eksisozluk.com/mortingen-sitraze--1277239. I want to scrape the title first and then all the comments under the title. If you open the site, you will see that the first comment under the title is (bkz: mortingen). The problem is that (bkz: sits in a div, while inside that div mortingen sits in an anchor link, so it is hard to scrape the data exactly as it is displayed on the site. Can anyone help me with a CSS selector or XPath that scrapes all the comments as shown in the picture? My code is below, but it gives me the text in three separate columns instead of one: (bkz: in one column, then akhisar, then ).

def parse(self, response):
    data = {}
    #count = 0
    title = response.css('[itemprop="name"]::text').get()
    #data["Title"] = title
    count = 0
    data["title"] = title
    count = 0
    for content in response.css('li .content ::text'):
        text = content.get()
        text = text.strip()
        content = "content" + str(count)
        data[content] = text
        count = count + 1
    yield data

You should first get all .content without ::text and process every .content separately with a for-loop. For every .content you should then run ::text to get only the text nodes inside that content, put them on a list, and join the list into a single string.

for count, content in enumerate(response.css('li .content')):
    text = []
    # get all `::text` in current `.content`
    for item in content.css('::text'):
        item = item.get()#.strip()
        # put on list
        text.append(item)
    # join all items in single string
    text = "".join(text)
    text = text.strip()
    print(count, '|', text)
    data[f"content {count}"] = text

Minimal working code.

You can put all the code in a single file and run it with python script.py, without creating a Scrapy project.

import scrapy

class MySpider(scrapy.Spider):

    name = 'myspider'
    start_urls = ['https://eksisozluk.com/mortingen-sitraze--1277239']

    def parse(self, response):
        print('url:', response.url)
        data = {}  # PEP8: spaces around `=`

        title = response.css('[itemprop="name"]::text').get()
        data["title"] = title

        for count, content in enumerate(response.css('li .content')):
            text = []
            for item in content.css('::text'):
                item = item.get()#.strip()
                text.append(item)
            text = "".join(text)
            text = text.strip()
            print(count, '|', text)
            data[f"content {count}"] = text

        yield data

# --- run without project and save in `output.csv` ---
from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
    # save in file CSV, JSON or XML
    'FEEDS': {'output.csv': {'format': 'csv'}},  # new in 2.1
})
c.crawl(MySpider)
c.start()
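As the comment notes, the FEEDS dict was added in Scrapy 2.1. If you are on an older Scrapy release, the same export can, as far as I know, be configured with the older FEED_FORMAT and FEED_URI settings; a sketch:

# hedged sketch for Scrapy < 2.1, where the FEEDS dict is not available
c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
    'FEED_FORMAT': 'csv',      # older equivalent of {'format': 'csv'}
    'FEED_URI': 'output.csv',  # older equivalent of the 'output.csv' key
})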

EDIT:

getall() is a little shorter:

for count, content in enumerate(response.css('li .content')):
    text = content.css('::text').getall()
    text = "".join(text)
    text = text.strip()
    print(count, '|', text)
    data[f"content {count}"] = text
