Scrapy: crawling a page and its subpages, but only one item is scraped



I have a problem with my spider. I followed some tutorials to get a better understanding of scraping, and extended one of them to crawl subpages. The problem is that my spider scrapes only one element from the entry page instead of the 25 elements it should.

I don't know where it fails. Maybe one of you can help me:

from datetime import datetime as dt
import scrapy
from reddit.items import RedditItem
class PostSpider(scrapy.Spider):
    name = 'post'
    allowed_domains = ['reddit.com']
    def start_requests(self):
        reddit_urls = [
            ('datascience', 'week')
        ]
        for sub, period in reddit_urls:
            url = 'https://www.reddit.com/r/' + sub + '/top/?sort=top&t=' + period
            yield scrapy.Request(url=url, callback=self.parse)
    def parse(self, response):
        # get the subreddit from the URL
        sub = response.url.split('/')[4]
        # parse thru each of the posts
        for post in response.css('div.thing'):
            item = RedditItem()
            item['title'] = post.css('a.title::text').extract_first()
            item['commentsUrl'] = post.css('a.comments::attr(href)').extract_first()
            ### scrape the comments page.
            request = scrapy.Request(url=item['commentsUrl'], callback=self.parse_comments)
            request.meta['item'] = item
            return request

    def parse_comments(self, response):
        item = response.meta['item']
        item['commentsText'] = response.css('div.comment div.md p::text').extract()
        self.logger.info('Got successful response from {}'.format(response.url))
        yield item

Thanks for your help. BR

Thanks for the comments: indeed, I had to yield the request instead of returning it. It is working now.
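The difference comes down to generator semantics: `return` inside the `for post in response.css('div.thing')` loop exits `parse` after the first post, while `yield` hands each request back to Scrapy and keeps iterating. A minimal plain-Python sketch (no Scrapy; the `posts` list and both function names are made up for illustration) shows the same effect:

```python
def parse_with_return(posts):
    # Mimics the buggy parse(): returning inside the loop exits the
    # function on the first iteration, so only one post is processed.
    results = []
    for post in posts:
        results.append(post.upper())
        return results  # loop never reaches the second post

def parse_with_yield(posts):
    # Mimics the fixed parse(): a generator that yields one result per
    # post, so the caller (here list(); in Scrapy, the engine) gets all of them.
    for post in posts:
        yield post.upper()

posts = ['post1', 'post2', 'post3']
print(parse_with_return(posts))       # ['POST1']
print(list(parse_with_yield(posts)))  # ['POST1', 'POST2', 'POST3']
```

In the spider this means changing `return request` to `yield request` inside the loop, so Scrapy schedules one comments-page request per post.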
