Scraping some Facebook data, but not all? Scrapy/Splash/Python

I have a spider that looks like this:

import scrapy
from scrapy_splash import SplashRequest
class BarkbotSpider(scrapy.Spider):
    name = 'barkbot'
    start_urls = [
        'http://www.facebook.com/pg/TheBarkFL/events/?ref=page_internal/'
    ]
    custom_settings = {
        'FEED_URI': 'output/barkoutput.json'
    }
    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(
                url,
                self.parse,
            )
    def parse(self, response):
        for href in response.css("div#upcoming_events_card a::attr(href)").extract():
            yield response.follow(href, self.parse_concert)
    def parse_concert(self, response):
        concert = {
            "headliner" : response.xpath(
                "//h1[@id='seo_h1_tag']/text()"
            ).extract_first(),
            "venue" : "The Bark",
            "venue_address" : "507 All Saints St.",
            "venue_website" : "https://www.facebook.com/TheBarkFL",
            "date_time" : response.xpath(
                "//li[@id='event_time_info']//text()"
            ).extract(),
            "notes" : response.xpath(
                "//div[@data-testid='event-permalink-details']/span/text()"
            ).extract()
        }
        if concert['headliner']:
            yield concert

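I run the spider and it completes successfully, but the "notes" and "date_time" keys come back as empty lists for every item. I'm especially confused about notes, since it seems fairly straightforward, unless xpath can't use data-testid as an attribute. However, I am successfully scraping the headliner key, so I'm clearly connecting to each page.

I'm new to scraping JavaScript-generated content and to Splash, but I have managed to get another spider working successfully, just not on Facebook. What gives?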

"unless xpath can't use data-testid as an attribute"

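No, that's not it; I just checked with Scrapy 1.5.1, and your xpath matches a sample document just fine. It even matches the other data-testid attributes in that document, so I'm pretty sure you are running into a race condition: event-permalink-details is not present in the initial HTML; it is loaded by an XHR call to their GraphQL endpoint. In your case, because of Splash, that may be fine, but if your selector isn't matching, then the selector is running before the XHR has resolved. I don't know enough about Splash to help with that part.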
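To make the distinction concrete, here is a minimal sketch that does not use Scrapy at all; it swaps in the standard library's ElementTree, which supports a small XPath subset. Both markup snippets are made up for illustration (they are not Facebook's real HTML), and the helper names are hypothetical. It only shows that a data-testid predicate matches when the node exists and returns an empty list when the node has not been injected yet, while the headliner is found either way.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: what the page might look like before the XHR resolves.
BEFORE_XHR = "<html><body><h1 id='seo_h1_tag'>Some Band</h1></body></html>"

# Hypothetical markup: the same page after the details block has been injected.
AFTER_XHR = (
    "<html><body><h1 id='seo_h1_tag'>Some Band</h1>"
    "<div data-testid='event-permalink-details'><span>Doors at 8</span></div>"
    "</body></html>"
)

# Same predicate as in the question, written relative to the root element.
NOTES_XPATH = ".//div[@data-testid='event-permalink-details']/span"


def notes(markup):
    """Return the text of every span matched by the notes selector."""
    root = ET.fromstring(markup)
    return [span.text for span in root.findall(NOTES_XPATH)]


def headliner(markup):
    """Return the text of the h1 the question already scrapes successfully."""
    root = ET.fromstring(markup)
    return root.findtext(".//h1[@id='seo_h1_tag']")


print(headliner(BEFORE_XHR), notes(BEFORE_XHR))
print(headliner(AFTER_XHR), notes(AFTER_XHR))
```

If the real response behaves like the first snippet, an empty list says nothing about the selector itself; it only says the node was not in the HTML that the callback received.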

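I don't know the answer to your date_time problem, but I'd actually bet that what you really want is .xpath('//li[@id="event_time_info"]//@content'), since it contains 2019-01-03T17:30:00-08:00 to 2019-01-03T20:30:00-08:00, which seems like a much better string to work with than whatever the unqualified text() matches return.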
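As a rough sketch of why that attribute is nicer to work with: the value above splits cleanly into two ISO 8601 timestamps that the standard library can parse (Python 3.7+). The function name is hypothetical, and it assumes the attribute always has the "start to end" shape shown in the example; an event with only a start time would need separate handling.

```python
from datetime import datetime


def parse_event_range(content):
    """Split a 'START to END' string into two timezone-aware datetimes."""
    # Assumption: exactly one " to " separator, as in the example value.
    start_text, end_text = content.split(" to ")
    return datetime.fromisoformat(start_text), datetime.fromisoformat(end_text)


start, end = parse_event_range(
    "2019-01-03T17:30:00-08:00 to 2019-01-03T20:30:00-08:00"
)
duration = end - start
print(start.isoformat(), end.isoformat(), duration)
```

Inside parse_concert, the string would come from taking the first result of the xpath above rather than the full list.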
