Scraping the New York Times Word of the Day

I've only recently started using Scrapy, and I chose the New York Times Word of the Day as my first test: https://www.nytimes.com/column/learning-word-of-the-day

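I noticed that they have an API, but I don't think it offers anything I can use for my exact case. I basically want to go through each Word of the Day entry on that page and retrieve the word, its meaning, and the example paragraph.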

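This short piece of code should visit each URL and retrieve at least the word, but I'm getting a lot of errors and I don't know why. I've been using SelectorGadget to get the CSS selectors I need, and this is my code so far: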

import scrapy
class NewYorkSpider(scrapy.Spider):
    name = "times"
    start_urls = [ "https://www.nytimes.com/column/learning-word-of-the-day" ]
    # entry point for the spider
    def parse(self,response):
        for href in response.xpath('//*[contains(concat( " ", @class, " " ), concat( " ", "headline", " " ))]'):
            url = href.extract()
            yield scrapy.Request(url, callback=self.parse_item)
    def parse_item(self, response):
        word = response.xpath('//*[contains(concat( " ", @class, " " ), concat( " ", "story-subheading", " " ))]//strong').extract()[0]

Thanks a lot!

Updated errors (not exactly errors any more; the spider just isn't scraping the expected information):

2017-01-18 01:13:48 [scrapy] DEBUG: Filtered duplicate request: <GET https://www.nytimes.com/column/%3Ch2%20class=%22headline%22%20itemprop=%22headline%22%3E%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20Word%20+%20Quiz:%20spawn%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%3C/h2%3E> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2017-01-18 01:13:48 [scrapy] DEBUG: Crawled (404) <GET https://www.nytimes.com/column/%3Ch2%20class=%22headline%22%20itemprop=%22headline%22%3E%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20Word%20+%20Quiz:%20spawn%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%3C/h2%3E> (referer: https://www.nytimes.com/column/learning-word-of-the-day)
2017-01-18 01:13:48 [scrapy] DEBUG: Crawled (404) <GET https://www.nytimes.com/column/%3Ch2%20class=%22headline%22%20itemprop=%22headline%22%3E%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20Word%20+%20Quiz:%20introvert%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%3C/h2%3E> (referer: https://www.nytimes.com/column/learning-word-of-the-day)
2017-01-18 01:13:48 [scrapy] DEBUG: Crawled (404) <GET https://www.nytimes.com/column/%3Ch2%20class=%22headline%22%20itemprop=%22headline%22%3E%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20Word%20+%20Quiz:%20funereal%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%3C/h2%3E> (referer: https://www.nytimes.com/column/learning-word-of-the-day)

You are passing an XPath expression to the .css method, which expects CSS selector expressions.
Just replace .css with .xpath:

response.css('//*[contains(concat( " ", @class, " " ), concat( " ", "headline", " " ))]')
# to
response.xpath('//*[contains(concat( " ", @class, " " ), concat( " ", "headline", " " ))]')

As for your second error: the extracted URL is not an absolute URL, e.g. /some/sub/page.html. To convert it to an absolute URL, you can use the response.urljoin() method:

for href in response.xpath('...'):
    url = href.extract()
    # resolve the extracted value against the URL of the current page
    full_url = response.urljoin(url)
    yield scrapy.Request(full_url)
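
To make the effect of response.urljoin() concrete, here is a minimal sketch using the standard library's urllib.parse.urljoin, which resolves a link against a base URL in the same way. The relative paths are illustrative placeholders, not links taken from the real page.

```python
from urllib.parse import urljoin

# The page the spider starts from, used as the base for resolution.
base = "https://www.nytimes.com/column/learning-word-of-the-day"

# A root-relative path replaces the whole path of the base URL.
root_relative = urljoin(base, "/some/sub/page.html")

# A bare relative path replaces only the last segment of the base URL.
sibling = urljoin(base, "page.html")

# An absolute URL is returned unchanged (example.com is a placeholder).
absolute = urljoin(base, "https://example.com/other.html")

print(root_relative)  # https://www.nytimes.com/some/sub/page.html
print(sibling)        # https://www.nytimes.com/column/page.html
print(absolute)       # https://example.com/other.html
```

Since absolute URLs pass through untouched, it is safe to call response.urljoin() on every extracted link whether or not the site uses relative hrefs.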

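As for your third error: your XPath is the problem here. It looks like you used some XPath generator, and those rarely produce anything worthwhile. All you are looking for here is an <a> node with the class story-link: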

urls = response.xpath('//a[@class="story-link"]/@href').extract()
for url in urls:
    # join each extracted href with the page URL before requesting it
    yield scrapy.Request(response.urljoin(url))
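
The 404s in the log come from requesting the serialized <h2> markup instead of a link. The sketch below illustrates the difference using only the standard library's ElementTree as a stand-in for Scrapy's selectors; the markup and the href values are simplified, hypothetical examples, not the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real page is more complex.
SAMPLE = """
<div>
  <a class="story-link" href="/2017/01/17/learning/word-quiz-spawn.html">
    <h2 class="headline">Word + Quiz: spawn</h2>
  </a>
  <a class="story-link" href="/2017/01/16/learning/word-quiz-introvert.html">
    <h2 class="headline">Word + Quiz: introvert</h2>
  </a>
</div>
"""

root = ET.fromstring(SAMPLE)

# Serializing the headline element yields markup, which is not a usable URL.
headline_markup = [
    ET.tostring(h2, encoding="unicode").strip() for h2 in root.iter("h2")
]

# Reading the href attribute of each story link yields a path to request.
hrefs = [a.get("href") for a in root.findall(".//a[@class='story-link']")]

print(headline_markup[0])
print(hrefs)
```

In Scrapy the equivalent distinction is between calling extract() on an element selector and selecting the attribute itself with /@href.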

For your word XPath, you can simply take the text under the node:

word = response.xpath("//h4/strong/text()").extract_first()

This code should work. To get the other information you need from each word's page, just use the appropriate selectors with XPath or CSS expressions.

For more about selectors, I recommend this site, and of course Google.

import scrapy


class NewYorkSpider(scrapy.Spider):
    name = "times"
    start_urls = ["https://www.nytimes.com/column/learning-word-of-the-day"]

    # entry point for the spider
    def parse(self, response):
        # follow the link to each word's page
        for href in response.css('a[class="story-link"]::attr(href)'):
            # urljoin leaves absolute URLs untouched and resolves relative ones
            yield scrapy.Request(response.urljoin(href.extract()), callback=self.parse_item)

    def parse_item(self, response):
        heading = response.css('h4[class="story-subheading story-content"] strong::text').extract_first()
        # emit the word as an item so it shows up in the crawl output
        yield {"word": heading}