Scrapy: failing to download every link in an index as a complete HTML file

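I'm trying to visit every link in an index and save each corresponding page as HTML. I'm trying to combine LinkExtractor with saving the full page, merging two approaches: scraping web pages with Scrapy and saving the content as HTML files, and downloading a full page with Scrapy.

However, I'm getting an error that points to the definition of the parse_item function (line 17). I believe it is related to line 18 (?).

The parse function works fine when I use it on a single URL, but not when I try to incorporate it into the LinkExtractor.

My spider .py code is as follows: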

import scrapy
import urlparse
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class EasySpider(CrawlSpider):
    name = 'easy'
    allowed_domains = ['web']
    start_urls = ['http://www.example.com/index.html']
    rules = (
        Rule(LinkExtractor(restrict_xpaths='//*[@class="foobar"]//a/@href'), 
             callback='parse_item')
    )
def parse_item(self, response):
    filename = urlparse.urljoin(response.url, url)
    with open(filename, 'wb') as f:
        f.write(response.body)
    return

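Is this a syntax problem, or do I need to create/modify the items? I'm fairly sure I'm doing something wrong with the urlparse part, but none of the variations I've tried get past the error.

Any help would be much appreciated. Regards,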

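Your problem is that parse_item is defined outside the class rather than inside it, so it never becomes part of the spider. The version below also adds the trailing comma that makes rules a tuple, and drops /@href from restrict_xpaths, which should select the elements to search for links rather than an attribute.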

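import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class EasySpider(CrawlSpider):
    name = 'easy'
    # Must cover the domain used in start_urls, or extracted links are filtered as off-site.
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/index.html']
    rules = (
        # restrict_xpaths selects the elements to search for links (no @href);
        # the trailing comma makes rules a one-element tuple.
        Rule(LinkExtractor(restrict_xpaths='//*[@class="foobar"]//a'),
             callback='parse_item'),
    )

    # Indented inside the class, so parse_item is a method of the spider.
    def parse_item(self, response):
        # Note: with a fixed name, every crawled page overwrites the same file.
        filename = "index.html"
        with open(filename, 'wb') as f:
            f.write(response.body)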
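
On the urlparse part of the question: `urlparse.urljoin(response.url, url)` refers to a name `url` that is never defined, and in any case urljoin returns a URL, not a local file name. One possible way to get a distinct file per page is to derive the name from `response.url`. The following is only a sketch under assumptions: `url_to_filename` is a hypothetical helper (not part of Scrapy), it uses Python 3's `urllib.parse` rather than the Python 2 `urlparse` module, and the flat naming scheme is just one choice.

```python
import re
from urllib.parse import urlparse


def url_to_filename(url, default="index"):
    """Hypothetical helper: build a flat, filesystem-safe .html name from a URL."""
    parts = urlparse(url)
    # Combine host and path so pages in different directories do not collide.
    raw = parts.netloc + parts.path
    if parts.query:
        raw += "_" + parts.query
    # Collapse every run of characters other than letters, digits, dots,
    # and hyphens into a single underscore.
    safe = re.sub(r"[^A-Za-z0-9.-]+", "_", raw).strip("_") or default
    # Ensure the saved file carries an HTML extension.
    if not safe.endswith((".html", ".htm")):
        safe += ".html"
    return safe
```

Inside the spider, parse_item could then set `filename = url_to_filename(response.url)` before opening the file, so each crawled page is written to its own file instead of overwriting index.html.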
