Dynamic start_urls value



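I am new to Scrapy and Python. I have written a spider that works fine with the initialized start_urls value.

It also works fine if I put a literal in the code in __init__, as in

self.start_urls = 'http://something.com'

But when I read the value from a JSON file and create a list, I get the error shown below about a missing %20.

I feel like I am missing something obvious in either Scrapy or Python, since I am a newbie.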

class SiteFeedConstructor(CrawlSpider, FeedConstructor):
    name = "Data_Feed"
    start_urls = ['http://www.cnn.com/']

    def __init__(self, *args, **kwargs):
        FeedConstructor.__init__(self, **kwargs)
        kwargs = {}
        super(SiteFeedConstructor, self).__init__(*args, **kwargs)
        self.name = str(self.config_json.get('name', 'Missing value'))
        self.start_urls = str(self.config_json.get('start_urls', 'Missing value'))
        self.start_urls = self.start_urls.split(",")

Error:

Traceback (most recent call last):
  File "/usr/bin/scrapy", line 4, in <module>
    execute()
  File "/usr/lib/python2.7/dist-packages/scrapy/cmdline.py", line 132, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "/usr/lib/python2.7/dist-packages/scrapy/cmdline.py", line 97, in _run_print_help
    func(*a, **kw)
  File "/usr/lib/python2.7/dist-packages/scrapy/cmdline.py", line 139, in _run_command
    cmd.run(args, opts)
  File "/usr/lib/python2.7/dist-packages/scrapy/commands/runspider.py", line 64, in run
    self.crawler.crawl(spider)
  File "/usr/lib/python2.7/dist-packages/scrapy/crawler.py", line 42, in crawl
    requests = spider.start_requests()
  File "/usr/lib/python2.7/dist-packages/scrapy/spider.py", line 55, in start_requests
    reqs.extend(arg_to_iter(self.make_requests_from_url(url)))
  File "/usr/lib/python2.7/dist-packages/scrapy/spider.py", line 59, in make_requests_from_url
    return Request(url, dont_filter=True)
  File "/usr/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 26, in __init__
    self._set_url(url)
  File "/usr/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 61, in _set_url
    raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: Missing%20value
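
The last line of the traceback hints at the likely cause: the URL being requested is the fallback string 'Missing value' with its space percent-encoded, which suggests the 'start_urls' key was not found in config_json when __init__ ran. The following is a minimal sketch of that behavior using only the Python 3 standard library. The empty config dict is an assumption standing in for the real config_json, and urllib's quote() is used here as a stand-in for Scrapy's own URL escaping.

```python
from urllib.parse import quote, urlparse

# Assumption: a config in which the 'start_urls' key is absent.
config_json = {}

# Same expression as in the spider: the default is used when the key is missing.
start_urls = str(config_json.get('start_urls', 'Missing value')).split(",")
print(start_urls)

first = start_urls[0]
# Percent-encoding turns the space into %20.
print(quote(first))
# The string has no http:// or https:// prefix, so the scheme is empty.
print(urlparse(first).scheme == "")
```

If this matches what happens in the spider, the thing to check is whether config_json is actually populated, and contains a 'start_urls' entry, at the point where it is read.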

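Instead of defining __init__(), override the start_requests() method:

This is the method called by Scrapy when the spider is opened for scraping when no particular URLs are specified. If particular URLs are specified, make_requests_from_url() is used instead to create the requests. This method is also called only once from Scrapy, so it is safe to implement it as a generator.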

class SiteFeedConstructor(CrawlSpider, FeedConstructor):
    name = "Data_Feed"
    def start_requests(self):
        self.name = str(self.config_json.get('name', 'Missing value'))
        for url in str(self.config_json.get('start_urls', 'Missing value')).split(","):
            yield self.make_requests_from_url(url)
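
Whichever method reads the config, it may help to validate the URLs before handing them to Scrapy, so that a missing key fails early with a clear message instead of turning into a 'Missing value' request. The sketch below is one possible way to do that in Python 3 with the standard library only. The helper name normalize_start_urls is hypothetical, and it assumes the 'start_urls' entry is either a JSON array of strings or a single comma-separated string. Note that calling str() on a JSON array would produce text like "['http://...']", which does not split into usable URLs.

```python
from urllib.parse import urlparse


def normalize_start_urls(config):
    """Return a clean list of URLs from a config dict (hypothetical helper)."""
    raw = config.get("start_urls")
    if raw is None:
        # Fail early instead of passing a placeholder string to Scrapy.
        raise KeyError("config has no 'start_urls' entry")
    # Accept either a JSON array or a single comma-separated string.
    candidates = raw.split(",") if isinstance(raw, str) else list(raw)
    # Drop surrounding whitespace and empty entries.
    urls = [u.strip() for u in candidates if u.strip()]
    for url in urls:
        if urlparse(url).scheme not in ("http", "https"):
            raise ValueError("Missing scheme in start URL: %r" % url)
    return urls


print(normalize_start_urls({"start_urls": "http://www.cnn.com/, http://example.com/"}))
print(normalize_start_urls({"start_urls": ["http://www.cnn.com/"]}))
```

Inside start_requests(), the loop could then iterate over normalize_start_urls(self.config_json) rather than splitting the string inline.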
