Getting Scrapy crawler output/results into a function in a script file



I am running a spider in a Scrapy project from a script file, and the spider is logging the crawler output/results. However, I want to use the spider output/results in some function within that script file, without saving the output/results to any file or DB. Below is the script code, taken from https://doc.scrapy.org/en/latest/topics/practices.html#run-from-script:
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings
configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
runner = CrawlerRunner(get_project_settings())

d = runner.crawl('my_spider')
d.addBoth(lambda _: reactor.stop())
reactor.run()
def spider_output(output):
    # do something with that output
    pass

How can I get the spider output/results inside the 'spider_output' function? Is it possible to get the output/results?

Here is a solution that collects all output/results in a list:

from scrapy import signals
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from scrapy.signalmanager import dispatcher

def spider_results():
    results = []
    def crawler_results(signal, sender, item, response, spider):
        results.append(item)
    dispatcher.connect(crawler_results, signal=signals.item_scraped)
    process = CrawlerProcess(get_project_settings())
    process.crawl(MySpider)  # MySpider is your spider class (imported from your project)
    process.start()  # the script will block here until the crawling is finished
    return results

if __name__ == '__main__':
    print(spider_results())
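
If you prefer to stay with the CrawlerRunner approach from the question, a similar sketch (the spider name 'my_spider' comes from the question; the results list and handler are illustrative) connects the handler on the crawler object instead of the global dispatcher:

from twisted.internet import reactor
from scrapy import signals
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings

def spider_output(output):
    # do something with the collected items
    print(output)

results = []

def crawler_results(item, response, spider):
    results.append(item)

runner = CrawlerRunner(get_project_settings())
crawler = runner.create_crawler('my_spider')  # spider name from the question
# item_scraped fires for every item that makes it through the pipelines
crawler.signals.connect(crawler_results, signal=signals.item_scraped)

d = runner.crawl(crawler)
d.addBoth(lambda _: reactor.stop())
reactor.run()  # blocks here until the crawl is finished

spider_output(results)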

This is an old question, but for future reference: if you are using Python 3.6+, I recommend scrapyscript, which lets you run your spiders and get the results in a super simple way:

from scrapyscript import Job, Processor
from scrapy.spiders import Spider
from scrapy import Request
import json
# Define a Scrapy Spider, which can accept *args or **kwargs
# https://doc.scrapy.org/en/latest/topics/spiders.html#spider-arguments
class PythonSpider(Spider):
    name = 'myspider'
    def start_requests(self):
        yield Request(self.url)
    def parse(self, response):
        title = response.xpath('//title/text()').extract()
        return {'url': response.request.url, 'title': title}
# Create jobs for each instance. *args and **kwargs supplied here will
# be passed to the spider constructor at runtime
githubJob = Job(PythonSpider, url='http://www.github.com')
pythonJob = Job(PythonSpider, url='http://www.python.org')
# Create a Processor, optionally passing in a Scrapy Settings object.
processor = Processor(settings=None)
# Start the reactor, and block until all spiders complete.
data = processor.run([githubJob, pythonJob])
# Print the consolidated results
print(json.dumps(data, indent=4))
[
    {
        "title": [
            "Welcome to Python.org"
        ],
        "url": "https://www.python.org/"
    },
    {
        "title": [
            "The world's leading software development platform u00b7 GitHub",
            "1clr-code-hosting"
        ],
        "url": "https://github.com/"
    }
]

I'm afraid there is no way to do this, because crawl():

Returns a deferred that is fired when the crawling is finished.

The crawler does not store the results anywhere; it outputs them to the logger.

However, returning the output would conflict with the whole asynchronous nature and structure of Scrapy, so the preferred approach here is to save to a file and then read it back.
You can simply design a pipeline that saves the items to a file, and then simply read that file in spider_output. You will receive your results, because reactor.run() blocks your script until the output file is complete.
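
A minimal sketch of that idea, assuming a JSON-lines file named items.jl and that the pipeline is enabled via ITEM_PIPELINES in your project settings (both the file name and the pipeline name are illustrative):

import json

class JsonLinesWriterPipeline:
    """Writes each scraped item as one JSON line to items.jl."""

    def open_spider(self, spider):
        self.file = open('items.jl', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item)) + '\n')
        return item

    def close_spider(self, spider):
        self.file.close()

# Back in the script, after reactor.run() has returned:
def spider_output():
    with open('items.jl', encoding='utf-8') as f:
        return [json.loads(line) for line in f]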

My suggestion is to use Python's subprocess module to run the spider from your script, rather than the method given in the Scrapy docs for running a spider from a Python script. The reason is that with the subprocess module you can capture the output/logs, and even the print statements from inside the spider.

In Python 3, execute the spider with the run method. E.g.:

import subprocess
# 'command' is the scrapy CLI invocation; see the examples below
process = subprocess.run(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
if process.returncode == 0:
    result = process.stdout.decode('utf-8')
else:
    error = process.stderr.decode('utf-8')  # inspect the error output here

Setting stdout/stderr to subprocess.PIPE allows the output to be captured, so it is important to set these flags. Here command should be a sequence or a string (if it is a string, call the run method with one more argument: shell=True). For example:

command = ['scrapy', 'crawl', 'website', '-a', 'customArg=blahblah']
# or
command = 'scrapy crawl website -a customArg=blahblah' # with shell=True
# or
import shlex
command = shlex.split('scrapy crawl website -a customArg=blahblah') # without shell=True

Also, process.stdout will contain the output of the script, but its type will be bytes. You need to convert it to str using decode('utf-8').
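
For example, if the spider prints one JSON object per item (say print(json.dumps(dict(item))) in its parse method; that convention is an assumption of this sketch, not something Scrapy does for you), the captured bytes can be decoded and parsed back into Python objects. Scrapy's own log output normally goes to stderr, so stdout mostly contains what the spider prints:

import json
import shlex
import subprocess

command = shlex.split('scrapy crawl website -a customArg=blahblah')
process = subprocess.run(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE)

items = []
if process.returncode == 0:
    for line in process.stdout.decode('utf-8').splitlines():
        line = line.strip()
        if line.startswith('{'):  # keep only the JSON lines printed by the spider
            items.append(json.loads(line))
else:
    print(process.stderr.decode('utf-8'))

print(items)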

This will return all the results of a spider in a list:

from scrapyscript import Job, Processor
from scrapy.utils.project import get_project_settings

def get_spider_output(spider, **kwargs):
    job = Job(spider, **kwargs)
    processor = Processor(settings=get_project_settings())
    return processor.run([job])
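
A usage sketch, assuming MySpider is your spider class and that it accepts a url keyword argument (both the import path and the argument are hypothetical):

from myproject.spiders import MySpider  # hypothetical import path

results = get_spider_output(MySpider, url='http://www.example.com')
print(results)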
