I am using celery (and django-celery) to allow users to start periodic scrapes through the django admin. This is part of a larger project, but I have boiled the issue down to a minimal example.

Firstly, celery/celerybeat are running as daemons. If I instead run them from my django project directory with

celery -A evofrontend worker -B -l info

then, strangely, I get no issues at all. When I run celery/celerybeat as daemons, however, I get a strange import error:
[2016-01-06 03:05:12,292: ERROR/MainProcess] Task evosched.tasks.scrapingTask[e18450ad-4dc3-47a0-b03d-4381a0e65c31] raised unexpected: ImportError('No module named myutils',)
Traceback (most recent call last):
  File "/home/lee/Desktop/pyco/evo-scraping-min/venv/local/lib/python2.7/site-packages/celery/app/trace.py", line 240, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/home/lee/Desktop/pyco/evo-scraping-min/venv/local/lib/python2.7/site-packages/celery/app/trace.py", line 438, in __protected_call__
    return self.run(*args, **kwargs)
  File "evosched/tasks.py", line 35, in scrapingTask
    cs = CrawlerScript('TestSpider', scrapy_settings)
  File "evosched/tasks.py", line 13, in __init__
    self.crawler = CrawlerProcess(scrapy_settings)
  File "/home/lee/Desktop/pyco/evo-scraping-min/venv/local/lib/python2.7/site-packages/scrapy/crawler.py", line 209, in __init__
    super(CrawlerProcess, self).__init__(settings)
  File "/home/lee/Desktop/pyco/evo-scraping-min/venv/local/lib/python2.7/site-packages/scrapy/crawler.py", line 115, in __init__
    self.spider_loader = _get_spider_loader(settings)
  File "/home/lee/Desktop/pyco/evo-scraping-min/venv/local/lib/python2.7/site-packages/scrapy/crawler.py", line 296, in _get_spider_loader
    return loader_cls.from_settings(settings.frozencopy())
  File "/home/lee/Desktop/pyco/evo-scraping-min/venv/local/lib/python2.7/site-packages/scrapy/spiderloader.py", line 30, in from_settings
    return cls(settings)
  File "/home/lee/Desktop/pyco/evo-scraping-min/venv/local/lib/python2.7/site-packages/scrapy/spiderloader.py", line 21, in __init__
    for module in walk_modules(name):
  File "/home/lee/Desktop/pyco/evo-scraping-min/venv/local/lib/python2.7/site-packages/scrapy/utils/misc.py", line 71, in walk_modules
    submod = import_module(fullpath)
  File "/usr/lib/python2.7/importlib/__init__.py", line 37, in import_module
    __import__(name)
  File "retail/spiders/Retail_spider.py", line 16, in <module>
ImportError: No module named myutils
That is, the spider runs into trouble importing from the Django project app, despite the relevant directory being added to sys.path and django.setup() being called.

My hunch is that this may be caused by a "circular import" during initialization, but I'm not sure (see here for notes on the same error).
Celery daemon config

For completeness, the celeryd and celerybeat config scripts are:
# /etc/default/celeryd
CELERYD_NODES="worker1"
CELERY_BIN="/home/lee/Desktop/pyco/evo-scraping-min/venv/bin/celery"
CELERY_APP="evofrontend"
DJANGO_SETTINGS_MODULE="evofrontend.settings"
CELERYD_CHDIR="/home/lee/Desktop/pyco/evo-scraping-min/evofrontend"
CELERYD_OPTS="--concurrency=1"
# Workers should run as an unprivileged user.
CELERYD_USER="lee"
CELERYD_GROUP="lee"
CELERY_CREATE_DIRS=1
and
# /etc/default/celerybeat
CELERY_BIN="/home/lee/Desktop/pyco/evo-scraping-min/venv/bin/celery"
CELERY_APP="evofrontend"
CELERYBEAT_CHDIR="/home/lee/Desktop/pyco/evo-scraping-min/evofrontend/"
# Django settings module
export DJANGO_SETTINGS_MODULE="evofrontend.settings"
These are mostly based on the generic ones, with the Django settings added and the celery binary from my virtualenv used rather than the system one. I am also using the generic init.d scripts.
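With those in place, the daemons are started in the usual way, assuming the stock generic init.d scripts shipped with celery:

# Start the daemons via the generic init.d scripts (assumption: stock scripts)
sudo /etc/init.d/celeryd start
sudo /etc/init.d/celerybeat start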
Project structure

As for the project: it lives in /home/lee/Desktop/pyco/evo-scraping-min, and all files under it are owned by lee:lee. The directory contains both a Scrapy project (evo-retail) and a Django project (evofrontend), and the full tree structure looks like
├── evofrontend
│ ├── db.sqlite3
│ ├── evofrontend
│ │ ├── celery.py
│ │ ├── __init__.py
│ │ ├── settings.py
│ │ ├── urls.py
│ │ └── wsgi.py
│ ├── evosched
│ │ ├── __init__.py
│ │ ├── myutils.py
│ │ └── tasks.py
│ └── manage.py
└── evo-retail
└── retail
├── logs
├── retail
│ ├── __init__.py
│ ├── settings.py
│ └── spiders
│ ├── __init__.py
│ └── Retail_spider.py
└── scrapy.cfg
Django project relevant files

Now the relevant files. evofrontend/evofrontend/celery.py looks like
# evofrontend/evofrontend/celery.py
from __future__ import absolute_import
import os
from celery import Celery
# set the default Django settings module for the 'celery' program.
os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'evofrontend.settings')
from django.conf import settings
app = Celery('evofrontend')
# Using a string here means the worker will not have to
# pickle the object when using Windows.
app.config_from_object('django.conf:settings')
app.autodiscover_tasks(lambda: settings.INSTALLED_APPS)
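For completeness, the stock celery/Django integration also loads this app when Django starts. A minimal sketch of evofrontend/evofrontend/__init__.py, assuming the standard layout from the celery docs:

# evofrontend/evofrontend/__init__.py (sketch, stock celery/Django integration)
from __future__ import absolute_import

# Ensure the celery app is imported when Django starts,
# so that @shared_task uses this app.
from .celery import app as celery_app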
The possibly relevant settings in the Django settings file, evofrontend/evofrontend/settings.py, are
import os
BASE_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
PROJECT_ROOT = os.path.abspath(os.path.join(os.path.dirname(__file__), os.pardir))
INSTALLED_APPS = (
    ...
    'djcelery',
    'evosched',
)
# Celery settings
BROKER_URL = 'amqp://guest:guest@localhost//'
CELERY_ACCEPT_CONTENT = ['json']
CELERY_TASK_SERIALIZER = 'json'
CELERY_RESULT_SERIALIZER = 'json'
CELERY_TIMEZONE = 'Europe/London'
CELERYD_MAX_TASKS_PER_CHILD = 1 # Each worker is killed after one task, this prevents issues with reactor not being restartable
# Use django-celery backend database
CELERY_RESULT_BACKEND = 'djcelery.backends.database:DatabaseBackend'
# Set periodic task
CELERYBEAT_SCHEDULER = "djcelery.schedulers.DatabaseScheduler"
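For reference, the periodic scrape itself is created through the django admin, which with the DatabaseScheduler amounts to rows in djcelery's tables. A hypothetical sketch of doing the same programmatically (the task name and interval here are just examples):

# Hypothetical sketch: registering the periodic scrape via djcelery's models
# instead of the admin; the DatabaseScheduler reads these tables.
from djcelery.models import PeriodicTask, IntervalSchedule

schedule, _ = IntervalSchedule.objects.get_or_create(every=1, period='hours')
PeriodicTask.objects.get_or_create(
    name='Periodic scrape',                 # example name
    task='evosched.tasks.scrapingTask',
    interval=schedule,
)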
The tasks.py in the scheduling app, evosched, looks like this (it just launches a Scrapy spider with the relevant settings after changing directory):
# evofrontend/evosched/tasks.py
from __future__ import absolute_import
from celery import shared_task
from celery.utils.log import get_task_logger
logger = get_task_logger(__name__)
import os
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from django.conf import settings as django_settings
class CrawlerScript(object):
    def __init__(self, spider, scrapy_settings):
        self.crawler = CrawlerProcess(scrapy_settings)
        self.spider = spider  # just a string

    def run(self, **kwargs):
        # Pass the kwargs (usually command line args) to the crawler
        self.crawler.crawl(self.spider, **kwargs)
        self.crawler.start()

@shared_task
def scrapingTask(**kwargs):
    logger.info("Start scrape...")
    # scrapy.cfg file here pointing to settings...
    base_dir = django_settings.BASE_DIR
    os.chdir(os.path.join(base_dir, '..', 'evo-retail/retail'))
    scrapy_settings = get_project_settings()
    # Run crawler
    cs = CrawlerScript('TestSpider', scrapy_settings)
    cs.run(**kwargs)
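Incidentally, the task can be triggered by hand from a Django shell (python manage.py shell), which helps isolate it from celerybeat; a quick sketch:

# Quick sketch: firing the task manually rather than waiting for celerybeat
from evosched.tasks import scrapingTask
scrapingTask.delay()   # send to the worker
scrapingTask.apply()   # or run synchronously in the current process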
evofrontend/evosched/myutils.py contains only (in this minimal example):
# evofrontend/evosched/myutils.py
SCRAPY_XHR_HEADERS = 'SOMETHING'
Scrapy project relevant files

For the full Scrapy project, the settings file looks like
# evo-retail/retail/retail/settings.py
BOT_NAME = 'retail'
import os
PROJECT_ROOT = os.path.dirname(os.path.abspath(__file__))
SPIDER_MODULES = ['retail.spiders']
NEWSPIDER_MODULE = 'retail.spiders'
and (in this minimal example) the spider is simply
# evo-retail/retail/retail/spiders/Retail_spider.py
from scrapy.conf import settings as scrapy_settings
from scrapy.spiders import Spider
from scrapy.http import Request
import sys
import django
import os
import posixpath
SCRAPY_BASE_DIR = scrapy_settings['PROJECT_ROOT']
DJANGO_DIR = posixpath.normpath(os.path.join(SCRAPY_BASE_DIR, '../../../', 'evofrontend'))
sys.path.insert(0, DJANGO_DIR)
os.environ.setdefault("DJANGO_SETTINGS_MODULE", 'evofrontend.settings')
django.setup()
from evosched.myutils import SCRAPY_XHR_HEADERS
class RetailSpider(Spider):
    name = "TestSpider"

    def start_requests(self):
        print SCRAPY_XHR_HEADERS
        yield Request(url='http://www.google.com', callback=self.parse)

    def parse(self, response):
        print response.url
        return []
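For what it's worth, the spider can also be run on its own from the Scrapy project directory, which exercises the same sys.path/django.setup() logic without celery involved:

# Run the spider directly (from evo-retail/retail, where scrapy.cfg lives)
scrapy crawl TestSpider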
EDIT:

Through much trial and error I found that if the app I am trying to import from is in my INSTALLED_APPS Django setting, then the import fails with the error above, but if I remove the app from there, then I no longer get the import error (e.g. removing evosched from INSTALLED_APPS lets the import in the spider go through fine)... Obviously not a solution, but it may be a clue.
EDIT 2:

I printed sys.path in the spider immediately before the failing import; the result was
/home/lee/Desktop/pyco/evo-scraping-min/evofrontend/../evo-retail/retail
/home/lee/Desktop/pyco/evo-scraping-min/venv/lib/python2.7
/home/lee/Desktop/pyco/evo-scraping-min/venv/lib/python2.7/plat-x86_64-linux-gnu
/home/lee/Desktop/pyco/evo-scraping-min/venv/lib/python2.7/lib-tk
/home/lee/Desktop/pyco/evo-scraping-min/venv/lib/python2.7/lib-old
/home/lee/Desktop/pyco/evo-scraping-min/venv/lib/python2.7/lib-dynload
/usr/lib/python2.7
/usr/lib/python2.7/plat-x86_64-linux-gnu
/usr/lib/python2.7/lib-tk
/home/lee/Desktop/pyco/evo-scraping-min/venv/local/lib/python2.7/site-packages
/home/lee/Desktop/pyco/evo-scraping-min/evofrontend
/home/lee/Desktop/pyco/evo-scraping-min/evo-retail/retail
EDIT 3:

If I do import evosched and then print dir(evosched), I see "tasks", and I can also see "models" if I choose to include such a file, so importing from models is actually possible. However, I do not see "myutils". Even from evosched import myutils fails, and it also fails if the statement is placed inside a function rather than at module level (which I thought might route around any circular import problem)... A direct import evosched works... Possibly import evosched.myutils would work. I haven't tried that yet...
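In code form, the checks from this edit, run inside the spider just before the failing import, look like:

# Checks from EDIT 3, placed in Retail_spider.py before the failing import
import evosched
print dir(evosched)             # 'tasks' is listed; 'myutils' is not
from evosched import myutils    # raises ImportError: No module named myutils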
It seems the celery daemon is running with the system python rather than the python binary from your virtualenv. You need to use
# Python interpreter from environment.
ENV_PYTHON="$CELERYD_CHDIR/env/bin/python"
as mentioned here, to tell celery to run with the python from the virtual environment.
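With the layout from the question, where the virtualenv lives one level above CELERYD_CHDIR, that would presumably be:

# /etc/default/celeryd -- ENV_PYTHON adapted to the venv location above (assumption)
ENV_PYTHON="/home/lee/Desktop/pyco/evo-scraping-min/venv/bin/python"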