Here I want to run a crawler every 1 minute with Celery. I wrote the task as shown below and call it with delay in the view, but I don't get any result.

I run celery -A mysite worker -l info in a separate terminal; Celery, the RabbitMQ broker, Scrapyd and the Django server each run in their own terminal. The crawler home view successfully creates the task object and redirects to the task list, but Celery does not work.
It throws this error in the Celery console:

[2020-06-08 15:36:06,732: INFO/MainProcess] Received task: crawler.tasks.schedule_task[3b537143-caa8-4445-b3d6-c0bc8d301b89]
[2020-06-08 15:36:06,735: ERROR/MainProcess] Task handler raised error: ValueError('not enough values to unpack (expected 3, got 0)')
Traceback (most recent call last):
  File "....venv\lib\site-packages\billiard\pool.py", line 362, in workloop
    result = (True, prepare_result(fun(*args, **kwargs)))
  File "....venv\lib\site-packages\celery\app\trace.py", line 600, in _fast_trace_task
    tasks, accept, hostname = _loc
ValueError: not enough values to unpack (expected 3, got 0)
views.py
class CrawlerHomeView(LoginRequiredMixin, View):
    login_url = 'users:login'

    def get(self, request, *args, **kwargs):
        frequency = Task()
        categories = Category.objects.all()
        targets = TargetSite.objects.all()
        keywords = Keyword.objects.all()
        form = CreateTaskForm()
        context = {
            'targets': targets,
            'keywords': keywords,
            'frequency': frequency,
            'form': form,
            'categories': categories,
        }
        return render(request, 'index.html', context)

    def post(self, request, *args, **kwargs):
        form = CreateTaskForm(request.POST)
        if form.is_valid():
            unique_id = str(uuid4())  # create a unique ID.
            obj = form.save(commit=False)
            obj.created_by = request.user
            obj.unique_id = unique_id
            obj.status = 0
            obj.save()
            form.save_m2m()
            schedule_task.delay(obj.pk)
        return render(request, 'index.html', {'form': form, 'errors': form.errors})
tasks.py
scrapyd = ScrapydAPI('http://localhost:6800')

@periodic_task(run_every=crontab(minute=1))  # how to do with task search_frequency value ?
def schedule_task(pk):
    task = Task.objects.get(pk=pk)
    if task.status == 0 or task.status == 1 and not datetime.date.today() >= task.scraping_end_date:
        unique_id = str(uuid4())  # create a unique ID.
        keywords = ''
        # for keys in ast.literal_eval(obj.keywords.all()):  # keywords change to csv
        for keys in task.keywords.all():
            if keywords:
                keywords += ', ' + keys.title
            else:
                keywords += keys.title
        settings = {
            'spider_count': len(task.targets.all()),
            'keywords': keywords,
            'unique_id': unique_id,  # unique ID for each record for DB
            'USER_AGENT': 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
        }
        # res = ast.literal_eval(ini_list)
        for site_url in task.targets.all():
            domain = urlparse(site_url.address).netloc  # parse the url and extract the domain
            spider_name = domain.replace('.com', '')
            scrapyd.schedule('default', spider_name, settings=settings, url=site_url.address,
                             domain=domain, keywords=keywords)
    elif task.scraping_end_date == datetime.date.today():
        task.status = 2
        task.save()  # change the task status as completed.
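A side note on the schedule itself (a sketch based on the Celery scheduling docs, not my exact setup): crontab(minute=1) matches only minute 1 of every hour, while an every-minute schedule is what the crontab defaults give you (or a plain number of seconds). Also, beat calls a periodic task with no arguments, so a schedule_task(pk) signature cannot be filled in by the scheduler itself.

# Sketch only: assumes Celery 4.x, where the periodic_task decorator is still
# available (it was removed in Celery 5.0).
from celery.task import periodic_task
from celery.schedules import crontab

@periodic_task(run_every=crontab())   # all crontab fields default to '*', i.e. every minute
def schedule_every_minute():
    pass

@periodic_task(run_every=60.0)        # alternatively: run every 60 seconds
def schedule_every_60_seconds():
    pass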
settings.py
CELERY_BROKER_URL = 'amqp://localhost'
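For reference, celery -A mysite worker -l info also assumes a mysite/celery.py app module along these lines (a sketch of the standard Django + Celery integration; this file isn't shown above, so the exact names are assumptions):

# mysite/celery.py -- assumed layout, following the usual Django + Celery setup
import os
from celery import Celery

os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'mysite.settings')

app = Celery('mysite')
# Pick up CELERY_* settings (e.g. CELERY_BROKER_URL above) from Django settings.
app.config_from_object('django.conf:settings', namespace='CELERY')
# Discover tasks.py modules in installed apps, e.g. crawler/tasks.py.
app.autodiscover_tasks()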
EDIT

This answer helped me find the solution to Celery raising ValueError: not enough values to unpack.
That error is gone now. In the Celery console I now see this:

[2020-06-08 16:33:23,123: INFO/MainProcess] Task crawler.tasks.schedule_task[0578558d-0dc6-4db7-b69f-e912b604ff3d] succeeded in 0.016000000000531145s: None

And there are no crawled results on my frontend.
Now my question is: how can I check that my task is running periodically every 1 minute?
This is my first time using Celery, so there might be some issues here.
Celery no longer supports Windows as a platform (version 4 dropped official support).
I highly recommend dockerizing your application instead (or using WSL2). If you don't want to go that route, you may need to use gevent (note that there may be some other issues if you go this way):
pip install gevent
celery -A <module> worker -l info -P gevent
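One more thing to check, as an assumption about your setup since the command isn't shown in the question: periodic_task schedules are only dispatched by celery beat, so a worker alone will never fire them every minute. Run beat in another terminal, or embed it in the worker while developing:

celery -A mysite beat -l info
celery -A mysite worker -B -l info -P gevent

If beat is running, its log prints a "Scheduler: Sending due task ..." line each time the schedule fires, which is an easy way to confirm the every-minute cadence.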
A similar, more detailed answer can be found here.