So basically I have a fairly complex workflow that looks roughly like this:
>>> res = (add.si(2, 2) | add.s(4) | add.s(8))()
>>> res.get()
16
After that, it's fairly trivial for me to walk along the result chain and collect all the individual results:
>>> res.parent.get()
8
>>> res.parent.parent.get()
4
My question is: what if my third task depends on knowing the result of the first task, but in this example it only receives the result of the second task?
Also, the chain is quite long and the results aren't that small, so passing them through as input would pollute the result store unnecessarily. This is Redis, so the limitations that exist when using RabbitMQ, ZeroMQ, ... do not apply.
Maybe your setup is too complex for this, but I like to use a group combined with a noop task to accomplish something similar. I do it this way because I want to highlight the areas in my pipeline that are still synchronous (usually so that they can be removed).
Using something similar to your example, I start with a set of tasks that looks like the following.
tasks.py:
from celery import Celery

app = Celery('tasks', backend="redis", broker='redis://localhost')

@app.task
def add(x, y):
    return x + y

@app.task
def xsum(elements):
    return sum(elements)

@app.task
def noop(ignored):
    return ignored
With those tasks, I then create a chain, using a group to carry along the result that the later steps depend on:
In [1]: from tasks import add, xsum, noop
In [2]: from celery import group
# First I run the task which I need the value of later, then I send that result to a group where the first task does nothing and the other tasks are my pipeline.
In [3]: ~(add.si(2, 2) | group(noop.s(), add.s(4) | add.s(8)))
Out[3]: [4, 16]
# At this point I have a list where the first element is the result of my original task and the second element has the result of my workflow.
In [4]: ~(add.si(2, 2) | group(noop.s(), add.s(4) | add.s(8)) | xsum.s())
Out[4]: 20
# From here, things can go back to a normal chain
In [5]: ~(add.si(2, 2) | group(noop.s(), add.s(4) | add.s(8)) | xsum.s() | add.s(1) | add.s(1))
Out[5]: 22
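For reference, the ~ in this session is canvas shorthand: ~sig expands, roughly, to sig.apply_async().get(). So outside of a REPL, the In [4] pipeline could be written as something like this sketch (same tasks.py as above):

from celery import group
from tasks import add, xsum, noop

# ~sig is shorthand for sig.apply_async().get()
workflow = add.si(2, 2) | group(noop.s(), add.s(4) | add.s(8)) | xsum.s()
print(workflow.apply_async().get())  # 20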
I hope this is useful!
I assign a job id to each chain and keep track of that job by saving its data in a database.
Launching the queue:
import uuid
# Assumption: the tasks shown below live in tasks.py
from tasks import add, clean

if __name__ == "__main__":
    # Generate a unique id for the job
    job_id = uuid.uuid4().hex
    # This is the root parent
    parent_level = 1
    # Pack the data. The last value is your value to add
    parameters = job_id, parent_level, 2
    # Build the chain. I added a clean task that removes the data
    # created during the process (if you want that)
    add_chain = add.s(parameters, 2) | add.s(4) | add.s(8) | clean.s()
    add_chain.apply_async()
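As a rough trace of what each link in that chain receives and returns (with the levels stored by the add task shown below):

# add((job_id, 1, 2), 2)  -> stores level 1, returns (job_id, 2, 4)
# add((job_id, 2, 4), 4)  -> stores level 2, returns (job_id, 3, 8)
# add((job_id, 3, 8), 8)  -> stores level 3, returns (job_id, 4, 16)
# clean((job_id, 4, 16))  -> deletes every row stored for this job_id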
Now the tasks:
import inject
# EntityManager and Result are the helper classes defined further down

# Function for storing a result. I used SQLAlchemy (MySQL), but you can
# change it to whatever you want (a distributed file system, for example)
@inject.params(entity_manager=EntityManager)
def save_result(job_id, level, result, entity_manager):
    r = Result()
    r.job_id = job_id
    r.level = level
    r.result = result
    entity_manager.add(r)
    entity_manager.commit()

# Restore the result from one parent
@inject.params(entity_manager=EntityManager)
def get_result(job_id, level, entity_manager):
    result = entity_manager.query(Result).filter_by(job_id=job_id, level=level).one()
    return result.result

# Clear the data, or do something with the final result
@inject.params(entity_manager=EntityManager)
def clear(job_id, entity_manager):
    entity_manager.query(Result).filter_by(job_id=job_id).delete()

@app.task()
def add(parameters, number):
    # Unpack the data from the parameters tuple
    job_id, level, other_number = parameters
    # Load the result from your second parent (level - 2);
    # for the third parent use level - 3, and so on
    # second_parent_result = get_result(job_id, level - 2)
    # Do your stuff; I guess you want to add numbers
    result = number + other_number
    save_result(job_id, level, result)
    # Return the result of the sum, or anything you want, but you have to send
    # something because the "add" task expects 3 values.
    # Of course you should return the actual job id and increment the parent level
    return job_id, level + 1, result

@app.task()
def clean(parameters):
    job_id, level, result = parameters
    # Do something with the final result, or not
    # Clear the data
    clear(job_id)
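For the concrete case in the question (the third task needs the first task's result), the task body can pull the stored value back in with get_result. A minimal sketch; the name add_with_grandparent is hypothetical:

@app.task()
def add_with_grandparent(parameters, number):
    # Hypothetical variant of add() above that also folds in the result
    # stored two levels up for this job_id
    job_id, level, other_number = parameters
    result = number + other_number
    if level >= 3:
        # When this is the third task, level - 2 is the first task's level
        result += get_result(job_id, level - 2)
    save_result(job_id, level, result)
    return job_id, level + 1, result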
I use an entity_manager to handle the database operations. My entity manager uses SQLAlchemy and MySQL, and a table "result" stores the partial results. This part should be swapped out for whatever storage system suits you best (or kept as-is if MySQL works for you):
import os
import inject
import yaml
from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.orm import sessionmaker, declarative_base

Base = declarative_base()

class Configuration:
    def __init__(self, params=None):
        f = open(os.environ.get('PYTHONPATH') + '/conf/config.yml')
        self.configMap = yaml.safe_load(f)
        f.close()

    def __getitem__(self, key: str):
        return self.configMap[key]

class EntityManager:
    session = None

    @inject.params(config=Configuration)
    def __init__(self, config):
        conf = config['persistence']
        # Note: no password in this URI; add one after the ":" if you need it
        uri = conf['driver'] + "://" + conf['username'] + ":@" + conf['host'] + "/" + conf['database']
        engine = create_engine(uri, echo=conf['debug'])
        Session = sessionmaker(bind=engine)
        self.session = Session()

    def query(self, entity_type):
        return self.session.query(entity_type)

    def add(self, entity):
        return self.session.add(entity)

    def flush(self):
        return self.session.flush()

    def commit(self):
        return self.session.commit()

class Result(Base):
    __tablename__ = 'result'

    id = Column(Integer, primary_key=True)
    job_id = Column(String(255))
    level = Column(Integer)
    result = Column(Integer)

    def __repr__(self):
        return "<Result (job='%s', level='%s', result='%s')>" % (
            self.job_id, str(self.level), str(self.result))
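One thing the snippet leaves implicit is creating the result table. A one-time setup sketch, assuming the Base from above and a made-up connection URI:

from sqlalchemy import create_engine

# Hypothetical URI; mirror whatever EntityManager builds from your config
engine = create_engine("mysql://user:@localhost/jobs")
Base.metadata.create_all(engine)  # creates the `result` table if it is missing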
I use the package inject to get a dependency injector. inject reuses objects, so you can inject access to the database whenever you want without worrying about connections.
The Configuration class loads the database access data from a configuration file. You can replace it with static data (a hardcoded map) for testing.
Swap in whatever dependency injection suits you; this is just my solution, and I added it only to get a quick test running.
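For example, a minimal sketch of wiring inject up with a hardcoded stand-in for tests (StaticConfiguration and its values are made up):

import inject

class StaticConfiguration:
    # Hypothetical hardcoded stand-in for Configuration, handy in tests
    def __getitem__(self, key):
        return {'persistence': {'driver': 'mysql', 'username': 'user',
                                'host': 'localhost', 'database': 'jobs',
                                'debug': False}}[key]

def configure(binder):
    # Bind the Configuration key to one reusable instance
    binder.bind(Configuration, StaticConfiguration())

inject.configure(configure)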
The key here is to store the partial results somewhere outside the queue system, and to return from each task the data needed to access those results (the job_id and the parent level). The extra data you send along is tiny: an address (job_id + parent level) that points at the real data (something big).
This solution is something I use in my own software.
A simple workaround is to store the results of the tasks in a list and use them from within your tasks.
from celery import Celery, chain
from celery.signals import task_success

results = []

app = Celery('tasks', backend='amqp', broker='amqp://')

@task_success.connect()
def store_result(**kwargs):
    sender = kwargs.pop('sender')
    result = kwargs.pop('result')
    results.append((sender.name, result))

@app.task
def add(x, y):
    print("previous results", results)
    return x + y
Now all the previous results are accessible from any task in your chain, in any order.
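A quick usage sketch (this relies on every task running in the same worker process, since results is just a module-level list in that process):

from tasks import add

# Each add() call prints the (task_name, result) pairs collected so far
res = (add.s(2, 2) | add.s(4) | add.s(8))()
print(res.get())  # 16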