Is the following code the correct way to run func in parallel across multiple processors? (dfA_Train and dfB_Train are lists of Pandas DataFrames.)
import itertools
from multiprocessing import Pool

import pandas as pd

def func(dfA, dfB, param):
    L = []
    ...
    return L  # L is a list of dictionaries

if __name__ == "__main__":
    dictList = []
    for param in paramGrid:
        with Pool(processes=8) as pool:
            rslts = pool.starmap(func, zip(dfA_Train, dfB_Train, itertools.repeat(param)))
        # flatten the list of lists, dropping empty entries
        rslts = [item for subList in rslts for item in subList if item != []]
        _ = [dictList.append(item) for item in rslts]
    results = pd.DataFrame.from_dict(dictList) if dictList else None
It doesn't appear to run in more than one process; there is no speedup between running with a single process and with multiple processes.
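One quick way to check whether the work is actually being spread across processes is to log the worker PID from inside the function. Below is a minimal sketch (probe is a hypothetical stand-in for func, used purely for diagnostics):

import os
from multiprocessing import Pool

def probe(x):
    # report which worker process handled this task
    print(f"task {x} ran in PID {os.getpid()}")
    return x

if __name__ == "__main__":
    with Pool(processes=8) as pool:
        pool.map(probe, range(16))

If every task reports the same PID, the tasks are not being distributed; if several PIDs appear, the pool is working and the missing speedup lies elsewhere (for example, in the cost of pickling large DataFrames to the workers).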
I believe multiprocessing.dummy just spins up multiple threads under the hood, per the documentation linked below. With threads, only a single process is launched, and every thread in it is governed by the Global Interpreter Lock (GIL), so CPU-bound work does not run in parallel. multiprocessing.Pool() is the right candidate for achieving parallelism: unlike threads, a Pool object hands each task to a separate process. For example, consider the following snippet.
https://docs.python.org/3/library/multiprocessing.html#module-multiprocessing.dummy
from multiprocessing import Pool

def worker_function(arg_1, arg_2):
    print('do some work with', arg_1, 'and', arg_2)

work = (["A", 5], ["B", 2], ["C", 1], ["D", 3])

if __name__ == "__main__":
    with Pool(processes=4) as proc_pool:  # create a multiprocessing Pool with the desired number of processes
        proc_pool.starmap(worker_function, work)  # starmap unpacks each two-element item into arg_1 and arg_2
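Applied to the code in the question, the same idea suggests building one flat task list and creating the Pool a single time, instead of tearing it down and re-creating it for every param. A rough sketch, assuming func, paramGrid, dfA_Train, and dfB_Train are defined as in the question:

from multiprocessing import Pool

import pandas as pd

if __name__ == "__main__":
    # one task per (dfA, dfB) pair and parameter combination
    tasks = [(dfA, dfB, param)
             for param in paramGrid
             for dfA, dfB in zip(dfA_Train, dfB_Train)]
    with Pool(processes=8) as pool:  # the pool is created once for the whole grid
        rslts = pool.starmap(func, tasks)
    dictList = [item for subList in rslts for item in subList if item != []]
    results = pd.DataFrame.from_dict(dictList) if dictList else None

If this still shows no speedup, the per-task overhead of pickling the DataFrames to the worker processes may simply dominate the time func spends computing; in that case, giving each task a larger chunk of work would matter more than how the pool is set up.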