Suppose I have a single function processing. I want to run the same function multiple times in parallel with different parameters, instead of running the calls sequentially one after another.
def processing(image_location):
    image = rasterio.open(image_location)
    ...
    ...
    return result

# Calling the function serially, one after the other, with different
# parameters and saving the results to variables.
results1 = processing(r'/home/test/image_1.tif')
results2 = processing(r'/home/test/image_2.tif')
results3 = processing(r'/home/test/image_3.tif')
For example, if I run processing(r'/home/test/image_1.tif'), processing(r'/home/test/image_2.tif') and processing(r'/home/test/image_3.tif') as in the code above, they run sequentially: if one call takes 5 minutes, the three calls take 5 x 3 = 15 minutes. The task is embarrassingly parallel, so I would like to run the three calls in parallel and have all three finish in roughly 5 minutes.

Help me find the fastest way to do this. The script should be able to use all available resources (CPUs/RAM) by default for this task.
You can use multiprocessing to execute the function in parallel and save the results to the results variable:
from multiprocessing.pool import ThreadPool

pool = ThreadPool()  # defaults to one worker per CPU core
images = [r'/home/test/image_1.tif', r'/home/test/image_2.tif', r'/home/test/image_3.tif']
results = pool.map(processing, images)
You might want to look at IPython Parallel. It lets you easily run functions on a load-balanced (local) cluster.

For this small example, make sure IPython Parallel, NumPy and Pillow are installed. To run the example, you first need to start a cluster. To start a local cluster with four parallel engines, type in a terminal (one engine per processor core seems a reasonable choice):

ipcluster start -n 4
You can then run the following script, which searches a given directory for jpg images and counts the number of pixels in each image:
import ipyparallel as ipp

rc = ipp.Client()

with rc[:].sync_imports():  # import on all engines
    import numpy
    from pathlib import Path
    from PIL import Image

lview = rc.load_balanced_view()  # default load-balanced view
lview.block = True  # block until map() is finished

@lview.parallel()
def count_pixels(fn: Path):
    """Silly function to count the number of pixels in an image file"""
    im = Image.open(fn)
    xx = numpy.asarray(im)
    num_pixels = xx.shape[0] * xx.shape[1]
    return fn.stem, num_pixels

pic_dir = Path('Pictures')
fn_lst = pic_dir.glob('*.jpg')  # all jpg files in pic_dir
results = count_pixels.map(fn_lst)  # execute in parallel

for n_, cnt in results:
    print(f"'{n_}' has {cnt} pixels.")
Another way of writing it with the multiprocessing library (see @Alderven's answer for a different approach).
import multiprocessing as mp
import numpy as np

def calculate(input_args):
    result = input_args * 2
    return result

N = mp.cpu_count()
parallel_input = np.arange(0, 100)

print('Amount of CPUs ', N)
print('Amount of iterations ', len(parallel_input))

with mp.Pool(processes=N) as p:
    results = p.map(calculate, list(parallel_input))
The results variable will contain a list with the processed data, which you can then write out.
I think one of the simplest ways is to use joblib:
import joblib
allJobs = []
allJobs.append(joblib.delayed(processing)(r'/home/test/image_1.tif'))
allJobs.append(joblib.delayed(processing)(r'/home/test/image_2.tif'))
allJobs.append(joblib.delayed(processing)(r'/home/test/image_3.tif'))
results = joblib.Parallel(n_jobs=joblib.cpu_count(), verbose=10)(allJobs)
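The same jobs can also be built with a generator expression instead of appending them one by one; processing here is a stand-in for the question's function (it just returns the path unchanged), and n_jobs=-1 tells joblib to use all available cores:

```python
import joblib

def processing(image_location):
    # Stand-in for the real raster processing; returns the path unchanged.
    return image_location

images = [r'/home/test/image_1.tif',
          r'/home/test/image_2.tif',
          r'/home/test/image_3.tif']

# n_jobs=-1 uses all available CPU cores.
results = joblib.Parallel(n_jobs=-1)(
    joblib.delayed(processing)(p) for p in images)
```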