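I am trying to label multiple images organized as brand -> product -> each product image. Labeling a single image takes a little time, so I decided to use multiprocessing to speed up the job. Multiprocessing does speed up the labeling, but the code does not behave the way I expected.
Code: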
import json
import multiprocessing

# label_product_images() is defined elsewhere in the asker's code (not shown).

def multiprocessing_func(line):
    json_line = json.loads(line)
    product = json_line['groupid']
    active_urls = set(json_line['urls'])
    try:
        # Drop the brand's placeholder image URL if it is present.
        active_urls.remove(brand_dic[brand])
    except:
        pass
    if product in saved_product_dict and active_urls == saved_product_dict[product]:
        keep_products.append(product)
        print('True')
    else:
        with open(new_images_filename, 'a') as save_file:
            labels = label_product_images(line)
            save_file.write('{}\n'.format(json.dumps(labels)))
        print('False')

active_images_filename = 'data/input/image_urls.json'
new_images_filename = 'data/output/new_labeled_images.json'
saved_images_filename = 'data/output/saved_labeled_images.json'
brand_dic = {'a': 'https://www.a.com/imgs/ab/images/dp/m.jpg',
             'b': 'https://www.b.com/imgs/ab/images/wcm/m.jpg',
             'c': 'https://www.c.com/imgs/ab/images/dp/m.jpg',}

if __name__ == '__main__':
    brands = ['a', 'b', 'c']
    for brand in brands:
        active_images_filename = 'data/input/brands/' + brand + '/image_urls.json'
        new_images_filename = 'data/output/brands/' + brand + '/new_labeled_images.json'
        saved_images_filename = 'data/output/brands/' + brand + '/saved_labeled_images.json'
        print(new_images_filename)
        # Truncate the output file for this brand.
        with open(new_images_filename, 'w'): pass
        saved_product_dict = {}
        with open(saved_images_filename) as in_file:
            for line in in_file:
                json_line = json.loads(line)
                saved_urls = [url for urls_list in json_line['urls'] for url in urls_list]
                saved_product_dict[json_line['groupid']] = set(saved_urls)
        print(saved_product_dict)
        keep_products = []
        labels_list = []
        with open(active_images_filename, 'r') as in_file:
            processes = []
            for line in in_file:
                # One process per input line; nothing waits for them to finish.
                p = multiprocessing.Process(target=multiprocessing_func, args=(line,))
                processes.append(p)
                p.start()
        print('complete stage 1')
        for i in range(0,2):
            print('running stage 2')
Output:
data/output/brands/mg/new_labeled_images.json
{}
complete stage 1
running stage 2
running stage 2
silo : https://www.a.com/mgimgs/rk/images/dp/wcm/202025/0011/terminal-1-soft-sided-carry-on-m.jpg
silo : https://www.a.com/mgimgs/rk/images/dp/wcm/202025/0011/terminal-1-soft-sided-carry-on-m.jpg
silo : https://www.a.com/mgimgs/rk/images/dp/wcm/202010/0027/anchor-hope-and-protect-necklace-m.jpg
silo : https://www.a.com/mgimgs/rk/images/dp/wcm/202007/0003/patterned-folded-notecards-set-of-25-m.jpg
silo : https://www.a.com/mgimgs/rk/images/dp/wcm/202005/0003/patterned-folded-notecards-set-of-25-t.jpg
silo : https://a/mgimgs/rk/images/dp/wcm/202007/0002/patterned-folded-notecards-set-of-25-1-m.jpg
unmatched : https://www.a.com/mgimgs/rk/images/dp/a/202010/0013.jpg
silo : https://www.a.com/mgimgs/rk/images/dp/a/202007/0002.jpg
silo : https://www.a.com/mgimgs/rk/images/dp/a/202007/0003.jpg
False
unmatched : https://www.a.com/mgimgs/rk/images/dp/a/202010/0022.jpg
silo : https://www.a.com/mgimgs/rk/images/dp/wcm/202019/454.jpg
False
lifestyle - Lif1 : https://a.com/mgimgs/rk/images/dp/wcm/202025/0011.jpg
False
False
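I noticed that the multiprocessing step runs at the end and seems to skip code, and I am not sure why. I am also not sure why the first part does not run: when I print saved_product_dict, the dictionary is empty.
I have code that runs before and after the multiprocessing step. My question is how to force the multiprocessing step to run in the order in which I wrote the code. An explanation of what is happening would be much appreciated. I am new to multiprocessing and still learning how it works.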
This line looks wrong. Try replacing
saved_urls = [url for urls_list in json_line['urls'] for url in urls_list]
with:
saved_urls = [url for url in json_line['urls']]
This may solve the first part of your problem.
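To see why the choice of comprehension matters, here is a small sketch that compares the two forms. The records and URLs are made up for illustration, and the real shape of 'urls' in the saved file is not shown in the question, so check it before picking one.

```python
# Hypothetical record where 'urls' is a flat list of URL strings.
flat_record = {"groupid": "p1",
               "urls": ["https://example.com/a.jpg", "https://example.com/b.jpg"]}

# The nested comprehension treats each URL string as a sequence,
# so it yields single characters instead of URLs.
nested_on_flat = [url for urls_list in flat_record["urls"] for url in urls_list]

# The single-level comprehension keeps each URL intact.
single_on_flat = [url for url in flat_record["urls"]]

# Hypothetical record where 'urls' is a list of lists.
nested_record = {"groupid": "p2",
                 "urls": [["a.jpg"], ["b.jpg", "c.jpg"]]}

# Here the nested comprehension is the right tool: it flattens one level.
nested_on_nested = [url for urls_list in nested_record["urls"] for url in urls_list]

print(nested_on_flat[:5])
print(single_on_flat)
print(nested_on_nested)
```

If the saved file really stores a list of lists, the original line is fine and the empty dictionary more likely means the saved file had no lines to read.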
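As for the multiprocessing part of the program and the prints from the main process: in an asynchronous environment (here, separate processes), the order of printed output is not always a reliable indicator of when a function or script actually ran. If you want the script to run in a defined order, you need either a synchronization mechanism built on semaphores and mutexes, or to wait for all processes to exit before entering stage 2 (for example, by calling join() on each Process), which I think is your main concern.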
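The "wait for all processes to exit" option can be sketched as follows. This is a minimal illustration, not the asker's actual pipeline: label_line, worker, run_stage_1, and the sample records are hypothetical stand-ins for the real labeling code.

```python
import json
import multiprocessing


def label_line(line):
    """Hypothetical stand-in for the per-line labeling work."""
    record = json.loads(line)
    return {"groupid": record["groupid"], "n_urls": len(record["urls"])}


def worker(line):
    # Runs in a child process; printing stands in for writing labels to a file.
    print("labeled", label_line(line))


def run_stage_1(lines):
    """Start one process per line, then wait for all of them to finish."""
    processes = []
    for line in lines:
        p = multiprocessing.Process(target=worker, args=(line,))
        processes.append(p)
        p.start()
    # join() blocks until the child has exited, so nothing after this
    # loop runs while stage 1 is still in progress.
    for p in processes:
        p.join()
    return [p.exitcode for p in processes]


if __name__ == "__main__":
    # Made-up input lines in the same JSON-lines shape as the question.
    sample_lines = [
        json.dumps({"groupid": "p1", "urls": ["u1", "u2"]}),
        json.dumps({"groupid": "p2", "urls": ["u3"]}),
    ]
    exit_codes = run_stage_1(sample_lines)
    print("complete stage 1", exit_codes)
    print("running stage 2")
```

Note that each child process has its own copy of the module's globals, so a list such as keep_products that is appended to inside a child is not updated in the parent. If the parent needs results back, multiprocessing.Pool.map is one alternative: it waits for all work to finish and returns the workers' return values.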