Multiprocessing batch job suddenly stops in Python



I'm using gensim word2vec to return the most similar texts from a corpus for a given query text. For example, here are the relevant lines of code to get things started:

import gensim
from gensim.similarities import WmdSimilarity

model = gensim.models.KeyedVectors.load_word2vec_format(
    '/users/myuser/method_approaches/google_news_requirements/GoogleNews-vectors-negative300.bin.gz',
    binary=True)
instance = WmdSimilarity(processed_set, model, num_best=10)

Then I have a very simple function that runs the instance when it is handed to the multiprocessing workers:

def get_most_similar_for_a_given_text(instance, text, output):
    i = instance[text]
    output.put(i)

Then I have a batch-processing script:

import multiprocessing as mp
import numpy as np
import tqdm

def get_most_similar_for_all_texts_in_set(processed_set, instance):
    output = mp.Queue()
    # Set up a list of processes that we want to run
    processes = [mp.Process(target=get_most_similar_for_a_given_text,
                            args=(instance, text, output))
                 for text in processed_set]
    num_cores = mp.cpu_count()
    Scaling_factor_batch_jobs = 3
    number_of_jobs = len(processes)
    num_jobs_per_batch = num_cores * Scaling_factor_batch_jobs
    num_of_batches = int(number_of_jobs // num_jobs_per_batch) + 1
    print('\n' + 'Running batches now...')
    for i in tqdm.tqdm(range(num_of_batches)):
        # although the floor/ceilings look like things are getting double counted,
        # for instance with ranges being 0:24, 24:48, 48.. etc., this is not the
        # case; for whatever reason it doesn't work like that
        if i < num_of_batches - 1:  # true for all but the last batch
            floor_job = int(i * num_jobs_per_batch)  # int because otherwise it's a float and mp doesn't like that
            ceil_job = int(floor_job + num_jobs_per_batch)
            # Run processes
            for p in processes[floor_job:ceil_job]:
                p.start()
            for p in processes[floor_job:ceil_job]:
                p.join()
            for p in mp.active_children():
                p.terminate()
            print(floor_job, ceil_job)
        else:  # true on the last batch, which picks up the jobs that were missed
               # due to rounding in the num_of_batches / num_jobs_per_batch formulas
            floor_job = int(i * num_jobs_per_batch)
            # Run processes
            for p in processes[floor_job:]:
                p.start()
            for p in processes[floor_job:]:
                p.join()
            for p in mp.active_children():
                p.terminate()
            print(floor_job, len(processes))
    # Get process results from the output queue
    results = [output.get() for p in tqdm.tqdm(processes)]
    np.save('/users/josh.flori/method_approaches/numpy_files/wmd_results_list.npy', results)
    return results
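The batch-slicing arithmetic in that loop can be checked in isolation. A minimal sketch with made-up numbers (120 jobs, 24 per batch, standing in for `num_cores * Scaling_factor_batch_jobs`) shows the slices do cover every job exactly once, though the `// ... + 1` formula can produce an empty final batch when the job count divides evenly:

```python
number_of_jobs = 120
num_jobs_per_batch = 24  # e.g. 8 cores * scaling factor 3
num_of_batches = number_of_jobs // num_jobs_per_batch + 1  # 6; the 6th batch is empty here

batches = []
for i in range(num_of_batches):
    floor_job = i * num_jobs_per_batch
    # last batch picks up everything remaining instead of a fixed-size slice
    ceil_job = floor_job + num_jobs_per_batch if i < num_of_batches - 1 else number_of_jobs
    batches.append((floor_job, ceil_job))

print(batches)  # -> [(0, 24), (24, 48), (48, 72), (72, 96), (96, 120), (120, 120)]
```

Adjacent slices like `0:24` and `24:48` do not double-count because Python slices exclude the upper bound.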

What actually happens when I run it is that batches 1 through 4 complete fine. Those batches cover texts 0:96 of processed_set, which is what I'm looping over. But then it reaches batch 5, texts 96:120, and it seems to simply stop processing — it doesn't fail, exit, crash, or do anything. Visually it still looks like it's running, but it isn't: my CPU activity drops and the progress bar stops moving.

I visually inspected texts 96:120 of processed_set and nothing looks unusual. I then ran get_most_similar_for_a_given_text on those texts in isolation, outside the multiprocessing function, and they completed fine.

Anyway, to reiterate, it always happens at batch 5. Does anyone have any insight here? I'm not very familiar with how multiprocessing works.

Thanks again

This is probably because you're using a Queue. If the queue fills up, a child process gets stuck trying to put its result into it — while the parent is blocked in join() waiting for that child to exit, so the whole thing deadlocks. Try testing with a very small processed_set and see whether all the jobs complete. If they do, you may want to use a Pipe to fetch the large results.
