Multiprocessing pandas dataframe chunks

I'm working with a huge CSV file (15+ GB). I use fuzzy matching to extract rows, but when I check the resource monitor the script appears to use only 1 core, and it takes a very long time to run. Here's a sample of the current script:

import csv
import pandas as pd
from fuzzywuzzy import fuzz

with open('output.txt', 'w', newline='', encoding='utf8') as fw:
    writer = csv.writer(fw, delimiter=',', lineterminator='\n')
    for chunk in pd.read_csv("inputfile.csv", chunksize=10000, sep=','):
        for index, row in chunk.iterrows():
            if fuzz.token_set_ratio("search_terms", "{0}".format(row['item_name'])) > 90 and row['brand'] != 'example_brand':
                print(row['item_name'], row['brand'])  # just for visual confirmation, since the script runs for hours and hours
                line = (row['id'], row['brand'], row['item_name'])
                writer.writerow(line)

I want to set this up so the chunks are distributed to multiple processes with multiprocessing.Pool, but I'm fairly new to Python and haven't had any luck getting the attempt below to work. It pins all 4 CPU cores and seems to spawn a bunch of processes that then terminate immediately while, as far as I can tell, doing nothing. Does anyone know why it behaves this way and how to get it working?

def fuzzcheck(chunk):
    for index, row in chunk.iterrows():
        if fuzz.token_set_ratio("search_terms", "{0}".format(row['item_name'])) > 90 and row['brand'] != "example_brand":
            print(row['item_name'], row['brand'])
            line = (row['ID'], row['brand'], row['item_name'])
            writer.writerow(line)

with mp.Pool(4) as pool, open('output.txt', 'w', newline='', encoding='utf8') as fw:
    writer = csv.writer(fw, delimiter=',', lineterminator='\n')
    for chunk in pd.read_csv("inputfile.csv", chunksize=10000, sep=','):
        pool.apply(fuzzcheck, chunk)
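For reference, here is a minimal sketch of one way this could work (assuming fuzz comes from fuzzywuzzy and the same file and column names as above): have each worker return its matching rows and do all the writing in the parent process, since a csv.writer opened in the parent isn't usable from child processes, and pool.apply blocks until each call returns, so it never actually runs chunks in parallel.

import csv
import multiprocessing as mp
import pandas as pd
from fuzzywuzzy import fuzz  # assumption: fuzz comes from fuzzywuzzy (thefuzz has the same API)

def fuzzcheck(chunk):
    # Return matching rows instead of writing them: child processes
    # can't share the parent's csv.writer/file handle.
    matches = []
    for index, row in chunk.iterrows():
        if fuzz.token_set_ratio("search_terms", str(row['item_name'])) > 90 \
                and row['brand'] != "example_brand":
            matches.append((row['id'], row['brand'], row['item_name']))
    return matches

if __name__ == '__main__':  # required when multiprocessing spawns workers (e.g. on Windows)
    reader = pd.read_csv("inputfile.csv", chunksize=10000, sep=',')
    with mp.Pool(4) as pool, open('output.txt', 'w', newline='', encoding='utf8') as fw:
        writer = csv.writer(fw, delimiter=',', lineterminator='\n')
        # imap feeds chunks to the pool as the reader yields them and
        # returns results in order; unlike apply, calls overlap across workers.
        for matches in pool.imap(fuzzcheck, reader):
            writer.writerows(matches)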

The answer turned out to be here: No multiprocessing print outputs (Spyder)

As it turns out, Spyder won't run multiprocessing unless the script is launched in a new window.
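In practice this means running the script from a regular terminal (python script.py) or, in Spyder, choosing the run option that executes in an external system terminal. Either way, the pool setup should sit behind an if __name__ == '__main__': guard, as in the sketch above, so spawned workers don't re-run it on import.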
