"Too many open files" error with multiprocessing

I am parsing data from 100K files and saving it to another file for further processing. I used Python's multiprocessing module to speed this up.

import multiprocessing
from pathlib import Path

processes = []
for num in range(1, 5000):
    string = "{0:06}".format(num)
    path = "filename" + string + ".npy"
    check_file_exist = Path(path)
    if check_file_exist.is_file():
        # Multiprocessing for generating file using multiple CPUs
        p = multiprocessing.Process(target=Get_feature_vector, args=(path,))
        processes.append(p)
        p.start()
    else:
        print("file not found", string)
for process in processes:
    process.join()

The code above fails with [Errno 24] Too many open files. How can I use multiprocessing so that only 20-30 files are open for processing at any one time?

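I have read the documentation for pool.map(), but building a list of 100K file names up front seems excessive to me. Is there an efficient way to speed this up without opening a huge number of files? I have a machine with 40 processors.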
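Some context on the error itself: on Unix-like systems every process has a limit on the number of file descriptors it may hold open, and each started multiprocessing.Process most likely keeps at least one descriptor open in the parent until it is joined. Starting thousands of processes before joining any of them can therefore exhaust that limit. A minimal sketch for inspecting the limit, assuming a Unix-like system (the resource module is not available on Windows):

```python
import resource

# Soft limit: the value currently enforced for this process.
# Hard limit: the ceiling up to which the soft limit may be raised.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("soft limit:", soft, "hard limit:", hard)
```

Raising the limit only postpones the failure; capping the number of concurrent workers addresses the cause.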

If you don't want to build the whole list, use a handy generator that yields a fixed number of files at a time, and feed each batch to pool.map():

import pprint

def chunks(l, n):
    """Yield successive n-sized chunks from l."""
    for i in range(0, len(l), n):
        # list() so that a slice of a range prints as a list in Python 3
        yield list(l[i:i + n])

pprint.pprint(list(chunks(range(10, 75), 10)))

which prints:

[[10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
 [20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
 [30, 31, 32, 33, 34, 35, 36, 37, 38, 39],
 [40, 41, 42, 43, 44, 45, 46, 47, 48, 49],
 [50, 51, 52, 53, 54, 55, 56, 57, 58, 59],
 [60, 61, 62, 63, 64, 65, 66, 67, 68, 69],
 [70, 71, 72, 73, 74]]
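Note that chunks() relies on len() and slicing, so it works on a range or a list but not on an arbitrary generator. If the file names come from a generator, a variant built on itertools.islice may be more suitable. This is a sketch rather than part of the original answer, and lazy_chunks is a hypothetical name:

```python
from itertools import islice


def lazy_chunks(iterable, n):
    """Yield lists of up to n items from any iterable, without calling len()."""
    iterator = iter(iterable)
    while True:
        # islice pulls at most n items and leaves the rest for the next round.
        chunk = list(islice(iterator, n))
        if not chunk:
            return
        yield chunk


# Works on a plain generator, which has no len() and cannot be sliced.
print(list(lazy_chunks((x * x for x in range(7)), 3)))
```

Only one chunk is held in memory at a time, so the full sequence of names never has to be materialized.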

Or, in your case, you can use:

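from multiprocessing import Pool
from pathlib import Path

if __name__ == "__main__":
    with Pool(10) as pool:  # create the pool once and reuse it for every chunk
        for chunk in chunks(range(1, 5000), 10):  # chunk size is the same as pool size = 10
            file_names = []
            for num in chunk:
                string = "{0:06}".format(num)
                path = "filename" + string + ".npy"
                if Path(path).is_file():
                    file_names.append(path)
                else:
                    print("file not found", string)
            pool.map(Get_feature_vector, file_names)  # etc.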
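An alternative that avoids manual chunking altogether is to hand a generator of paths to Pool.imap_unordered(), which the multiprocessing documentation describes as a lazier counterpart of map(). The sketch below is an illustration under stated assumptions, not the original poster's code: existing_paths, process_all, and main are hypothetical names, the file name template mirrors the one in the question, and get_feature_vector is a placeholder standing in for the real Get_feature_vector.

```python
import multiprocessing
from pathlib import Path


def existing_paths(first, last, template="filename{:06d}.npy"):
    """Lazily yield the paths that exist on disk and report the missing ones."""
    for num in range(first, last):
        path = Path(template.format(num))
        if path.is_file():
            yield str(path)
        else:
            print("file not found", path)


def get_feature_vector(path):
    """Placeholder worker (assumption): the real one would parse and save data."""
    return path, Path(path).stat().st_size


def process_all(pool, paths, chunksize=16):
    """Feed paths to the pool and collect results in completion order."""
    return list(pool.imap_unordered(get_feature_vector, paths, chunksize=chunksize))


def main():
    # A pool of 20 workers means at most 20 files are processed concurrently.
    with multiprocessing.Pool(processes=20) as pool:
        results = process_all(pool, existing_paths(1, 5000))
    print(len(results), "files processed")
```

Call main() from under an `if __name__ == "__main__":` guard, which multiprocessing needs on platforms that start workers by spawning a fresh interpreter. The pool size, not the number of input files, bounds how many files are being processed at once, so it can be set anywhere from 20 up to the 40 available processors.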
