Speeding up the conversion of tens of thousands of .doc files to .docx with Python



I have more than 44K .doc files waiting to be converted to .docx. The code I use to convert a single .doc file is as follows:

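import os
from win32com import client

def doc2docx(doc_name):
    # Start Word, open the .doc file, and save it next to the original as .docx.
    word = client.Dispatch("Word.Application")
    doc = word.Documents.Open(doc_name)
    # Replace only the file extension, not every ".doc" in the path.
    docx_name = os.path.splitext(doc_name)[0] + ".docx"
    doc.SaveAs(docx_name, 16)  # 16 = wdFormatDocumentDefault (.docx)
    doc.Close()
    word.Quit()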

I tried the following code to convert a subset of 10 documents:

from glob import glob
from time import time

# Raw string so the backslash is not treated as an escape character.
paths = glob(r"U:\WordDocuments*.doc")
start = time()
counter = 0
for i in paths:
    doc2docx(i)
    counter += 1
    print(counter)
end = time()
duration = end - start
print("It took", duration, "seconds to process 10 doc files.")

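The code above runs without errors. However, it took more than 3 minutes to convert 10 documents. How can I speed this up? I can think of multithreading or multiprocessing, but I don't know how to implement them. Thanks a lot.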
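Before reaching for parallelism, it may be worth noting that doc2docx starts and quits Word for every single file, which is likely a large share of the time per document. Below is a minimal sketch, not a measured fix, that opens Word once and reuses it for the whole list. The names convert_all and main are hypothetical, and word is assumed to behave like the object returned by client.Dispatch("Word.Application").

```python
import os


def convert_all(word, doc_paths, wd_format_docx=16):
    """Convert every path in doc_paths using one already-running Word instance.

    `word` is assumed to behave like the object returned by
    win32com.client.Dispatch("Word.Application").
    """
    converted = []
    for doc_path in doc_paths:
        # Swap only the extension; str.replace would touch ".doc" anywhere in the path.
        docx_path = os.path.splitext(doc_path)[0] + ".docx"
        doc = word.Documents.Open(doc_path)
        try:
            doc.SaveAs(docx_path, wd_format_docx)  # 16 = wdFormatDocumentDefault
        finally:
            # Close the document even if saving fails, so Word does not keep it open.
            doc.Close()
        converted.append(docx_path)
    return converted


def main(doc_paths):
    # Imported here so convert_all can be exercised without pywin32 installed.
    from win32com import client

    word = client.Dispatch("Word.Application")
    try:
        return convert_all(word, doc_paths)
    finally:
        # Quit Word exactly once, after the whole list has been processed.
        word.Quit()
```

With this shape, main(paths) would take the place of the for loop over doc2docx in the timing script above. How much time it saves depends on how long Word takes to start on the machine in question.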

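import os
from glob import glob
from time import time
from multiprocessing import Pool
from win32com import client

def doc2docx(doc_name):
    # Runs in a worker process: convert one .doc file to .docx.
    word = client.Dispatch("Word.Application")
    doc = word.Documents.Open(doc_name)
    docx_name = os.path.splitext(doc_name)[0] + ".docx"
    doc.SaveAs(docx_name, 16)  # 16 = wdFormatDocumentDefault (.docx)
    doc.Close()
    word.Quit()
    return docx_name

A = []  # collects the results in the parent process

def pool_processing_complete(x):
    # Called once in the parent process with the list of all results.
    A.extend(x)
    duration = time() - start
    print("It took", duration, "seconds to process", len(A), "doc files.")

if __name__ == "__main__":
    # This guard is required on Windows, where worker processes re-import this module.
    paths = glob(r"U:\WordDocuments*.doc")
    start = time()
    pool = Pool()
    r = pool.map_async(doc2docx, paths, callback=pool_processing_complete)
    r.wait()
    pool.close()
    pool.join()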

The code above is an example that uses a multiprocessing pool.
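With a pool, each call to doc2docx still starts and quits Word for one file. One possible refinement, sketched here under assumptions, is to hand each worker a batch of paths so that it starts Word once per batch. The helpers split_into_batches, convert_batch, and main are hypothetical names. It is also an assumption that client.Dispatch gives every worker process its own Word instance; if workers end up sharing one, win32com.client.DispatchEx may be worth trying instead.

```python
import os


def split_into_batches(paths, batch_count):
    """Split paths into at most batch_count contiguous, near-equal batches."""
    if not paths:
        return []
    batch_count = max(1, min(batch_count, len(paths)))
    size, extra = divmod(len(paths), batch_count)
    batches, start = [], 0
    for i in range(batch_count):
        # The first `extra` batches take one additional item each.
        end = start + size + (1 if i < extra else 0)
        batches.append(paths[start:end])
        start = end
    return batches


def convert_batch(doc_paths):
    """Worker: start Word once, convert the whole batch, then quit."""
    from win32com import client  # requires pywin32 on Windows

    word = client.Dispatch("Word.Application")
    done = 0
    try:
        for doc_path in doc_paths:
            doc = word.Documents.Open(doc_path)
            try:
                doc.SaveAs(os.path.splitext(doc_path)[0] + ".docx", 16)
            finally:
                doc.Close()
            done += 1
    finally:
        word.Quit()
    return done


def main():
    from glob import glob
    from multiprocessing import Pool, cpu_count

    paths = glob(r"U:\WordDocuments*.doc")
    if not paths:
        print("No .doc files found")
        return
    workers = cpu_count()
    with Pool(workers) as pool:
        counts = pool.map(convert_batch, split_into_batches(paths, workers))
    print("Converted", sum(counts), "files")


if __name__ == "__main__":
    main()
```

One batch per worker keeps the number of Word start-ups equal to the number of processes. The trade-off is that a failure inside a batch stops the remaining files in that batch, so for 44K files smaller batches with per-file error handling may be the safer choice.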
