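My code should take every PDF in a directory, OCR it, and return a .txt file for each OCR'd PDF. The PDF and the .txt file should have the same name, with .pdf changed to .txt. I'm stuck on the part where I split the input PDF name to produce the same name with a .txt extension for the OCR'd file. A sample file in the directory looks like this: "000dbf9d-d53f-465f-a7ce-722722136fb7465.pdf". I need the output to be "000dbf9d-d53f-465f-a7ce-722722136fb7465.txt". Also, my code does not create a new .txt file; it overwrites the same file on every iteration. I need a new .txt file for each OCR'd .pdf file. Code so far: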
import io
import glob
from PIL import Image
import pytesseract
from wand.image import Image as wi
files = glob.glob(r"D:\files\**")
for file in files:
    #print(file)
    pdf = wi(filename = file, resolution = 300)
    pdfImg = pdf.convert('jpeg')
    imgBlobs = []
    for img in pdfImg.sequence:
        page = wi(image = img)
        imgBlobs.append(page.make_blob('jpeg'))
    extracted_texts = []
    for imgBlob in imgBlobs:
        im = Image.open(io.BytesIO(imgBlob))
        text = pytesseract.image_to_string(im, lang = 'eng')
        extracted_texts.append(text)
    with open("D:\\extracted_text\\"+ "\\file1.txt", 'w') as f:
        f.write(str(extracted_texts))
You just need to keep track of the file name and reuse it in the last two lines:
# ...
import os
files = glob.glob(r"D:\files\**")
for file in files:
    #print(file)
    # Get the name of the file less any suffixes
    name = os.path.basename(file).split('.')[0]
    # ...
    # Use `name` from above to name your text file
    with open("D:\\extracted_text\\" + name + ".txt", 'w') as f:
        f.write(str(extracted_texts))
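As a variation on the naming step, here is a minimal sketch that uses `pathlib` instead of `split('.')`. The helper names `txt_path_for` and `write_pages` are hypothetical (they are not part of the code above), and the sketch assumes you want the page texts joined with newlines rather than written as the `str()` of a list. Note that `Path.stem` drops only the last suffix, so a name like `report.v2.pdf` would become `report.v2.txt`, whereas `split('.')[0]` would give `report`.

```python
from pathlib import Path


def txt_path_for(pdf_path, out_dir):
    """Return out_dir/<pdf name without its last suffix>.txt (hypothetical helper)."""
    return Path(out_dir) / (Path(pdf_path).stem + ".txt")


def write_pages(pages, txt_path):
    """Write the OCR'd page texts to txt_path, one page after another (hypothetical helper)."""
    txt_path = Path(txt_path)
    # Create the output directory if it does not exist yet.
    txt_path.parent.mkdir(parents=True, exist_ok=True)
    # Join the pages with newlines instead of writing the list's repr.
    txt_path.write_text("\n".join(pages), encoding="utf-8")
```

Inside the loop you would then call something like `write_pages(extracted_texts, txt_path_for(file, "D:\\extracted_text"))`, so each PDF gets its own output file.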