How to extract text from a PDF in the correct order using PyPDF2

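I'm currently working on a project that extracts the contents of PDFs. The code runs fine and I can extract the text, but the extracted text comes out in the wrong order. The order is jumbled: it doesn't run from top to bottom, which is really confusing.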

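I've searched online, but found very little help on how to control the order of text extraction. Most tutorials produce the same result. For reference, this is the PDF I'm currently testing with (page 5): https://www.pidm.gov.my/PIDM/files/13/134b5c79-5319-4199-ac68-99f62aca6047.pdf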

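import PyPDF2

with open('pdftest2.pdf', 'rb') as pdfTest:
    # PdfFileReader, getPage and extractText are the pre-3.0 PyPDF2 names
    reader = PyPDF2.PdfFileReader(pdfTest)
    page5 = reader.getPage(4)  # pages are zero-indexed, so index 4 is page 5
    text = page5.extractText()
    print(text)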

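The extracted text always starts at the page footer and then works from the bottom up. I noticed that on the next page it starts from the top, but only for a few specific sentences; then it pulls text from a different part of the page instead of continuing where it left off.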

All of the text is extracted, but the order is all over the place. Is there any solution to this?

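I had to deal with a similar problem and found that the pdfplumber module worked better than PyPDF. I think it depends on the document itself, so you should give it a try.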
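As a rough sketch of what trying pdfplumber could look like: the code below reads the words of one page with extract_words() and rebuilds the lines top to bottom, then left to right. The helper names group_words_into_lines and extract_page_text, and the y_tolerance value, are illustrative assumptions rather than part of pdfplumber, and this simple grouping assumes a single-column layout.

```python
def group_words_into_lines(words, y_tolerance=3.0):
    """Join word dicts (keys: "text", "x0", "top") into lines in reading order.

    Words whose "top" lies within y_tolerance of the first word of the
    current line are treated as belonging to that line (an assumption that
    suits single-column pages).
    """
    lines = []  # each entry: (top of the line's first word, [words on the line])
    for word in sorted(words, key=lambda w: (w["top"], w["x0"])):
        if lines and abs(word["top"] - lines[-1][0]) <= y_tolerance:
            lines[-1][1].append(word)
        else:
            lines.append((word["top"], [word]))
    # Within each line, order the words left to right before joining them
    return "\n".join(
        " ".join(w["text"] for w in sorted(line, key=lambda w: w["x0"]))
        for _, line in lines
    )


def extract_page_text(pdf_path, page_index):
    """Return the text of one page (zero-indexed), ordered by position."""
    import pdfplumber  # third-party: pip install pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_index]
        return group_words_into_lines(page.extract_words())
```

For the PDF in the question this would be called as extract_page_text('pdftest2.pdf', 4). pdfplumber's own page.extract_text() is the one-call alternative and may already be good enough; the helper is only there to make the position-based ordering explicit.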

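Otherwise, another option is to treat the PDF as images using the pdf2image module and extract the text from them with pytesseract. It may not be ideal, though, because pdf2image's convert_from_path function can take quite a long time to run.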

In case you're interested, I'll put some code here.

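First, make sure you have installed all the necessary dependencies, as well as Tesseract and ImageMagick. You can find installation instructions on their websites. If you are on Windows, there is a good Medium article about it.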

To convert the PDF to images with pdf2image:

If you are on Windows, don't forget to pass the poppler path. It should look like r'C:\<your_path>\poppler-21.02.0\Library\bin'

import os
from pdf2image import convert_from_path

def pdftoimg(fic, output_folder, poppler_path):
    # Render every page of the PDF; intermediate files are written to output_folder
    pages = convert_from_path(fic, dpi=500, output_folder=output_folder, thread_count=9, poppler_path=poppler_path)
    # Save each page as a JPEG named page_<index>.jpg
    for image_counter, page in enumerate(pages):
        filename = "page_" + str(image_counter) + ".jpg"
        page.save(os.path.join(output_folder, filename), 'JPEG')

    # Remove the intermediate .ppm files left behind by convert_from_path
    for i in os.listdir(output_folder):
        if i.endswith('.ppm'):
            os.remove(os.path.join(output_folder, i))

To extract the text from an image:

Your Tesseract path will look something like this: r'C:\Program Files\Tesseract-OCR\tesseract.exe'

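from PIL import Image
import pytesseract

def imgtotext(img, tesseract_path):
    # Point pytesseract at the Tesseract executable
    pytesseract.pytesseract.tesseract_cmd = tesseract_path
    # Recognize the text in the image as a string
    text = pytesseract.image_to_string(Image.open(img))
    # Rejoin words that were hyphenated across a line break
    text = text.replace('-\n', '')
    return text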

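I recently started using PyMuPDF. Its licensing is a bit confusing, but some of its methods can sort the text so that it comes out in natural reading order (left to right, top to bottom). All it takes is something like page.get_text("words", sort=True).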
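A minimal sketch of that call in context, assuming PyMuPDF is installed and importable as fitz. The function names page_words_sorted and words_to_string are made up for illustration. Each entry returned by get_text("words") is a tuple whose fifth element (index 4) is the word itself.

```python
def words_to_string(words):
    """Join PyMuPDF word tuples (x0, y0, x1, y1, word, block, line, word_no)."""
    return " ".join(w[4] for w in words)


def page_words_sorted(pdf_path, page_index):
    """Return the word tuples of one page (zero-indexed), sorted by position."""
    import fitz  # PyMuPDF, third-party: pip install PyMuPDF

    with fitz.open(pdf_path) as doc:
        page = doc[page_index]
        # sort=True asks PyMuPDF to order the words top to bottom, left to right
        return page.get_text("words", sort=True)
```

For the PDF in the question, print(words_to_string(page_words_sorted('pdftest2.pdf', 4))) would print page 5 as one run of words. If line breaks matter, the coordinates in each tuple can be used to split the words into lines.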
