Extract text in a rectangle from pdf - Python

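I need to extract the text inside a rectangle in a PDF. I have tested several approaches, but none of them returned that specific text. For example, I tried the PyMuPDF, pdfplumber, tabula, camelot, and pdftables packages. The PyMuPDF module asks for start and end words to extract text. As far as I understand, the remaining packages only extract line and curve information, not the text.

I want to get the text from a rectangle in the PDF without providing any start and end text.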

https://drive.google.com/file/d/1wCvik7VbEvDwbT-mapgXc8fwlq7Ao3BP/view?usp=sharing

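You can use the code below:

    import PyPDF2

    # These are the legacy PyPDF2 names (PyPDF2 < 3.0). In PyPDF2 3.x they were
    # renamed to PdfReader, len(reader.pages), reader.pages[i] and extract_text().
    def convert_pdf_to_text(document):
        read_pdf = PyPDF2.PdfFileReader(document, strict=False)
        number_of_pages = read_pdf.getNumPages()
        alltext1 = ""
        for page_number in range(number_of_pages):
            page = read_pdf.getPage(page_number)
            alltext1 += page.extractText()  # text of the whole page, not of a region
        return alltext1.replace("\n", "")  # drop line breaks

    convert_pdf_to_text('pdf_test.pdf')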

Output:

'A Simple PDF File  This is a small demonstration .pdf file - just for use in the Virtual Mechanics tutorials. More text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. Boring, zzzzz. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. Even more. Continued on page 2 ...  Details  State: State_name     City: City_name    Country: Country_name     Rig No: 4455555  Source Id: k4-3k44 '

You can use the Page.get_textbox method of the PyMuPDF module.

For example:

import fitz
doc = fitz.open('pdf_test.pdf')
page = doc[0]  # get first page
rect = fitz.Rect(0, 0, 600, page.rect.width)  # Rect(x0, y0, x1, y1) in points; define your rectangle here
text = page.get_textbox(rect)  # get text from rectangle
clean_text = ' '.join(text.split())
print(clean_text)
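
If you need more control over which words count as "inside" the rectangle, the same idea can be sketched in plain Python. The sketch below assumes word tuples in the layout that PyMuPDF's page.get_text("words") returns, (x0, y0, x1, y1, text, block_no, line_no, word_no). The function name text_in_rect, the sample coordinates, and the "fully contained" rule are illustrative assumptions, not part of PyMuPDF.

```python
from typing import Iterable, Tuple

# One word in the layout returned by PyMuPDF's page.get_text("words"):
# (x0, y0, x1, y1, text, block_no, line_no, word_no)
Word = Tuple[float, float, float, float, str, int, int, int]
Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def text_in_rect(words: Iterable[Word], rect: Box) -> str:
    """Join the words whose bounding boxes lie fully inside rect (hypothetical helper)."""
    rx0, ry0, rx1, ry1 = rect
    inside = [
        w for w in words
        if w[0] >= rx0 and w[1] >= ry0 and w[2] <= rx1 and w[3] <= ry1
    ]
    # Restore extraction order: block number, then line number, then word number.
    inside.sort(key=lambda w: (w[5], w[6], w[7]))
    return " ".join(w[4] for w in inside)


# Made-up sample data; with PyMuPDF you would pass page.get_text("words") instead.
sample_words = [
    (104.0, 700.0, 160.0, 712.0, "State_name", 3, 0, 1),
    (72.0, 700.0, 100.0, 712.0, "State:", 3, 0, 0),
    (72.0, 72.0, 80.0, 84.0, "A", 0, 0, 0),
    (84.0, 72.0, 120.0, 84.0, "Simple", 0, 0, 1),
]

print(text_in_rect(sample_words, (60, 690, 300, 720)))
```

Requiring full containment drops words that straddle the rectangle border. If you would rather keep those, test for overlap instead of containment.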
