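I need to extract the text inside a rectangle in a PDF. I have tried several approaches but could not get just that text. For example, I tested the PyMuPDF, pdfplumber, tabula, camelot, and pdftables packages. With PyMuPDF, the approach I found requires a start word and an end word to extract the text. As far as I can tell, the other packages only extract line and curve information, not text.
I want to get the text from a rectangle in a PDF without having to supply any start or end text.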
https://drive.google.com/file/d/1wCvik7VbEvDwbT-mapgXc8fwlq7Ao3BP/view?usp=sharing
You can use the code below:
import PyPDF2

def convert_pdf_to_text(document):
    # PyPDF2 3.x names; older releases used PdfFileReader, getPage and extractText
    read_pdf = PyPDF2.PdfReader(document, strict=False)
    alltext1 = ""
    for page in read_pdf.pages:
        alltext1 += page.extract_text()
    # remove line breaks so the result is a single string
    return alltext1.replace("\n", "")

convert_pdf_to_text('pdf_test.pdf')
Output:
'A Simple PDF File This is a small demonstration .pdf file - just for use in the Virtual Mechanics tutorials. More text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. Boring, zzzzz. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. Even more. Continued on page 2 ... Details State: State_name City: City_name Country: Country_name Rig No: 4455555 Source Id: k4-3k44 '
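If the whole-document text is all you have, the fields that sit in the rectangle can still be pulled out of that string. Below is a minimal sketch, assuming the rectangle's contents follow the "Label: value" pattern seen at the end of the output above. The function name `parse_fields` and the `LABELS` list are illustrative, with the labels copied from the sample output; adjust them for your own file.

```python
import re

# Labels copied from the sample output above (an assumption about your PDF).
LABELS = ["State", "City", "Country", "Rig No", "Source Id"]

def parse_fields(text, labels=LABELS):
    """Split 'Label: value Label: value ...' text into a dict."""
    pattern = "|".join(re.escape(label) for label in labels)
    # Splitting on a capturing group keeps the labels in the result:
    # [text before first label, label1, value1, label2, value2, ...]
    parts = re.split(rf"({pattern}):", text)
    return {
        label: value.strip()
        for label, value in zip(parts[1::2], parts[2::2])
    }

sample = ("Continued on page 2 ... Details State: State_name City: City_name "
          "Country: Country_name Rig No: 4455555 Source Id: k4-3k44 ")
fields = parse_fields(sample)
print(fields["Rig No"])
```

This only works when the labels are known in advance; it does not use the rectangle's position on the page at all.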
You can use the Page.get_textbox method from the PyMuPDF module.
For example:
import fitz
doc = fitz.open('pdf_test.pdf')
page = doc[0] # get first page
# Rect takes (x0, y0, x1, y1); this covers the full page width, top 600 points
rect = fitz.Rect(0, 0, page.rect.width, 600) # define your rectangle here
text = page.get_textbox(rect) # get text from rectangle
clean_text = ' '.join(text.split()) # collapse line breaks and repeated spaces
print(clean_text)
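To make the idea of "text inside a rectangle" concrete, here is a rough pure-Python sketch of the selection step. It assumes word boxes in the tuple layout that PyMuPDF's `page.get_text("words")` returns, `(x0, y0, x1, y1, word, block_no, line_no, word_no)`. The helper name `words_in_rect`, the `min_overlap` threshold, and the sample coordinates are made up for illustration and are not part of PyMuPDF.

```python
def words_in_rect(words, rect, min_overlap=0.5):
    """Join the words whose boxes lie mostly inside rect.

    words: iterable of tuples starting with (x0, y0, x1, y1, text).
    rect:  (x0, y0, x1, y1) in the same coordinate system.
    """
    rx0, ry0, rx1, ry1 = rect
    picked = []
    for x0, y0, x1, y1, text, *_ in words:
        area = (x1 - x0) * (y1 - y0)
        if area <= 0:
            continue  # skip degenerate boxes
        # Width and height of the intersection of word box and rectangle.
        iw = min(x1, rx1) - max(x0, rx0)
        ih = min(y1, ry1) - max(y0, ry0)
        if iw > 0 and ih > 0 and (iw * ih) / area >= min_overlap:
            picked.append(text)
    # Input order is kept, so words stay in the order they were given.
    return " ".join(picked)

# Hypothetical word boxes, not taken from the real pdf_test.pdf.
sample_words = [
    (50, 40, 90, 52, "Outside", 0, 0, 0),
    (110, 110, 140, 122, "State:", 1, 0, 0),
    (145, 110, 210, 122, "State_name", 1, 0, 1),
    (110, 130, 135, 142, "City:", 1, 1, 0),
    (140, 130, 200, 142, "City_name", 1, 1, 1),
]
print(words_in_rect(sample_words, (100, 100, 300, 200)))
```

With a real page, something like `words_in_rect(page.get_text("words"), tuple(rect))` should work. If the rectangle is drawn in the PDF rather than known in advance, the `"rect"` entries of `page.get_drawings()` may be a source of candidate rectangles, though that depends on how the file was produced.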
Related documentation:
Page.get_textbox
fitz.Rect
fitz.Page