Extracting text from a PDF URL with io and PyPDF2 gives no output



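I'm trying to extract text from a PDF URL. If I download the PDF first, I can easily extract the text with slate. But when I load the PDF through io and try to extract the text, the output comes back empty. The code is shown below.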

import io

import PyPDF2
import requests

url = 'https://www.poderjudicial.es/search/contenidos.action?action=accessToPDF&publicinterface=true&tab=AN&reference=e3ca421447bc6b71&encode=true&optimize=20210216&databasematch=AN'
response = requests.get(url)
# Wrap the downloaded bytes in an in-memory file object
f = io.BytesIO(response.content)
with f as data:
    read_pdf = PyPDF2.PdfFileReader(data)
    # Page indices are zero-based, so this is the second page
    page = read_pdf.getPage(1)
    print(page.extractText())

I've tried a bunch of other functions, but none of them work. Am I doing something wrong?

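It gave me blank output too, and I'm not sure why. Have you tried pdfminer3, though? It gave me the correct text. The code below produces the correct output for that file.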

import io

import requests
from pdfminer3.converter import TextConverter
from pdfminer3.layout import LAParams
from pdfminer3.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer3.pdfpage import PDFPage

resource_manager = PDFResourceManager()
# The converter writes the extracted text into this in-memory buffer
fake_file_handle = io.StringIO()
converter = TextConverter(resource_manager, fake_file_handle, laparams=LAParams())
page_interpreter = PDFPageInterpreter(resource_manager, converter)

url = 'https://www.poderjudicial.es/search/contenidos.action?action=accessToPDF&publicinterface=true&tab=AN&reference=e3ca421447bc6b71&encode=true&optimize=20210216&databasematch=AN'
response = requests.get(url)
f = io.BytesIO(response.content)

with f as fh:
    for page in PDFPage.get_pages(fh,
                                  caching=True,
                                  check_extractable=True):
        page_interpreter.process_page(page)
    text = fake_file_handle.getvalue()

# close open handles
converter.close()
fake_file_handle.close()

print(text)

Also take a look at this article on how to use pdfminer3.
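
One more thing that may be worth ruling out before blaming the extraction library: that the bytes fetched from the URL really are a PDF and not, say, an HTML error page. The sketch below is only a rough diagnostic, not a fix. The helper names `looks_like_pdf` and `describe_payload` are made up for illustration, and the 1024-byte search window is an assumption based on the common convention that readers accept the `%PDF-` header anywhere near the start of the file.

```python
PDF_MAGIC = b"%PDF-"


def looks_like_pdf(data: bytes, search_window: int = 1024) -> bool:
    """Return True if the PDF header marker appears near the start of data.

    Hypothetical helper: search_window is an assumed tolerance, not a
    value taken from the question or the answer.
    """
    return PDF_MAGIC in data[:search_window]


def describe_payload(data: bytes) -> str:
    """Summarise what a downloaded payload appears to contain."""
    if not data:
        return "empty response body"
    if looks_like_pdf(data):
        return "PDF data ({} bytes)".format(len(data))
    # Show the first bytes so an HTML page or error message is easy to spot
    return "not a PDF; starts with {!r}".format(data[:20])


# With the `response` object from the snippets above, one might call:
#     print(describe_payload(response.content))
```

If this reports PDF data and the text is still blank, the download is probably fine and the problem lies in how the library decodes that particular file.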
