Problem reading a PDF to XML in memory using PDFMiner.Six

Updated: 2023-09-22

Consider the following code snippet:

import io
from pdfminer.high_level import extract_text_to_fp

result = io.StringIO()
with open("file.pdf") as fp:
    extract_text_to_fp(fp, result, output_type='xml')
    data = result.getvalue()

This results in the following error:

ValueError: Codec is required for a binary I/O output

If I omit output_type, I get this error instead:

UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 3804: character maps to <undefined>

I don't understand why this happens, and I would appreciate help solving it.

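I found a way to solve the problem. First, open "file.pdf" in binary mode: a PDF is binary data, and text mode tries to decode it with the platform's default encoding (the 'charmap' codec in the traceback), which is where the UnicodeDecodeError comes from. Then, if you want to read the result into memory, use BytesIO instead of StringIO and decode the bytes afterwards. For example: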

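import io
from pdfminer.high_level import extract_text_to_fp

# Collect the XML output as bytes in memory
result = io.BytesIO()
# 'rb': read the PDF as raw bytes, not as text
with open("file.pdf", 'rb') as fp:
    extract_text_to_fp(fp, result, output_type='xml')
    # Decode the collected bytes into a str
    data = result.getvalue().decode("utf-8")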
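
The UnicodeDecodeError from the first attempt can be reproduced without PDFMiner at all. The sketch below uses only the standard library. The sample bytes are a made-up stand-in for a real PDF, and cp1252 is passed explicitly on the assumption that it matches the default text encoding of the machine in the question (the 'charmap' wording in the traceback suggests a code page of that kind).

```python
import os
import tempfile

# Hypothetical stand-in for a PDF: a header line plus one byte (0x8f)
# that has no mapping in the cp1252 code page.
sample = b"%PDF-1.4\n\x8f\n"

with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as tmp:
    tmp.write(sample)
    path = tmp.name

try:
    # Text mode decodes the bytes while reading. cp1252 is given
    # explicitly here to mimic a typical Windows default (an assumption).
    try:
        with open(path, encoding="cp1252") as fp:
            fp.read()
        text_mode_error = None
    except UnicodeDecodeError as exc:
        text_mode_error = exc

    # Binary mode hands back the raw bytes without any decoding.
    with open(path, "rb") as fp:
        raw = fp.read()
finally:
    os.remove(path)

print(type(text_mode_error).__name__)
print(raw == sample)
```

Reading in text mode fails on the undecodable byte, while binary mode returns the data unchanged, which is why the fix above opens the PDF with 'rb'.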
