I tried using PyPDF2 and pdfminer.six to extract the metadata. With PyPDF2 I ran:
from PyPDF2 import PdfFileReader
reader = PdfFileReader("example.pdf")
info = reader.getDocumentInfo()  # call it on `reader`; `pdf` was never defined
print(info)
The response I get:
{'/Title': IndirectObject(38, 0), '/Author': IndirectObject(40, 0), '/Subject': IndirectObject(41, 0), '/Producer': IndirectObject(39, 0), '/Creator': IndirectObject(42, 0), '/CreationDate': IndirectObject(43, 0), '/ModDate': IndirectObject(43, 0)}
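The values printed above are unresolved references, not the metadata strings themselves. A minimal sketch of resolving them into a plain dict follows. It assumes that each indirect reference exposes a getObject() method, as IndirectObject does in PyPDF2 1.x; resolve_info is a hypothetical helper name, not part of PyPDF2.

```python
def resolve_info(info):
    """Return a plain dict with indirect references replaced by their targets."""
    resolved = {}
    for key, value in info.items():
        # Assumption: indirect references expose getObject(), as in PyPDF2 1.x.
        getter = getattr(value, "getObject", None)
        # Values without such a method are copied through unchanged.
        resolved[key] = getter() if callable(getter) else value
    return resolved
```

Under those assumptions, resolve_info(info) would give a dict mapping keys such as '/Title' to the values the references point to.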
Using pdfrw
With pdfrw, it works like this:
>>> from pdfrw import PdfReader
>>> PdfReader("example.pdf").Info
This is now part of the PyPDF2 documentation:
from PyPDF2 import PdfFileReader
reader = PdfFileReader("example.pdf")
info = reader.getDocumentInfo()
print(reader.numPages)
# All of the following could be None!
print(info.author)
print(info.creator)
print(info.producer)
print(info.subject)
print(info.title)
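Since any of these attributes could be None, it can help to collect only the ones that are set before printing. This is a rough sketch, not something from the PyPDF2 docs: present_metadata and FIELDS are hypothetical names, and the only assumption about `info` is that it has the attributes used in the snippet above.

```python
FIELDS = ("author", "creator", "producer", "subject", "title")

def present_metadata(info, fields=FIELDS):
    """Collect only the metadata attributes that are actually set."""
    found = {}
    for name in fields:
        # A missing attribute is treated the same as an attribute set to None.
        value = getattr(info, name, None)
        if value is not None:
            found[name] = value
    return found
```

Usage would look like present_metadata(reader.getDocumentInfo()), which returns a dict containing only the fields the PDF actually defines.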