PDF scraping with PyPDF2 won't load text



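I'm trying to extract all the text from a list of PDFs, but I get an error when extracting text from the page object. Any idea what's causing it?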

ls = os.listdir(resumes)
pdf = [s for s in ls if '.pdf' in s]
print(pdf)
for p in pdf:
    pdfFileObj = open(os.path.join(resumes, p), 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    print(pdfReader.numPages)
    pageObj = pdfReader.getPage(0)
    print(pageObj.extractText())
    pdfFileObj.close()

Error:

  File "C:\Program Files\Python39\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u0141' in position 305: character maps to <undefined>

Try this with pdfplumber:

import os
import pdfplumber

# Raw string without a trailing backslash: "C:\path\to\resumes\" is a syntax error
resumes = r"C:\path\to\resumes"
pdf_files = [s for s in os.listdir(resumes) if s.lower().endswith('.pdf')]
for pdf_file in pdf_files:
    pdf_path = os.path.join(resumes, pdf_file)
    alltext = ""  # reset for each document
    with pdfplumber.open(pdf_path) as pdf:  # closes the file when done
        print(len(pdf.pages))
        for page in pdf.pages:  # extract text from every page of the document
            text = page.extract_text()
            if text is None:  # skip pages with no extractable text
                continue
            alltext += text
    print(alltext)  # print all the text of this document
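
A note on the traceback in the question: it is raised inside cp1252.py while encoding, which suggests print() is failing because the Windows console code page (cp1252) has no mapping for U+0141, rather than the PDF library failing to read the file. If that is the case, the same error can appear with any extraction library as soon as the text is printed. Below is a minimal sketch of two workarounds. The helper names printable and write_text_utf8 are made up for illustration and are not part of PyPDF2 or pdfplumber.

```python
def printable(text, encoding):
    """Return text with characters the target encoding cannot
    represent replaced by '?', so printing it cannot raise
    UnicodeEncodeError on a console using that encoding."""
    return text.encode(encoding, errors="replace").decode(encoding)


def write_text_utf8(text, path):
    """Write extracted text to a file as UTF-8, independent of
    whatever code page the console happens to use."""
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(text)


# Sample string containing U+0141, the character from the traceback
sample = "\u0141ukasz"
print(printable(sample, "cp1252"))
```

Another option, assuming Python 3.7 or later, is to call sys.stdout.reconfigure(encoding="utf-8") once at the start of the script, or to set the PYTHONIOENCODING=utf-8 environment variable before running it. Writing to a UTF-8 file keeps the original characters, whereas printable() deliberately loses them.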
