Extracting email addresses, first names, and last names from multiple PDF files in a folder



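For a work project, I am trying to extract the following information from every PDF file in a folder (the PDFs are résumés): email address, first name, and last name.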

I have managed to extract the email addresses with this code:

from io import StringIO
import os
import re

from pdfminer3.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer3.converter import TextConverter
from pdfminer3.layout import LAParams
from pdfminer3.pdfpage import PDFPage

working_directory = './folder'
file_list = []    # full paths of all PDF files found
email_list = {}   # maps each PDF path to the first email address found in it

# Collect every PDF under working_directory, including subfolders
for subdir, dirs, files in os.walk(working_directory):
    for file in files:
        if file.endswith('.pdf'):
            file_list.append(os.path.join(subdir, file))

for input_file in file_list:
    # Convert all pages of the PDF to plain text
    pagenums = set()   # an empty set means "all pages"
    output = StringIO()
    manager = PDFResourceManager()
    converter = TextConverter(manager, output, laparams=LAParams())
    interpreter = PDFPageInterpreter(manager, converter)

    with open(input_file, 'rb') as infile:
        for page in PDFPage.get_pages(infile, pagenums):
            interpreter.process_page(page)

    converter.close()
    text = output.getvalue()
    output.close()

    # Keep the first string that looks like an email address, if any
    match = re.search(r'[\w.-]+@[a-z0-9.-]+', text)
    if match is not None:
        email_list[input_file] = match.group(0)
        print(email_list[input_file])

print(email_list)

However, I am having trouble extracting the first and last names. Any help would be greatly appreciated!

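You can find the email address because there is a pattern behind it:

match = re.search(r'[\w.-]+@[a-z0-9.-]+', text)

You will have to work out a similar rule for locating the first and last name in your PDF files.

It could be a specific field that follows "Dear", for example.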
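As a rough illustration of such a rule, here is a minimal sketch. It assumes the name is written as two capitalized words, either right after "Dear" or alone on the first non-empty line of the extracted text (a common résumé layout, but only an assumption). The helper names `name_after_salutation`, `name_from_first_line`, and `guess_name` are hypothetical and not part of pdfminer3.

```python
import re

# Assumption: a name is two capitalized words, each optionally hyphenated,
# e.g. "Jane Doe" or "Anne-Marie Smith". Real resumes will not always fit.
NAME = r"([A-Z][a-z]+(?:-[A-Z][a-z]+)?)[ \t]+([A-Z][a-z]+(?:-[A-Z][a-z]+)?)"


def name_after_salutation(text):
    """Return (first, last) for the two words following 'Dear', or None."""
    match = re.search(r"\bDear[ \t]+" + NAME, text)
    return match.groups() if match else None


def name_from_first_line(text):
    """Return (first, last) if the first non-empty line is exactly a name."""
    for line in text.splitlines():
        line = line.strip()
        if line:
            match = re.fullmatch(NAME, line)
            return match.groups() if match else None
    return None


def guess_name(text):
    """Try the salutation rule first, then fall back to the first line."""
    return name_after_salutation(text) or name_from_first_line(text)
```

In the loop from the question, this could be called as `guess_name(text)` right after the email search. Expect false positives: "Dear Hiring Manager" or a first line such as "Curriculum Vitae" also match the pattern, so the rule needs to be adapted to how your résumés are actually laid out.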
