How to conditionally combine multiple pages into one PDF file



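I use a regular expression to extract student IDs from a large PDF file and save each student's page as a separate PDF file. I have been using this program for a long time, but I have now run into a problem: sometimes two pages are associated with the same student ID, so those two pages have to be merged into one PDF file.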

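I created two lists, one for the page numbers and one for the student IDs. As you can see in the std_ID list, the entries at indexes 2 and 3 are duplicates, which means pages 2 and 3 must be combined into a single file, producing 4 PDF files in total.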

pagesNumber = [0, 1, 2, 3, 4] 
std_ID = ['331401142', '439233718', '440113239', '440113239', '440113245']

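The snippet I currently use does generate 4 PDF files, but the page at index 2 is ignored completely, so 440113239.pdf contains only the page at index 3. Can anyone help?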

def split(self, folder):
    # file_base_name = self.pdf.replace('pdf', '')
    output_folder_path = os.path.join(os.getcwd(), folder)
    for page, ID in zip(pagesNumber, std_ID):
        pdfWriter = PdfFileWriter()
        pdfWriter.addPage(reader.getPage(page))
        with open(os.path.join(output_folder_path, '{0}.pdf'.format(ID)), 'wb') as f:
            pdfWriter.write(f)

Full code:

from PyPDF2 import PdfFileReader, PdfFileWriter
from time import time
import re, os


class Extract_Lab:
    # These names are shared module-level globals, not class attributes
    global folder_Path, std_ID, pageNumber
    folder_name = "Rayat_DataRAYAT_FILES"
    folder_Path = os.path.join(os.getcwd(), folder_name)
    pageNumber = []
    std_ID = []

    def __init__(self, pattern, pdf):
        self.pattern = pattern
        self.pdf = pdf

    def run(self):
        # Scan every page and record each matched ID with its page index
        file = open(os.path.join(folder_Path, f"{self.pdf}.pdf"), 'rb')
        global reader
        reader = PdfFileReader(file)
        for page in range(reader.numPages):
            sevPage = reader.getPage(page)
            pdfData = str(sevPage.extractText())
            match = re.findall(self.pattern, pdfData)
            for m in match:
                std_ID.append(m)
                pageNumber.append(page)
        return std_ID, pageNumber

    def split(self, folder):
        # file_base_name = self.pdf.replace('pdf', '')
        output_folder_path = os.path.join(os.getcwd(), folder)
        # Writes one single-page PDF per (page, ID) pair
        for page, ID in zip(pageNumber, std_ID):
            pdfWriter = PdfFileWriter()
            pdfWriter.addPage(reader.getPage(page))
            with open(os.path.join(output_folder_path, '{0}.pdf'.format(ID)), 'wb') as f:
                pdfWriter.write(f)


def main():
    start = time()

    itmatch = r'((?!1750111)[1|2|3|4]\d{8})'
    file_pdf = "SS05"
    obj = Extract_Lab(itmatch, file_pdf)
    lab, page = obj.run()
    obj.split(r"Rayat_DataTrainees_Tables")

    print(f'Time taken: {time() - start}')


if __name__ == '__main__':
    main()

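This assumes that std_ID and pagesNumber have the same length and that std_ID is ordered so that duplicate IDs sit next to each other.

If that assumption holds, you can implement a pointer and a counter. Use the counter to find how many consecutive IDs are duplicates; the pointer is needed to index into the lists.

Example: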

pointer = 0
length = len(pagesNumber)
groups = []  # one (ID, [pages]) entry per output PDF

while pointer < length:
    start = pointer
    count = 1
    # Count how many duplicate IDs are in a row, without running
    # past the end of the list
    while pointer + 1 < length and std_ID[pointer] == std_ID[pointer + 1]:
        count += 1
        pointer += 1

    # pagesNumber[start:pointer + 1] holds the `count` pages for this ID.
    # If count > 1, join those pages into one PDF;
    # otherwise just process the single page.
    groups.append((std_ID[start], pagesNumber[start:pointer + 1]))
    pointer += 1
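
As an alternative to the pointer and counter, here is a minimal sketch that groups pages by ID with a dictionary and then writes one file per ID. This is a different technique from the loop above and does not require the duplicates to be adjacent. The helper names `group_pages_by_id` and `write_grouped_pdfs` are made up for illustration, and the writing step assumes the same legacy PyPDF2 calls the question already uses (`PdfFileWriter`, `addPage`, `getPage`).

```python
import os


def group_pages_by_id(page_numbers, ids):
    """Map each ID to the list of pages that carry it, in first-seen order."""
    groups = {}
    for page, student_id in zip(page_numbers, ids):
        groups.setdefault(student_id, []).append(page)
    return groups


def write_grouped_pdfs(reader, groups, output_folder):
    """Write one PDF per ID containing all of that ID's pages.

    Hypothetical helper; assumes the legacy PyPDF2 API from the question.
    """
    from PyPDF2 import PdfFileWriter  # imported here; only this step needs it

    for student_id, pages in groups.items():
        writer = PdfFileWriter()
        # Add every page for this ID before writing, so nothing is overwritten
        for page in pages:
            writer.addPage(reader.getPage(page))
        with open(os.path.join(output_folder, f"{student_id}.pdf"), "wb") as f:
            writer.write(f)


# Sample data from the question
pagesNumber = [0, 1, 2, 3, 4]
std_ID = ['331401142', '439233718', '440113239', '440113239', '440113245']
groups = group_pages_by_id(pagesNumber, std_ID)
```

With the sample lists, `groups` has four keys and `groups['440113239']` is `[2, 3]`, so calling `write_grouped_pdfs(reader, groups, output_folder_path)` inside `split` should produce four files, with both pages in 440113239.pdf. The key difference from the original loop is that each file is opened once, after all of its pages have been added to the writer.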
