Exporting data from a PDF file to Excel using regular expressions



I am using regex to extract certain strings from a PDF file and write them to an Excel file. The content of my PDF file looks like this:

Task 1: Question 1? answer1
Task 2: Question 2? (Format:****) answer2
Task 3: Question 3? answer3
Task 4: Question 4? (Format:*****) answer4

What I want to do is ignore the (Format:****) parts. For the other lines the regex works fine. How can I do this? The Excel file should look like this:

[Image: Excel]

Here is my code:

import re
import pandas as pd
from pdfminer.high_level import extract_pages, extract_text
text = extract_text("file.pdf")
pattern1 = re.compile(r":\s*(.*\?)")
pattern2 = re.compile(r".*\?\s*(.*)")
matches1 = pattern1.findall(text)
matches2 = pattern2.findall(text)
df = pd.DataFrame({'Soru-TR': matches1})
df['Cevap'] = matches2
writer = pd.ExcelWriter('Questions.xlsx', engine='xlsxwriter')
df.to_excel(writer, sheet_name='Sheet1', index=False)
writer.save()

You can use a single pattern with two capture groups, and optionally match the part between parentheses after matching the question mark.

^[^:]*:\s*([^?]+\?)\s+(?:\([^()]*\)\s)?(.*)

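Explanation

  • ^ Start of the string (or of each line when re.MULTILINE is used)
  • [^:]*: Match any characters except :, then match :
  • \s* Match optional whitespace characters
  • ([^?]+\?) Capture group 1: match 1+ characters other than ?, then match ?
  • \s+ Match 1+ whitespace characters
  • (?:\([^()]*\)\s)? Optionally match from an opening to a closing parenthesis (...), followed by a whitespace character
  • (.*) Capture group 2: match the rest of the line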
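
The breakdown above can also be written directly into the pattern. The following is a minimal sketch, assuming Python's re.VERBOSE flag is acceptable for your use; the name VERBOSE_PATTERN is only illustrative and not part of the original answer.

```python
import re

# Same pattern as above, split into commented pieces with re.VERBOSE.
# VERBOSE_PATTERN is a hypothetical name chosen for this sketch.
VERBOSE_PATTERN = re.compile(
    r"""
    ^[^:]*:            # from the line start up to and including the first colon
    \s*                # optional whitespace
    ([^?]+\?)          # group 1: the question, ending in a question mark
    \s+                # one or more whitespace characters
    (?:\([^()]*\)\s)?  # optional parenthesised block plus one whitespace character
    (.*)               # group 2: the rest of the line
    """,
    re.VERBOSE | re.MULTILINE,
)

line = "Task 2: Question 2? (Format:****) answer2"
match = VERBOSE_PATTERN.search(line)
print(match.group(1))  # Question 2?
print(match.group(2))  # answer2
```

Because whitespace and # comments are ignored in verbose mode, this should behave the same as the one-line pattern on the sample data.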

See the regex demo.

Example code

import re

pattern = r"^[^:]*:\s*([^?]+\?)\s+(?:\([^()]*\)\s)?(.*)"
s = ("Task 1: Question 1? answer1\n"
     "Task 2: Question 2? (Format:****) answer2\n"
     "Task 3: Question 3? answer3\n"
     "Task 4: Question 4? (Format:*****) answer4")

# re.MULTILINE lets ^ match at the start of every line
matches = re.finditer(pattern, s, re.MULTILINE)
matches1 = []  # questions (group 1)
matches2 = []  # answers (group 2)
for match in matches:
    matches1.append(match.group(1))
    matches2.append(match.group(2))

print(matches1)
print(matches2)

Output

['Question 1?', 'Question 2?', 'Question 3?', 'Question 4?']
['answer1', 'answer2', 'answer3', 'answer4']
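
To tie this back to the Excel export in the question, here is a minimal sketch of how the two lists could feed the DataFrame. The helper names build_columns and write_excel are hypothetical; the column names, sheet name, file name and xlsxwriter engine are taken from the question's code. It assumes pandas and xlsxwriter are installed and that the extracted text keeps each task on its own line.

```python
import re

PATTERN = re.compile(
    r"^[^:]*:\s*([^?]+\?)\s+(?:\([^()]*\)\s)?(.*)", re.MULTILINE
)


def build_columns(text):
    """Return the question and answer columns found in the text."""
    questions, answers = [], []
    for match in PATTERN.finditer(text):
        questions.append(match.group(1))
        answers.append(match.group(2))
    return {"Soru-TR": questions, "Cevap": answers}


def write_excel(columns, path="Questions.xlsx"):
    """Write the columns to an Excel file (assumes pandas and xlsxwriter)."""
    import pandas as pd  # imported here so build_columns works without pandas

    df = pd.DataFrame(columns)
    # The context manager saves and closes the workbook on exit.
    with pd.ExcelWriter(path, engine="xlsxwriter") as writer:
        df.to_excel(writer, sheet_name="Sheet1", index=False)
```

Calling write_excel(build_columns(extract_text("file.pdf"))) would then mirror the flow of the original script. Using the writer as a context manager avoids the explicit writer.save() call, which recent pandas versions no longer provide.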
