Extracting text that ends with escape characters in Python



I am trying to parse the key details of a PDF paper with Python and extract the paper's title, its authors, and their email addresses.

from PyPDF2 import PdfReader

reader = PdfReader("paper.pdf")
text = ""
# Concatenate the text of every page, one page per line
for page in reader.pages:
    text += page.extract_text() + "\n"

This returns the raw text of the PDF:

'Title\nGoes\nHere\nAuthor Name (sdsd@mail.net)\nUniversity of Teeyab\nSeptember 6, 2022\nSome text in the Document.\n'

I have a function that removes newlines, tabs, and similar characters:

def remove_newlines_tabs(text):
    """
    This function will remove all the occurrences of newlines, tabs, and combinations like: \\n, \\.

    arguments:
        text: "text" of type "String".

    return:
        value: "text" after removal of newlines, tabs, \\n, \\ characters.

    Example:
    Input : This is her \\ first day at this place.\n Please,\t Be nice to her.\\n
    Output : This is her first day at this place. Please, Be nice to her.

    """

    # Replace every literal \\n, newline, tab, and backslash with a space,
    # then repair ". com" that a removed line break may have split.
    formatted_text = text.replace('\\n', ' ').replace('\n', ' ').replace('\t', ' ').replace('\\', ' ').replace('. com', '.com')
    return formatted_text

which returns

'Title Goes Here Author Name (sdsd@mail.net) University of Teeyab September 6, 2022 Some text in the Document. '

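This makes extracting the email easy. How can I extract the title and the authors of the PDF? The title is the most important part, but I am not sure of the best approach...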
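For reference, the email step on the flattened text could look roughly like the sketch below. The pattern is a deliberately loose assumption, not a full address validator, and the names `flat_text` and `EMAIL_PATTERN` are only illustrative.

```python
import re

# Assumed sample: the flattened text shown above.
flat_text = ("Title Goes Here Author Name (sdsd@mail.net) University of Teeyab "
             "September 6, 2022 Some text in the Document. ")

# Loose pattern (an assumption): runs of characters that are not whitespace,
# "@", or parentheses around a single "@", with at least one dot in the domain.
EMAIL_PATTERN = re.compile(r"[^\s@()]+@[^\s@()]+\.[^\s@()]+")

emails = EMAIL_PATTERN.findall(flat_text)
print(emails)  # ['sdsd@mail.net']
```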

Here is a regex-based solution that relies on the following assumptions:

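  • Each word of the title is separated by a newline (\n)
  • The words of the author name are separated by spaces
  • The email address is always enclosed in parentheses ()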
import re

test_string = 'Title\nGoes\nHere\nAuthor Name (sdsd@mail.net)\nUniversity of Teeyab\nSeptember 6, 2022\nSome text in the Document.\n'
# \w matches letters, digits, and underscore
# \s matches whitespace, including \t\n\r\f\v
# first, let's extract the string that appears before the parentheses
result = re.search(r"([\w\s]+)", test_string)
print(result) # <re.Match object; span=(0, 28), match='Title\nGoes\nHere\nAuthor Name '>
# clean up leading and trailing whitespace using strip() and
# split the string by \n to separate title and author
title_author = result[0].strip().split("\n")
print(title_author) # ['Title', 'Goes', 'Here', 'Author Name']
# join the words of the title into a single string
title = " ".join(title_author[:-1])
author = title_author[-1]
print(title) # Title Goes Here
print(author) # Author Name
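
Under the same three assumptions, the title, author, and email can be pulled out in one pass on the raw text, before any newlines are removed. The following is only a sketch: `parse_header` is a hypothetical helper name, and the sample string assumes the raw text uses real newline characters.

```python
import re

def parse_header(raw_text):
    """Hypothetical helper: return (title, author, email), or None if no match."""
    # Group 1: the run of word/whitespace characters right before "(".
    # Group 2: the content of the parentheses, assumed to be the email.
    match = re.search(r"([\w\s]+)\(([^)\s]+)\)", raw_text)
    if match is None:
        return None
    lines = match.group(1).strip().split("\n")
    title = " ".join(lines[:-1])  # every line except the last is the title
    author = lines[-1]            # the last line before "(" is the author
    return title, author, match.group(2)

sample = ('Title\nGoes\nHere\nAuthor Name (sdsd@mail.net)\n'
          'University of Teeyab\nSeptember 6, 2022\nSome text in the Document.\n')
print(parse_header(sample))  # ('Title Goes Here', 'Author Name', 'sdsd@mail.net')
```

Because `[\w\s]+` does not match punctuation, a title containing a colon or hyphen would be cut off at that character, so the pattern would need widening for real papers.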
