Fix sentences that have a line break in the middle, in Python



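I am currently using Apache Tika to extract text from PDFs, and NLTK for named entity recognition and other tasks. My problem is that sentences extracted from the PDF documents have line breaks in the middle of them. For example,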

I am a sentence that has a python line \nbreak in the middle of it.

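The pattern is usually a space followed by a line break, <space>\n, or sometimes <space>\n<space>. I want to repair these sentences so that I can run a sentence tokenizer on them.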

I am currently trying the regex pattern (.+?)(?:\r\n|\n)(.+[.!?]+[\s|$]) to replace the \n.

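Problems:

  1. A sentence that starts on the same line where another sentence ends is not matched.
  2. How do I match sentences whose line breaks span several lines? In other words, how do I allow the line-break part of the pattern to occur more than once?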

    text = """
    Random Data, Company
    2015
    This is a sentence that has line 
    break in the middle of it due to extracting from a PDF.
    How do I support
    3 line sentence 
    breaks please?
    HEADER HERE
    The first sentence will 
    match. However, this line will not match
    for some reason 
    that I cannot figure out.
    Portfolio: 
    http://DoNotMatchMeBecauseIHaveAPeriodInMe.com 
    Full Name 
    San Francisco, CA  
    94000
    1500 testing a number as the first word in
    a broken sentence.
    Match sentences with capital letters on the next line like 
    Wi-Fi.
    This line has 
    trailing spaces after exclamation mark!       
    """
    import re
    new_text = re.sub(pattern=r'(.+?)(?:\r\n|\n)(.+[.!?]+[\s|$])', repl=r'\g<1>\g<2>', string=text, flags=re.MULTILINE)
    print(new_text)
    expected_result = """
    Random Data, Company
    2015
    This is a sentence that has line break in the middle of it due to extracting from a PDF.
    How do I support 3 line sentence breaks please?
    HEADER HERE
    The first sentence will match. However, this line will not match for some reason that I cannot figure out.
    Portfolio: 
    http://DoNotMatchMeBecauseIHaveAPeriodInMe.com 
    Full Name 
    San Francisco, CA  
    94000
    1500 testing a number as the first word in a broken sentence.
    Match sentences with capital letters on the next line like Wi-Fi.
    This line has trailing spaces after exclamation mark!       
    """
    


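The regex does not match lines that end with trailing whitespace, which is the case for the sentence that is split across 3 lines. As a result, that sentence is not merged into a single sentence.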

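Here is an alternative regex that joins all lines between two blank lines into one, making sure there is exactly one space between the joined lines: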

    # The new regex
    (\S)[ \t]*(?:\r\n|\n)[ \t]*(\S)
    # The replacement string: \1 \2

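Explanation: this searches for any non-whitespace character \S, followed by optional spaces or tabs, a line break, more optional spaces or tabs, and then another \S. It replaces the line break and the whitespace between the two \S characters with a single space. Space and tab are given explicitly because \s also matches line breaks.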
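As a rough illustration of the explanation above, the following Python sketch applies the pattern with re.sub. The helper names join_broken_lines and join_with_lookarounds and the sample string are made up for this example. The lookaround variant is not part of the answer; it is only one possible way to handle one-character lines, which the capturing version can skip because it consumes the character after the line break and cannot reuse it for the next join.

```python
import re

# Pattern from the answer, with the backslashes written out.
JOIN_LINES = re.compile(r'(\S)[ \t]*(?:\r\n|\n)[ \t]*(\S)')

# Hypothetical variant (an assumption, not from the answer): lookarounds
# check the neighbouring characters without consuming them.
JOIN_LINES_LOOKAROUND = re.compile(r'(?<=\S)[ \t]*(?:\r\n|\n)[ \t]*(?=\S)')


def join_broken_lines(text: str) -> str:
    """Replace a line break between two non-blank lines with one space."""
    return JOIN_LINES.sub(r'\1 \2', text)


def join_with_lookarounds(text: str) -> str:
    """Same idea, but the surrounding characters are not consumed."""
    return JOIN_LINES_LOOKAROUND.sub(' ', text)


sample = (
    "How do I support \n"
    "3 line sentence \n"
    "breaks please?\n"
    "\n"
    "This line has \n"
    "trailing spaces after exclamation mark!   \n"
)

# The blank line survives because [ \t]* cannot cross a second line break.
print(join_broken_lines(sample))

# With \s* instead of [ \t]*, the blank line would be swallowed as well.
greedy = re.sub(r'(\S)\s*(?:\r\n|\n)\s*(\S)', r'\1 \2', "a.\n\nb")
print(repr(greedy))

# One-character lines: the capturing version misses the second join.
print(repr(join_broken_lines("a\nb\nc")))
print(repr(join_with_lookarounds("a\nb\nc")))
```

Note that the sample text in the question has no blank lines at all, so this pattern would join every line there, including headers such as HEADER HERE. Those lines would still need separate handling.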
