Regex that matches a single dot but not numbers or ellipses

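I'm working on a sentencizer and a tokenizer for a tutorial. This means splitting a document string into sentences, and sentences into words. Examples: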

#Sentencizing
"This is a sentence. This is another sentence! A third..."=>["This is a sentence.", "This is another sentence!", "A third..."]
#Tokenization
"Tokens are 'individual' bits of a sentence."=>["Tokens", "are", "'individual'", "bits", "of", "a", "sentence", "."]

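As you can see, this needs more than a plain string.split(). I'm using re.sub() to append a "special" tag after each match (and later splitting on that tag), first for sentences and then for tokens.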

So far it works well, but there is one problem: how do I write a regex that splits at a dot, but not at an ellipsis (...) or inside a number (3.14)?

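I have been trying options like the one below with lookarounds (I need to match the group and then be able to refer back to it for the append), but none of them has worked: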

#Do a negative look behind for preceding numbers or dots, central capture group is a dot, do the same as first for a look ahead.
(?![\d.])(\.)(?<![\d.])

It is applied like this:

sentence = re.sub(pattern, r'\g<0>' + special_tag, raw_sentence)
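A quick check of the lookaround attempt, assuming the escapes in it read `\d` and `\.`. As posted, the lookahead sits before the dot, so it inspects the dot itself and rejects it; the pattern can never match. Swapping the two lookarounds (a hypothetical fix, not something from the post) matches standalone dots, but it skips ellipses entirely and also skips a period that directly follows a digit. The `<SPLIT>` tag is only an illustrative marker.

```python
import re

text = "A third... Pi is 3.14 exactly. Done."

# Pattern as posted: the negative lookahead runs before the dot is consumed,
# so it tests the dot itself and always fails.
as_posted = re.compile(r'(?![\d.])(\.)(?<![\d.])')
no_matches = as_posted.findall(text)
print(no_matches)

# Hypothetical variant: lookbehind before the dot, lookahead after it.
swapped = re.compile(r'(?<![\d.])(\.)(?![\d.])')
tagged = swapped.sub(r'\1<SPLIT>', text)
print(tagged)
```

Only the dots after "exactly" and "Done" get tagged here, so the ellipsis case still needs separate handling.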

I used the following to find the periods that look relevant:

import re
m = re.compile(r'[0-9]\.[^0-9.]|[^0-9]\.[^0-9.]|[!?]')
st = "This is a sentence. This is another sentence! A third...  Pi is 3.14.  This is 1984.  Hello?"
m.findall(st)
# if you want to use lookahead, you can use something like this:
m = re.compile(r'(?<=[0-9])\.(?=[^0-9.])|(?<=[^0-9])\.(?=[^0-9.])|[!?]')

It isn't particularly elegant, but I was also trying to handle cases like "we have a 0.1% chance of success".
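A minimal sketch of how the lookaround version could plug into the tag-and-split approach from the question. The escaped dots (`\.`) are assumed, and `SPECIAL_TAG` and `split_sentences` are hypothetical names chosen for illustration.

```python
import re

# Lookaround pattern, with the literal dots escaped (assumed).
pattern = re.compile(r'(?<=[0-9])\.(?=[^0-9.])|(?<=[^0-9])\.(?=[^0-9.])|[!?]')

SPECIAL_TAG = "<SPLIT>"  # hypothetical marker; any string absent from the text works

def split_sentences(text):
    # Append the tag after every matched terminator, then split on the tag.
    tagged = pattern.sub(r'\g<0>' + SPECIAL_TAG, text)
    return [piece.strip() for piece in tagged.split(SPECIAL_TAG) if piece.strip()]

st = "This is a sentence. This is another sentence! A third...  Pi is 3.14.  This is 1984.  Hello?"
sentences = split_sentences(st)
for s in sentences:
    print(s)
```

Because both lookaheads require a following character, a dot at the very end of the string is not matched; the text before it still comes out as the last piece of the split.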

Good luck!

This may be overkill, or need some cleanup, but it is the best regex I could come up with:

((([^.\n ]+|(\.+\d+))\b[^.]? ?)+)([.?!)"]+)

Breaking it down:

[^.\n ]+    // Matches 1+ times any char that isn't a dot, newline or space.
(\.+\d+)    // Captures the special case of decimal numbers
\b[^.]? ?   // \b is a word boundary. This may be optionally
            // followed by any non-dot character, and optionally a space.

All of the previous parts are matched 1+ times. To decide that a sentence has finished, we use the following:

[.?!)"]+ // Matches any of the common sentence terminators 1+ times
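A small sketch of running the full pattern with Python's `re.finditer`, assuming the escapes are `\n`, `\.`, `\d` and `\b`. The sample text is made up for illustration.

```python
import re

# Full pattern from this answer, with escapes assumed as described above.
sentence_re = re.compile(r'((([^.\n ]+|(\.+\d+))\b[^.]? ?)+)([.?!)"]+)')

text = "This is a sentence. Pi is 3.14. This is 1984. A third... Hello?"

# Each match is one sentence body followed by its run of terminators.
found = [m.group(0) for m in sentence_re.finditer(text)]
for s in found:
    print(s)
```

One caveat worth testing on your own data: `[^.]?` accepts any non-dot character, so a `!` or `?` that is followed by more text appears to be absorbed into the sentence body instead of ending the match. Dots do not have this problem, since only the final group can consume them.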

Try it out!
