Python regex to find multiple consecutive punctuation marks



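I am streaming plain-text records through MapReduce and need to check each record for 2 or more consecutive punctuation marks. The 12 symbols I need to check are: -/\()!"+,'&.

I tried turning this list of punctuation into an array like this: punctuation = [r'-', r'/', r'\\', r'(', r')', r'!', r'"', r'+', r',', r"'", r'&', r'.']

I can find single characters with nested for loops, for example: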

for t in test_cases:
    print t
    for p in punctuation:
        print p
        if re.search(p, t):
            print 'found a match!', p, t
        else:
            print 'no match'

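However, when I test this, the single backslash character is not found, and I don't know how to get only the results where a character occurs 2 or more times in a row. I have read that I need to use the + symbol, but I don't know the correct syntax for it.

Here are some test cases: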

The quick '''brown fox
The &&quick brown fox
The quick\brown fox
The quick\\brown fox
The -quick brown// fox
The quick--brown fox
The (quick brown) fox,,,
The quick ++brown fox
The "quick brown" fox
The quick/brown fox
The quick&brown fox
The ""quick"" brown fox
The quick,, brown fox
The quick brown fox...
The quick-brown fox
The ((quick brown fox
The quick brown)) fox
The quick brown fox!!!
The 'quick' brown fox

Translated into a Python list, it looks like this:

test_cases = [
"The quick '''brown fox",
'The &&quick brown fox',
'The quick\\brown fox',
'The quick\\\\brown fox',
'The -quick brown// fox',
'The quick--brown fox',
'The (quick brown) fox,,,',
'The quick ++brown fox',
'The "quick brown" fox',
'The quick/brown fox',
'The quick&brown fox',
'The ""quick"" brown fox',
'The quick,, brown fox',
'The quick brown fox...',
'The quick-brown fox',
'The ((quick brown fox',
'The quick brown)) fox',
'The quick brown fox!!!',
"The 'quick' brown fox" ]

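How can I use a Python regex to identify and report all matches where a punctuation mark occurs 2 or more times in a row?

Punctuation characters can be placed in a character class, which is written in square brackets. What comes next depends on whether a run of two or more punctuation characters may be made up of any of the characters, or whether the characters must all be the same.

In the first case, curly braces can be appended to specify the minimum (2) and maximum number of repetitions. The maximum is unbounded here, so it is left empty: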

[...]{2,} # min. 2 or more

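If only repetitions of the same character should be found, put the first matched punctuation character into a group. The same group (= the same character) then follows one or more times:

([...])\1+

The backreference \1 refers to the first group in the expression. Groups, which are marked by opening parentheses, are numbered from left to right.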
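To make the difference between the two forms concrete, here is a minimal sketch. The variable names and the sample string are illustrative assumptions, and the character class is built with re.escape instead of being typed by hand, which sidesteps most of the escaping questions.

```python
import re

# The 12 punctuation characters from the question.
PUNCT = '-/\\()!"+,\'&.'

# re.escape backslash-escapes the special characters, so the result is safe inside [...].
CHAR_CLASS = '[' + re.escape(PUNCT) + ']'

# Any two or more punctuation characters in a row.
any_run = re.compile(CHAR_CLASS + '{2,}')
# The same punctuation character two or more times in a row.
same_run = re.compile('(' + CHAR_CLASS + r')\1+')

sample = 'The (quick brown) fox),'   # hypothetical input: ")" directly followed by ","

print(any_run.search(sample).group(0))   # the mixed run is found
print(same_run.search(sample))           # no character repeats, so there is no match
```

The first pattern reports the mixed run, while the second finds nothing because no single character is repeated.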

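The next issue is escaping. Python strings have their own escaping rules, and regular expressions need additional escaping on top of that. A character class does not need much escaping, but the backslash must be doubled. The example below therefore uses four backslashes: doubled once for the string and doubled again for the regular expression.

Raw strings r'...' are useful for patterns, but here both single and double quotes are needed.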

>>> import re
>>> test_cases = [
    "The quick '''brown fox",
    'The &&quick brown fox',
    'The quick\\brown fox',
    'The quick\\\\brown fox',
    'The -quick brown// fox',
    'The quick--brown fox',
    'The (quick brown) fox,,,',
    'The quick ++brown fox',
    'The "quick brown" fox',
    'The quick/brown fox',
    'The quick&brown fox',
    'The ""quick"" brown fox',
    'The quick,, brown fox',
    'The quick brown fox...',
    'The quick-brown fox',
    'The ((quick brown fox',
    'The quick brown)) fox',
    'The quick brown fox!!!',
    "The 'quick' brown fox" ]
>>> pattern_any_punctuation = re.compile('([-/\\\\()!"+,&\'.]{2,})')
>>> pattern_same_punctuation = re.compile('(([-/\\\\()!"+,&\'.])\\2+)')
>>> for t in test_cases:
    match = pattern_same_punctuation.search(t)
    if match:
        print("{:24} => {}".format(t, match.group(1)))
    else:
        print(t)
The quick '''brown fox   => '''
The &&quick brown fox    => &&
The quick\brown fox
The quick\\brown fox     => \\
The -quick brown// fox   => //
The quick--brown fox     => --
The (quick brown) fox,,, => ,,,
The quick ++brown fox    => ++
The "quick brown" fox
The quick/brown fox
The quick&brown fox
The ""quick"" brown fox  => ""
The quick,, brown fox    => ,,
The quick brown fox...   => ...
The quick-brown fox
The ((quick brown fox    => ((
The quick brown)) fox    => ))
The quick brown fox!!!   => !!!
The 'quick' brown fox
>>> 
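
The two escaping levels described above can be checked directly. The sketch below is only an illustration: it compares the ordinary-string and raw-string spellings of an escaped backslash, and it shows a triple-quoted raw string as one possible way to keep both quote characters in the pattern without escaping them.

```python
import re

# Two spellings of the same two-character regex text: an escaped backslash.
plain = '\\\\'   # ordinary string: each pair of backslashes collapses to one
raw = r'\\'      # raw string: backslashes are kept as typed

print(plain == raw, len(plain))

# A triple-quoted raw string may contain both ' and " unescaped.
pattern = re.compile(r'''(([-/\\()!"+,&'.])\2+)''')

# The subject below holds two backslashes, so the repeated-character pattern matches.
print(pattern.search('a\\\\b').group(1) == '\\\\')
# A single backslash in the subject is not a repetition, so search() returns None.
print(pattern.search('a\\b'))
```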

You can use {2} in the regex to match two consecutive occurrences of characters from the character class:

>>> regex = re.compile(r'[-/()!"+,\'&]{2}')
>>> [s for s in test_cases if regex.search(s)]
["The quick '''brown fox",
 'The &&quick brown fox',
 'The -quick brown// fox',
 'The quick--brown fox',
 'The (quick brown) fox,,,',
 'The quick ++brown fox',
 'The ""quick"" brown fox',
 'The quick,, brown fox',
 'The ((quick brown fox',
 'The quick brown)) fox',
 'The quick brown fox!!!']
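
When search() is used only as a yes/no filter, {2} and {2,} select the same lines; they differ in how much of the run the match reports. A small sketch, with illustrative variable names:

```python
import re

exactly_two = re.compile(r'''[-/()!"+,'&]{2}''')
two_or_more = re.compile(r'''[-/()!"+,'&]{2,}''')

line = 'The quick brown fox!!!'

print(exactly_two.search(line).group(0))  # only the first two characters of the run
print(two_or_more.search(line).group(0))  # the whole run
```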

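How about a regular expression? This also helps find 2 or more consecutive punctuation marks.

A regex like ([\-/()!"+,'&]{2,})g

{2,} stands for two or more

g stands for global search: do not stop at the first match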
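
Python's re module has no g flag; re.findall and re.finditer return every non-overlapping match instead. A minimal sketch using the character class from this answer (the hyphen is placed first so it needs no escaping; the sample line is one of the question's test cases):

```python
import re

# Character class from this answer; note that it does not include "\" or ".".
runs = re.compile(r'''[-/()!"+,'&]{2,}''')

line = 'The ""quick"" brown fox'

# finditer yields one match object per non-overlapping run, left to right.
for m in runs.finditer(line):
    print(m.start(), m.group(0))

# findall returns only the matched text of each run.
print(runs.findall(line))
```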

Thanks to @Heiko Oberdiek, this is the exact code I used to solve the problem (I added . to the punctuation list):

punctuation = re.compile('(([-/\\\\()!"+,&\'.])\\2+)')
x = 1
for t in test_cases:
    match = punctuation.search(t)
    if match:
        print "{0:2} {1:24} => {2}".format(x, t, match.group(1))
        x += 1

This covers all of my test cases exactly:

 1 The quick '''brown fox   => '''
 2 The &&quick brown fox    => &&
 3 The quick\\brown fox     => \\
 4 The -quick brown// fox   => //
 5 The quick--brown fox     => --
 6 The (quick brown) fox,,, => ,,,
 7 The quick ++brown fox    => ++
 8 The ""quick"" brown fox  => ""
 9 The quick,, brown fox    => ,,
10 The quick brown fox...   => ...
11 The ((quick brown fox    => ((
12 The quick brown)) fox    => ))
13 The quick brown fox!!!   => !!!
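
Since the records arrive one per line from a streaming job, the same check can be wrapped in a generator. This is only a sketch under assumptions: the function name flag_records, the tab-separated output, and the in-memory stand-in for the input are all hypothetical and not part of the code above.

```python
import re

# Same pattern as above, written as a triple-quoted raw string.
PUNCTUATION = re.compile(r'''(([-/\\()!"+,&'.])\2+)''')

def flag_records(lines):
    """Yield (record, run) for records containing a repeated punctuation character."""
    for line in lines:
        record = line.rstrip('\n')
        match = PUNCTUATION.search(record)
        if match:
            yield record, match.group(1)

# Stand-in for the streamed input; a real mapper might iterate over sys.stdin instead.
records = ['The quick--brown fox\n', 'The quick-brown fox\n']
for record, run in flag_records(records):
    print('{}\t{}'.format(record, run))
```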
