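I'm streaming plain-text records through MapReduce and need to check each record for 2 or more consecutive punctuation symbols. The 12 symbols I need to check are: -/\()!"+,'&.
I tried converting this list of punctuation into an array like this:
punctuation = [r'-', r'/', r'\\', r'(', r')', r'!', r'"', r'+', r',', r"'", r'&', r'.']
I can find individual characters with a nested for loop, for example: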
for t in test_cases:
    print(t)
    for p in punctuation:
        print(p)
        if re.search(p, t):
            print('found a match!', p, t)
        else:
            print('no match')
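However, when I test this, the single backslash character is not found, and I don't know how to get only the results where a symbol occurs 2 or more times in a row. I've read that I need to use the + symbol, but I don't know the correct syntax for it.
Here are some test cases: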
The quick '''brown fox
The &&quick brown fox
The quick\brown fox
The quick\\brown fox
The -quick brown// fox
The quick--brown fox
The (quick brown) fox,,,
The quick ++brown fox
The "quick brown" fox
The quick/brown fox
The quick&brown fox
The ""quick"" brown fox
The quick,, brown fox
The quick brown fox...
The quick-brown fox
The ((quick brown fox
The quick brown)) fox
The quick brown fox!!!
The 'quick' brown fox
Translated into a Python list, they look like this:
test_cases = [
    "The quick '''brown fox",
    'The &&quick brown fox',
    'The quick\\brown fox',
    'The quick\\\\brown fox',
    'The -quick brown// fox',
    'The quick--brown fox',
    'The (quick brown) fox,,,',
    'The quick ++brown fox',
    'The "quick brown" fox',
    'The quick/brown fox',
    'The quick&brown fox',
    'The ""quick"" brown fox',
    'The quick,, brown fox',
    'The quick brown fox...',
    'The quick-brown fox',
    'The ((quick brown fox',
    'The quick brown)) fox',
    'The quick brown fox!!!',
    "The 'quick' brown fox" ]
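How can I use Python regular expressions to identify and report all matches where a punctuation symbol occurs 2 or more times consecutively?
The punctuation symbols can be put into a character class, which is written in square brackets. After that, it depends on whether a run of two or more punctuation characters may consist of any of the symbols, or whether the symbols must all be the same.
In the first case, curly braces can be appended to specify the minimum (2) and maximum number of repetitions. The latter is unbounded, so it is left empty:
[...]{2,} # min. 2, no maximum
If only repetitions of the same character should be found, the first matched punctuation character is put into a group. The same group (= the same character) then follows one or more times:
([...])\1+
The backreference \1 refers to the first group in the expression. Groups, marked by their opening parentheses, are numbered from left to right.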
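To make the difference between the two variants concrete, here is a minimal sketch. The names PUNCT, any_run, same_run and first_run, and the sample string, are made up for this illustration; the character class is assumed to hold the 12 symbols from the question.

```python
import re

# Character class with the 12 symbols from the question (illustrative name).
# The hyphen comes first so it is literal; the backslash is doubled for the regex.
PUNCT = r"""[-/\\()!"+,'&.]"""

any_run = re.compile(PUNCT + r"{2,}")          # 2+ symbols, possibly different ones
same_run = re.compile("(" + PUNCT + r")\1+")   # one symbol repeated 2+ times


def first_run(pattern, text):
    """Return the first matching run in text, or None if there is none."""
    m = pattern.search(text)
    return m.group(0) if m else None


sample = "The (quick brown) fox,.!!"
print(first_run(any_run, sample))    # ,.!!
print(first_run(same_run, sample))   # !!
```

The single parentheses are skipped by both patterns, the mixed run ,.!! only counts for the first one, and the second one reports just the repeated !!.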
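The next issue is escaping. Python strings have their own escaping rules, and regular expressions need additional escaping on top of that. A character class does not need much escaping, but the backslash must be doubled. The example below therefore writes the backslash four times: doubled once because of the string literal and doubled again because of the regular expression.
Raw strings r'...' are useful for patterns, but here both single and double quotes are needed.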
>>> import re
>>> test_cases = [
    "The quick '''brown fox",
    'The &&quick brown fox',
    'The quick\\brown fox',
    'The quick\\\\brown fox',
    'The -quick brown// fox',
    'The quick--brown fox',
    'The (quick brown) fox,,,',
    'The quick ++brown fox',
    'The "quick brown" fox',
    'The quick/brown fox',
    'The quick&brown fox',
    'The ""quick"" brown fox',
    'The quick,, brown fox',
    'The quick brown fox...',
    'The quick-brown fox',
    'The ((quick brown fox',
    'The quick brown)) fox',
    'The quick brown fox!!!',
    "The 'quick' brown fox" ]
>>> pattern_any_punctuation = re.compile('([-/\\\\()!"+,&\'.]{2,})')
>>> pattern_same_punctuation = re.compile('(([-/\\\\()!"+,&\'.])\\2+)')
>>> for t in test_cases:
        match = pattern_same_punctuation.search(t)
        if match:
            print("{:24} => {}".format(t, match.group(1)))
        else:
            print(t)
The quick '''brown fox   => '''
The &&quick brown fox    => &&
The quick\brown fox
The quick\\brown fox     => \\
The -quick brown// fox   => //
The quick--brown fox     => --
The (quick brown) fox,,, => ,,,
The quick ++brown fox    => ++
The "quick brown" fox
The quick/brown fox
The quick&brown fox
The ""quick"" brown fox  => ""
The quick,, brown fox    => ,,
The quick brown fox...   => ...
The quick-brown fox
The ((quick brown fox    => ((
The quick brown)) fox    => ))
The quick brown fox!!!   => !!!
The 'quick' brown fox
>>>
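As a side note on the escaping layers discussed above, the following sketch compares the spellings. It assumes a triple-quoted raw string is acceptable as a way to hold both quote characters without extra escapes; the variable names are illustrative only.

```python
import re

# Two spellings of the same regex, which matches exactly one backslash.
plain = "\\\\"   # normal string: 4 characters typed, 2 reach the regex engine
raw = r"\\"      # raw string: 2 characters typed, 2 reach the regex engine
print(plain == raw)   # True

# A triple-quoted raw string can contain both ' and " unescaped,
# so only the regex-level doubling of the backslash remains.
same_punctuation = re.compile(r"""(([-/\\()!"+,'&.])\2+)""")

one = "The quick\\brown fox"     # the text contains a single backslash
two = "The quick\\\\brown fox"   # the text contains two consecutive backslashes

print(same_punctuation.search(one))            # None
print(same_punctuation.search(two).group(1))   # \\
```

Note that a test case written as 'The quick\brown fox' in a normal string contains no backslash at all, because \b is the backspace escape.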
You can use {2} in the regex to match two consecutive occurrences of the character class:
>>> regex = re.compile(r'[-/()!"+,\'&]{2}')
>>> [s for s in test_cases if regex.search(s)]
["The quick '''brown fox",
'The &&quick brown fox',
'The -quick brown// fox',
'The quick--brown fox',
'The (quick brown) fox,,,',
'The quick ++brown fox',
'The ""quick"" brown fox',
'The quick,, brown fox',
'The ((quick brown fox',
'The quick brown)) fox',
'The quick brown fox!!!']
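For a plain yes/no filter like the list comprehension above, {2} and {2,} select the same strings; they differ only in how much text the match covers. A small sketch, with hypothetical names and the same character class (no backslash or dot) as in this answer:

```python
import re

exactly_two = re.compile(r"""[-/()!"+,'&]{2}""")
two_or_more = re.compile(r"""[-/()!"+,'&]{2,}""")

text = "The quick brown fox!!!"
print(exactly_two.search(text).group(0))   # !!
print(two_or_more.search(text).group(0))   # !!!

# As a filter, both give the same answer for every string.
print(bool(exactly_two.search(text)) == bool(two_or_more.search(text)))   # True
```

So {2} is enough for detection, while {2,} is the better fit when the full run should be reported.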
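What about a regular expression? This also helps find 2 or more consecutive punctuation symbols.
A regex like ([\-/()!"+,'&]{2,})g
{2,} stands for two or more
g stands for a global search: don't stop at the first match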
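Python's re module has no g flag; the usual way to get every match instead of only the first is re.findall or re.finditer. A minimal sketch, assuming the same character class as in this answer (the name runs and the sample text are illustrative):

```python
import re

# No capturing group here, so findall returns the full text of each match.
runs = re.compile(r"""[-/()!"+,'&]{2,}""")

text = "The -quick brown// fox,,,"
print(runs.findall(text))   # ['//', ',,,']

# finditer yields match objects, which also carry the position of each run.
for m in runs.finditer(text):
    print(m.start(), m.group(0))
```

The single - before quick is not reported because it is not followed by a second symbol.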
Thanks to @Heiko Oberdiek, this is the exact code I used that solved the problem (I added . to the punctuation list):
punctuation = re.compile('(([-/\\\\()!"+,&\'.])\\2+)')
x = 1
for t in test_cases:
    match = punctuation.search(t)
    if match:
        print("{0:2} {1:24} => {2}".format(x, t, match.group(1)))
        x += 1
This correctly handles all of my test cases:
 1 The quick '''brown fox   => '''
 2 The &&quick brown fox    => &&
 3 The quick\\brown fox     => \\
 4 The -quick brown// fox   => //
 5 The quick--brown fox     => --
 6 The (quick brown) fox,,, => ,,,
 7 The quick ++brown fox    => ++
 8 The ""quick"" brown fox  => ""
 9 The quick,, brown fox    => ,,
10 The quick brown fox...   => ...
11 The ((quick brown fox    => ((
12 The quick brown)) fox    => ))
13 The quick brown fox!!!   => !!!
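Since the records come from a streaming job, the check could be wrapped roughly as follows. This is only a sketch under assumptions: records are taken to arrive one per line, the tab-separated output is just one possible format, and flag_records and REPEATED are hypothetical names, not part of the code above.

```python
import re

# One of the 12 symbols repeated 2 or more times (same idea as the pattern above).
REPEATED = re.compile(r"""([-/\\()!"+,'&.])\1+""")


def flag_records(lines):
    """Yield (record, run) for each record that contains a repeated symbol."""
    for line in lines:
        record = line.rstrip("\n")
        match = REPEATED.search(record)
        if match:
            yield record, match.group(0)


# In a streaming mapper, `lines` would be sys.stdin; a list stands in for it here.
sample = [
    "The quick--brown fox\n",
    "The quick-brown fox\n",
    "The quick brown fox!!!\n",
]
for record, run in flag_records(sample):
    print("{}\t{}".format(record, run))
```

Compiling the pattern once at module level avoids rebuilding it for every record.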