Python 3 regex: remove all punctuation except a special word pattern

My text contains tokens of the form -ABC_ABC-, -ABC-, or -ABC_ABC_ABC-.

My regex pattern:

([-]+[A-Z]+(?:[_]?[A-Z])+[-]+)
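As a quick sanity check, the pattern can be tried against the three token shapes with the standard re module. This is only a sketch: the helper name is_token and the sample strings are made up for illustration.

```python
import re

# The pattern from the question, unchanged.
TOKEN_RE = re.compile(r"([-]+[A-Z]+(?:[_]?[A-Z])+[-]+)")

def is_token(s):
    """Return True if the whole string is a protected token (hypothetical helper)."""
    return TOKEN_RE.fullmatch(s) is not None

for sample in ["-ABC-", "-ABC_ABC-", "-ABC_ABC_ABC-", "-abc-", "ABC"]:
    print(sample, is_token(sample))
```

Note that, as written, the pattern needs at least two capital letters between the hyphens, so a single-letter token such as -A- would not match.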

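I want to remove all punctuation from the string except where it is part of the pattern above. Can I do this with a regex substitution?

Input string: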

Lorem Ipsum, simply dummy text -TOKEN_ABC-, yes!

Expected:

Lorem Ipsum simply dummy text -TOKEN_ABC- yes

I already have this working with an if check, but it feels inefficient because I have to test every word.

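import re

# function name is illustrative; the original snippet showed only the body
def remove_punctuation(text):
    # text is assumed to be an iterable of tokens exposing a .text attribute
    sentence_list = []
    for word in text:
        # keep protected -TOKEN- words untouched
        if re.match(r"([-][A-Z]+(?:[_]?[A-Z]*[-]))", word.text):
            sentence_list.append(word.text)
        else:
            # strip punctuation, hyphens and underscores from everything else
            text2 = re.sub(r"([^\w\s]|[-_])", r"", word.text)
            sentence_list.append(text2)
    return " ".join(sentence_list)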

Use the third-party regex module instead of re, with the verbs (*SKIP)(*FAIL):

import regex
text = 'Lorem Ipsum, simply dummy text -TOKEN_ABC-, yes! '
res = regex.sub(r'-[A-Z]+(?:_[A-Z]+)*-(*SKIP)(*FAIL)|[^\w\s]+', '', text)
print(res)

Output:

Lorem Ipsum simply dummy text -TOKEN_ABC- yes

Explanation:

-               # a hyphen
[A-Z]+          # 1 or more capitals
(?:             # non-capturing group
    _           # underscore
    [A-Z]+      # 1 or more capitals
)*              # end group, may appear 0 or more times
-               # a hyphen
(*SKIP)         # forget the match and do not retry inside it
(*FAIL)         # and fail
|               # OR
[^\w\s]+        # 1 or more characters that are neither word characters nor whitespace
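If installing the regex module is not an option, a similar effect can probably be had with the standard re module by capturing the protected token and returning it from a replacement callback. This is a sketch of a different technique, not the (*SKIP)(*FAIL) approach above; the function name strip_punctuation is hypothetical.

```python
import re

# Alternative 1 captures a protected token; alternative 2 matches one
# punctuation character (neither a word character nor whitespace).
PATTERN = re.compile(r"(-[A-Z]+(?:_[A-Z]+)*-)|[^\w\s]")

def strip_punctuation(text):
    """Drop punctuation but keep -TOKEN- style words (hypothetical helper)."""
    # If group 1 matched, put the token back; otherwise replace with nothing.
    return PATTERN.sub(lambda m: m.group(1) or "", text)

print(strip_punctuation("Lorem Ipsum, simply dummy text -TOKEN_ABC-, yes!"))
```

The punctuation alternative deliberately matches a single character rather than a run. With a greedy run, input such as ",-TOKEN-" could have its leading hyphen swallowed together with the comma before the token alternative gets a chance to match.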
