Regex is not matching strings with parentheses and \t

I use this code to split a string with a regex:

suffixes = "(adj.|adv.|pron.|num.|num.-m|conj.|part.|aux.|prep.|n.|v.|m.)"
regex = f'^(\w+?)((?:{suffixes}) .*)$'
result = re.sub(regex, r"\1#\2", re.escape(word), re.UNICODE).split("#")

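The code works fine for almost all of the strings I have, but I am facing a problem with these two strings: 'qiān\tnum. thousand' and 'jiànm. (used for clothes among other items) piece'. It looks like the pattern does not match, and I think it is because of the two special characters \t and ().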

My expected result is ['qiānt', 'num. thousand'] and ['jiàn', 'm. (used for clothes among other items) piece'].

I believe your data got corrupted by some API. A simple fix: if you do not expect \ and tab characters in the input strings, replace all tabs with t.

Use

import re

suffixes = r"(?:adj\.|adv\.|pron\.|num\.|num\.-m|conj\.|part\.|aux\.|prep\.|n\.|v\.|m\.)"
regex = fr'^(\w+?)((?:{suffixes}) .*)$'
for sentence in ['qiān\tnum. thousand', 'jiànm. (used for clothes among other items) piece']:
    # Replace each tab character with a literal "t" before matching
    result = re.search(regex, sentence.replace('\t', 't'))
    if result:
        print(result.groups())

See the Python demo.

Results:

('qiānt', 'num. thousand')
('jiàn', 'm. (used for clothes among other items) piece')
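
For reference, here is a minimal sketch of the same idea wrapped in a function that returns the two-element lists the question asks for. The name split_entry is hypothetical, and the tab-to-"t" replacement is the assumption made in the answer above, not something the data confirms. The sketch matches the raw input directly instead of passing it through re.escape, since re.escape is meant for text that goes into a pattern and would add backslashes before the parentheses, dots and spaces of the input. Note also that in re.sub the fourth positional argument is count, so flags are safer passed as flags=.

```python
import re

# Part-of-speech markers from the question; dots are escaped so they match literally.
SUFFIXES = r"(?:adj\.|adv\.|pron\.|num\.|num\.-m|conj\.|part\.|aux\.|prep\.|n\.|v\.|m\.)"
ENTRY_RE = re.compile(rf"^(\w+?)({SUFFIXES} .*)$")


def split_entry(word):
    """Hypothetical helper: return [headword, gloss], or None if nothing matches."""
    # Assumption taken from the answer above: a tab is a corrupted "t".
    match = ENTRY_RE.match(word.replace("\t", "t"))
    return list(match.groups()) if match else None


print(split_entry("qiān\tnum. thousand"))
print(split_entry("jiànm. (used for clothes among other items) piece"))
```

Returning None for non-matching input keeps the failure visible, whereas re.sub followed by split("#") silently returns the unsplit string.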

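Regarding \t: this is what the re docs say about \w for Unicode (str) patterns:

Matches Unicode word characters; this includes most characters that can be part of a word in any language, as well as numbers and the underscore. If the ASCII flag is used, only [a-zA-Z0-9_] is matched.

\t is not a character that can be part of a word, so you need to add it yourself. Try ([\t\w]+?) instead of (\w+?).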
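
A short sketch of this second approach, assuming the tab should be kept as part of the first group rather than replaced. The function name split_keep_tab is made up for illustration, and the suffix list is the one from the question with the dots escaped.

```python
import re

# \w on its own does not match a tab, but a character class can allow both.
assert re.fullmatch(r"\w", "\t") is None
assert re.fullmatch(r"[\t\w]", "\t") is not None

# With the ASCII flag, \w no longer matches accented letters such as "ā".
assert re.findall(r"\w+", "qiān", flags=re.ASCII) == ["qi", "n"]

suffixes = r"(?:adj\.|adv\.|pron\.|num\.|num\.-m|conj\.|part\.|aux\.|prep\.|n\.|v\.|m\.)"
regex = re.compile(rf"^([\t\w]+?)({suffixes} .*)$")


def split_keep_tab(word):
    """Hypothetical helper: split into (headword, gloss), keeping any tab in the headword."""
    match = regex.match(word)
    return match.groups() if match else None


print(split_keep_tab("qiān\tnum. thousand"))
```

Here the first group comes back as 'qiān\t', tab included, so whether this or the replacement approach is right depends on what the tab in the data actually stands for.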
