Split a string using a negated regex pattern

I want to split a string on non-alphanumeric characters, except where they are part of a specific pattern.

Examples:

string_1 = "section (ab) 5(a)"
string_2 = "section -bd, 6(1b)(2)"
string_3 = "section - ac - 12(c)"
string_4 = "Section (ab) 5(1a)(cf) (ad)"
string_5 = "section (ab) 5(a) test (ab) 5 6(ad)"

I want to split these strings so that I get the following output:

["section", "ab", "5(a)"]
["section", "bd", "6(1b)(2)"]
["section", "ac", "12(c)"]
["section", "ab", "5(1a)(cf)", "ad"]
["section", "ab", "5(a)", "test", "ab", "5", "6(ad)"]

More precisely, I want to split on all non-alphanumeric characters except those that belong to the pattern \d+([\w()]+).

You can do this by using the following regex with findall:

\b\w+(?:\([^)]*\))*

RegEx Demo

Code:

>>> import re
>>> reg = re.compile(r'\b\w+(?:\([^)]*\))*')
>>> arr = ['section (ab) 5(a)', 'section -bd, 6(1b)(2)', 'section - ac - 12(c)', 'Section (ab) 5(1a)(cf) (ad)', 'section (ab) 5(a) test (ab) 5 6(ad)']
>>> for el in arr:
...     print ( reg.findall(el) )
...
['section', 'ab', '5(a)']
['section', 'bd', '6(1b)(2)']
['section', 'ac', '12(c)']
['Section', 'ab', '5(1a)(cf)', 'ad']
['section', 'ab', '5(a)', 'test', 'ab', '5', '6(ad)']
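
The pattern above is given without a breakdown, so here is a minimal sketch that spells it out in verbose mode. The name TOKEN is only illustrative, and the layout assumes the pattern is \b\w+(?:\([^)]*\))* as used in the session above.

```python
import re

# Verbose spelling of the same pattern; whitespace and comments are ignored
# by the regex engine because of re.VERBOSE.
TOKEN = re.compile(
    r"""
    \b\w+                # a run of word characters, starting at a word boundary
    (?: \( [^)]* \) )*   # then any number of directly attached (...) groups
    """,
    re.VERBOSE,
)

print(TOKEN.findall("Section (ab) 5(1a)(cf) (ad)"))
```

Note that this pattern keeps a parenthesized group attached to any preceding word, not only to a number, so an input such as "foo(bar)" comes back as a single token.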

You can use

\d+[\w()]+|\w+

See the regex demo.

Details

\d+[\w()]+ - 1+ digits, then 1+ word, ( or ) characters
| - or
\w+ - 1+ word characters
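
To try this pattern outside ElasticSearch, a minimal Python sketch could look like the following. The names TOKEN_RE and tokenize are hypothetical helpers introduced for illustration; the sample strings are the ones from the question.

```python
import re

# Alternation order matters: the digit-led branch is tried first, and the
# plain \w+ branch is the fallback for everything else.
TOKEN_RE = re.compile(r"\d+[\w()]+|\w+")


def tokenize(text):
    """Return every token matched by TOKEN_RE, in order of appearance."""
    return TOKEN_RE.findall(text)


samples = [
    "section (ab) 5(a)",
    "section -bd, 6(1b)(2)",
    "section - ac - 12(c)",
    "Section (ab) 5(1a)(cf) (ad)",
    "section (ab) 5(a) test (ab) 5 6(ad)",
]

for sample in samples:
    print(tokenize(sample))
```

With this pattern only tokens that start with digits keep their parentheses, so "foo(bar)" yields two tokens. The character class does not check that parentheses are balanced, so an input like "5(a" is kept as a single token.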

In ElasticSearch, use

"tokenizer": {
  "my_tokenizer": {
    "type": "pattern",
    "pattern": "\\d+[\\w()]+|\\w+",
    "group": 0
  }
}
