Splitting a string using a negative regex pattern



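I want to split a string on non-alphanumeric characters, except where those characters are part of a specific pattern.

Examples: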

string_1 = "section (ab) 5(a)"
string_2 = "section -bd, 6(1b)(2)"
string_3 = "section - ac - 12(c)"
string_4 = "Section (ab) 5(1a)(cf) (ad)"
string_5 = "section (ab) 5(a) test (ab) 5 6(ad)"

I want to split these strings so that I get the following output:

["section", "ab", "5(a)"]
["section", "bd", "6(1b)(2)"]
["section", "ac", "12(c)"]
["section", "ab", "5(1a)(cf)", "ad"]
["section", "ab", "5(a)", "test", "ab", "5", "6(ad)"]

More precisely, I want to split on all non-alphanumeric characters except those that match the pattern \d+([\w()]+).
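To illustrate the problem, here is a minimal sketch of what a plain split on non-alphanumeric characters does. The helper name naive_split is hypothetical and only used for this illustration; it shows that tokens such as 5(a) get broken apart.

```python
import re

def naive_split(text):
    """Split on runs of non-word characters and drop empty pieces."""
    return [piece for piece in re.split(r'\W+', text) if piece]

# "5(a)" is torn into "5" and "a", which is not the desired result.
print(naive_split("section (ab) 5(a)"))
```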

You can do this by using the following regex with findall:

\b\w+(?:\([^)]*\))*


Code:

>>> import re
>>> reg = re.compile(r'\b\w+(?:\([^)]*\))*')
>>> arr = ['section (ab) 5(a)', 'section -bd, 6(1b)(2)', 'section - ac - 12(c)', 'Section (ab) 5(1a)(cf) (ad)', 'section (ab) 5(a) test (ab) 5 6(ad)']
>>> for el in arr:
...     print(reg.findall(el))
...
['section', 'ab', '5(a)']
['section', 'bd', '6(1b)(2)']
['section', 'ac', '12(c)']
['Section', 'ab', '5(1a)(cf)', 'ad']
['section', 'ab', '5(a)', 'test', 'ab', '5', '6(ad)']

You can use

\d+[\w()]+|\w+


Details

  • \d+[\w()]+ - 1+ digits, then 1+ word, ( or ) characters
  • | - or
  • \w+ - 1+ word characters

In ElasticSearch, use

"tokenizer": {
  "my_tokenizer": {
    "type": "pattern",
    "pattern": "\\d+[\\w()]+|\\w+",
    "group": 0
  }
}
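Because the pattern sits inside a JSON string, every regex backslash has to be doubled there. One way to avoid escaping by hand is to generate the fragment from the raw pattern, sketched below under the assumption that you build the settings in Python. The variable names are hypothetical, and only the tokenizer fragment from above is produced, not a complete index definition.

```python
import json

# The regex exactly as you would write it in a raw Python string.
PATTERN = r'\d+[\w()]+|\w+'

tokenizer_fragment = {
    "tokenizer": {
        "my_tokenizer": {
            "type": "pattern",
            "pattern": PATTERN,
            "group": 0,  # emit the whole match as a token instead of splitting on it
        }
    }
}

# json.dumps doubles each backslash so the output is valid JSON.
fragment_json = json.dumps(tokenizer_fragment, indent=2)
print(fragment_json)
```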
