I want to split a string on non-alphanumeric characters, except where they belong to a specific pattern.
Examples:
string_1 = "section (ab) 5(a)"
string_2 = "section -bd, 6(1b)(2)"
string_3 = "section - ac - 12(c)"
string_4 = "Section (ab) 5(1a)(cf) (ad)"
string_5 = "section (ab) 5(a) test (ab) 5 6(ad)"
I want to split these strings so that I get the following output:
["section", "ab", "5(a)"]
["section", "bd", "6(1b)(2)"]
["section", "ac", "12(c)"]
["section", "ab", "5(1a)(cf)", "ad"]
["section", "ab", "5(a)", "test", "ab", "5", "6(ad)"]
More precisely, I want to split on all non-alphanumeric characters except those that are part of the pattern \d+([\w()]+).
This can be done by using the following regex inside findall:
\b\w+(?:\([^)]*\))*
Code:
>>> import re
>>> reg = re.compile(r'\b\w+(?:\([^)]*\))*')
>>> arr = ['section (ab) 5(a)', 'section -bd, 6(1b)(2)', 'section - ac - 12(c)', 'Section (ab) 5(1a)(cf) (ad)', 'section (ab) 5(a) test (ab) 5 6(ad)']
>>> for el in arr:
...     print(reg.findall(el))
...
['section', 'ab', '5(a)']
['section', 'bd', '6(1b)(2)']
['section', 'ac', '12(c)']
['Section', 'ab', '5(1a)(cf)', 'ad']
['section', 'ab', '5(a)', 'test', 'ab', '5', '6(ad)']
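To make the pattern above easier to read, here is a minimal sketch that spells out the same regex with re.VERBOSE and comments on each part. The names TOKEN and tokenize are hypothetical, chosen only for this illustration. Note that this pattern keeps parenthesized groups attached to any word, not only to tokens that start with digits; that happens to be fine for the sample strings because "(ab)" is always preceded by a space.

```python
import re

# Same pattern as in the answer, written in verbose mode so that
# whitespace is ignored and each part can carry a comment.
TOKEN = re.compile(r"""
    \b\w+            # a run of word characters starting at a word boundary
    (?:              # followed by zero or more ...
        \( [^)]* \)  #   ... parenthesized groups directly attached to it
    )*
""", re.VERBOSE)


def tokenize(text):
    """Return all tokens found in text (hypothetical helper)."""
    return TOKEN.findall(text)


print(tokenize("Section (ab) 5(1a)(cf) (ad)"))
# ['Section', 'ab', '5(1a)(cf)', 'ad']
```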
You can use
\d+[\w()]+|\w+
Details:
\d+[\w()]+ - one or more digits, followed by one or more word, ( or ) characters
| - or
\w+ - one or more word characters
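As a quick check, the alternation can be tried in Python with re.findall on the five strings from the question. This is only a sketch; the variable names pattern, samples and results are chosen for illustration. The digit-led alternative is listed first, so a token such as 5(a) is consumed whole before the plain \w+ alternative gets a chance, while a lone 5 falls through to \w+.

```python
import re

# Digit-led tokens (with attached parentheses) are tried first,
# plain words second.
pattern = re.compile(r"\d+[\w()]+|\w+")

samples = [
    "section (ab) 5(a)",
    "section -bd, 6(1b)(2)",
    "section - ac - 12(c)",
    "Section (ab) 5(1a)(cf) (ad)",
    "section (ab) 5(a) test (ab) 5 6(ad)",
]

results = [pattern.findall(s) for s in samples]
for tokens in results:
    print(tokens)
# ['section', 'ab', '5(a)']
# ['section', 'bd', '6(1b)(2)']
# ['section', 'ac', '12(c)']
# ['Section', 'ab', '5(1a)(cf)', 'ad']
# ['section', 'ab', '5(a)', 'test', 'ab', '5', '6(ad)']
```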
In Elasticsearch, use
"tokenizer": {
  "my_tokenizer": {
    "type": "pattern",
    "pattern": "\\d+[\\w()]+|\\w+",
    "group": 0
  }
}