How do I replace a string with its matched group, if a match exists?
text1 = "some text (ID: 1234) some text some text" # --> 1234
text2 = "some text (ID: abc) some (text) some text" # --> abc
text3 = "some text some text some text some text" # --> some text some text some text some text
texts = pd.Series([text1, text2, text3])
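I know the regex functions; I want to use a lambda function to replace each text with its ID if and only if there is a match. The re pattern that identifies it would be '.*\(ID:\s(.*)\).*', where the group (.*) is the ID I need.
re.sub seems to replace only the entire matched portion, while re.search ...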
texts = texts.apply(lambda x: SOME_REGEX)
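The pattern you provided, '.*\(ID:\s(.*)\).*', contains one group, (.*). A group is a sub-pattern enclosed in unescaped parentheses. Since it is the first group in the pattern, you can refer to it in the replacement argument of re.sub as r'\1'.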
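As a minimal sketch of the backreference described above, applied to two plain strings rather than a Series (the variable names are illustrative only, and the pattern is written as a raw string so the backslashes reach the regex engine unchanged):

```python
import re

# Raw strings keep \( \s \) and \1 intact for the regex engine.
pattern = r'.*\(ID:\s(.*)\).*'

with_id = "some text (ID: 1234) some text some text"
without_id = "some text some text some text some text"

# The pattern matches the whole string, so the whole string is
# replaced by whatever group 1 captured.
replaced = re.sub(pattern, r'\1', with_id)

# No match here, so re.sub returns the input unchanged.
untouched = re.sub(pattern, r'\1', without_id)

print(replaced)   # 1234
print(untouched)  # some text some text some text some text
```

Because re.sub leaves a non-matching string alone, no explicit "if it matches" check is needed for the third example text.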
For the question you asked:
|-------------------------------------------|-----------------------------------------|
| texts | cleaned |
|-------------------------------------------|-----------------------------------------|
|"some text (ID: 1234) some text some text" | 1234 |
|"some text (ID: abc) some (text) some text"| abc |
|"some text some text some text some text" |"some text some text some text some text"|
|-------------------------------------------|-----------------------------------------|
the following code should work:
pattern = `'.*\(ID:\s(.*)\).*'`
cleaned = texts.apply(lambda x: re.sub(pattern, r'\1', flag=IGNORECASE))
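I think you should make the .* inside the group non-greedy by writing .*?, or use a negated character class such as [^()]+, which matches any character except parentheses. Otherwise group 1 will not contain just abc for the second text: the greedy .* runs on to the last closing parenthesis in the string.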
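To make the difference concrete, here is a small sketch comparing the three variants on the second example text (the variable names are only for illustration):

```python
import re

text = "some text (ID: abc) some (text) some text"

greedy = r'.*\(ID:\s(.*)\).*'        # group runs to the LAST ')'
lazy = r'.*\(ID:\s(.*?)\).*'         # group stops at the FIRST ')'
negated = r'.*\(ID:\s([^()]+)\).*'   # group cannot contain parentheses

greedy_id = re.sub(greedy, r'\1', text)
lazy_id = re.sub(lazy, r'\1', text)
negated_id = re.sub(negated, r'\1', text)

print(greedy_id)   # abc) some (text
print(lazy_id)     # abc
print(negated_id)  # abc
```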
Also, there should be no backticks around the pattern, the keyword argument is flags rather than flag, and its value should be re.IGNORECASE.
import re
import pandas as pd
text1 = "some text (ID: 1234) some text some text" # --> 1234
text2 = "some text (ID: abc) some (text) some text" # --> abc
text3 = "some text some text some text some text" # --> some text some text some text some text
texts = pd.Series([text1, text2, text3])
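# Raw strings keep the backslashes: \( and \) match literal parentheses, \s matches whitespace
pattern = r'.*\(ID:\s([^()]+)\).*'
# \1 is replaced by the text captured in group 1; non-matching strings are returned unchanged
cleaned = texts.apply(lambda x: re.sub(pattern, r'\1', x, flags=re.IGNORECASE))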
print(cleaned)
Output:
0 1234
1 abc
2 some text some text some text some text
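If you would rather spell out the "only if it matches" condition with re.search, as the question hints at, one possible sketch is the helper below. The names ID_PATTERN and extract_id are hypothetical, not part of re or pandas, and the example uses a plain list so that it runs without pandas:

```python
import re

# Hypothetical helper names; only re.compile, search and group are library calls.
ID_PATTERN = re.compile(r'\(ID:\s([^()]+)\)', flags=re.IGNORECASE)

def extract_id(text):
    """Return the ID inside '(ID: ...)' if present, otherwise the text unchanged."""
    match = ID_PATTERN.search(text)
    return match.group(1) if match else text

samples = [
    "some text (ID: 1234) some text some text",
    "some text (ID: abc) some (text) some text",
    "some text some text some text some text",
]
results = [extract_id(s) for s in samples]
print(results)
```

With the Series from the question, texts.apply(extract_id) should give the same values as the re.sub version; which one reads better is largely a matter of taste.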