Extract a substring in Python if a pattern exists



How can I replace a string with its matched group, but only if a match exists?

text1 = "some text (ID: 1234) some text some text"  # --> 1234
text2 = "some text (ID: abc) some (text) some text" # --> abc 
text3 = "some text some text some text some text"   # --> some text some text some text some text
texts = pd.Series([text1, text2, text3])

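I know a regex function can be used inside the lambda to replace the text with the ID if, and only if, the pattern matches. The re pattern that identifies it would be '.*\(ID:\s(.*)\).*', where the group (.*) is the ID I need.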

re.sub seems to replace only the part that was matched as a whole, while re.search

texts = texts.apply(lambda x: SOME_REGEX)

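The pattern you provided, '.*\(ID:\s(.*)\).*', contains one group, (.*). A group is a sub-pattern enclosed in unescaped parentheses. Because it is the first group in the pattern, you can refer to it as '\1' in the replacement string passed to re.sub.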
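As a minimal sketch of that backreference, assuming the pattern from the question is written as a raw string so the backslashes reach the regex engine intact (the variable names `matched` and `unmatched` are only illustrative):

```python
import re

# Pattern from the question, as a raw string.
pattern = r'.*\(ID:\s(.*)\).*'

# The whole string matches, so it is replaced by group 1 alone.
matched = re.sub(pattern, r'\1', "some text (ID: 1234) some text some text")

# No "(ID: ...)" present, so re.sub returns the input unchanged.
unmatched = re.sub(pattern, r'\1', "some text some text some text some text")

print(matched)    # 1234
print(unmatched)  # some text some text some text some text
```

Because the leading and trailing `.*` make the pattern cover the entire string, substituting `\1` keeps only the ID when there is a match and leaves the text alone otherwise.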

For the problem you asked about:

|-------------------------------------------|-----------------------------------------|
|          texts                            |               cleaned                   |
|-------------------------------------------|-----------------------------------------|
|"some text (ID: 1234) some text some text" |                1234                     |
|"some text (ID: abc) some (text) some text"|                abc                      |
|"some text some text some text some text"  |"some text some text some text some text"|
|-------------------------------------------|-----------------------------------------|

The following code works:

pattern = `'.*\(ID:\s(.*)\).*'`
cleaned = texts.apply(lambda x: re.sub(pattern, r'\1', flag=IGNORECASE))

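I think you should either make the .* inside the group non-greedy by using .*?, or use a negated character class, [^()]+, which matches any character except parentheses. Otherwise group 1 will not contain just abc for the second text, because the greedy (.*) runs on to the last closing parenthesis in the string.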
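A short comparison may make the difference concrete. This is only a sketch: it runs the three variants of the group against the second sample text, and the names `greedy`, `lazy`, and `negated` are illustrative.

```python
import re

text = "some text (ID: abc) some (text) some text"

# Greedy group: runs up to the LAST ")" in the string.
greedy = re.sub(r'.*\(ID:\s(.*)\).*', r'\1', text)

# Non-greedy group: stops at the first ")" that lets the match succeed.
lazy = re.sub(r'.*\(ID:\s(.*?)\).*', r'\1', text)

# Negated character class: the group cannot contain parentheses at all.
negated = re.sub(r'.*\(ID:\s([^()]+)\).*', r'\1', text)

print(greedy)   # abc) some (text
print(lazy)     # abc
print(negated)  # abc
```

The negated class is the stricter of the two fixes, since it also refuses an empty ID such as "(ID: )".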

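The pattern should not be wrapped in backticks, and the keyword argument is flags, not flag, with the value re.IGNORECASE. re.sub also needs the input string x as its third argument.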

import re
import pandas as pd
text1 = "some text (ID: 1234) some text some text"  # --> 1234
text2 = "some text (ID: abc) some (text) some text" # --> abc 
text3 = "some text some text some text some text"   # --> some text some text some text some text
texts = pd.Series([text1, text2, text3])
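# Raw string keeps the backslashes; [^()]+ stops the group at the first parenthesis
pattern = r'.*\(ID:\s([^()]+)\).*'
# Replace the whole matching string with group 1; non-matching strings are returned unchanged
cleaned = texts.apply(lambda x: re.sub(pattern, r'\1', x, flags=re.IGNORECASE))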
print(cleaned)

Output
0                                       1234
1                                        abc
2    some text some text some text some text
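Since the question also mentions re.search, here is a hedged alternative sketch that searches for the ID and falls back to the original text when nothing is found. The names `ID_PATTERN` and `extract_id` are hypothetical helpers introduced for illustration, not part of the answers above.

```python
import re

# Hypothetical helper names; the pattern only needs to cover the "(ID: ...)" part.
ID_PATTERN = re.compile(r'\(ID:\s([^()]+)\)', flags=re.IGNORECASE)

def extract_id(text):
    """Return the ID inside "(ID: ...)" if present, else the text unchanged."""
    match = ID_PATTERN.search(text)
    return match.group(1) if match else text

samples = [
    "some text (ID: 1234) some text some text",
    "some text (ID: abc) some (text) some text",
    "some text some text some text some text",
]
results = [extract_id(s) for s in samples]
print(results)
```

With a pandas Series, the same function could be passed directly as `texts.apply(extract_id)`. Using search avoids the leading and trailing `.*`, so the pattern describes only the part that matters.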
