Python: iterating over regex conditions to replace a column's values



Suppose I have a non-standardized DataFrame whose entries contain keywords, like this:

import pandas as pd

data = pd.DataFrame({'tool_description': ['bond assy fixture', 'pierce die', 'cad geometrical non-template',
                                          '707 bond assy fixture', 'john pierce die', '123 cad geometrical non-template',
                                          'jjashd bond assy fixture', '10481 pierce die', '81235 cad geometrical non-template']})

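The DataFrame:

tool_description
bond assy fixture
pierce die
cad geometrical non-template
707 bond assy fixture
john pierce die
123 cad geometrical non-template
jjashd bond assy fixture
10481 pierce die
81235 cad geometrical non-template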

You can write a function that looks for a matching substring and returns 'nan' if none is found:

def replace(s):
    keywords = ['bond assy fixture', 'pierce', 'cad geometrical non-template']
    try:
        # first keyword that occurs anywhere in s
        return next(i for i in keywords if i in s)
    except StopIteration:
        # no keyword matched
        return 'nan'

You can then use apply to perform the replacement:

>>> data['standardized_column'] = data.tool_description.apply(replace)
>>> data
                     tool_description           standardized_column
0                   bond assy fixture             bond assy fixture
1                          pierce die                        pierce
2        cad geometrical non-template  cad geometrical non-template
3               707 bond assy fixture             bond assy fixture
4                     john pierce die                        pierce
5    123 cad geometrical non-template  cad geometrical non-template
6            jjashd bond assy fixture             bond assy fixture
7                    10481 pierce die                        pierce
8  81235 cad geometrical non-template  cad geometrical non-template

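If you need something more sophisticated than a simple substring check, you can use a regular expression inside the replace function in place of the `if i in s` test.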
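As a rough sketch of that suggestion, the function below swaps the substring test for a compiled regular expression per keyword. The names `replace_regex`, `KEYWORDS`, and `PATTERNS` are hypothetical, and the choice of word boundaries plus case-insensitive matching is an assumption about what "more sophisticated" might mean here.

```python
import re

# Same keywords as in the answer above.
KEYWORDS = ['bond assy fixture', 'pierce', 'cad geometrical non-template']

# One compiled pattern per keyword. re.escape guards against regex
# metacharacters in a keyword; \b requires a whole-word match (assumption).
PATTERNS = [(kw, re.compile(rf"\b{re.escape(kw)}\b", re.IGNORECASE)) for kw in KEYWORDS]


def replace_regex(s):
    """Return the first keyword found as a whole word in s, else 'nan'."""
    for keyword, pattern in PATTERNS:
        if pattern.search(s):
            return keyword
    return 'nan'
```

It can be passed to apply in the same way as the original function, for example `data.tool_description.apply(replace_regex)`.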

You are already close to a solution; it only needs a little polishing, as follows:

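import numpy as np
keywords = ['bond assy fixture', 'pierce', 'cad geometrical non-template']
data['standardized_column'] = pd.Series(np.nan, index=data.index, dtype=object)  # init column to NaN (object dtype so strings can be assigned)
for word in keywords:
    mask = data.tool_description.str.contains(rf"\b{word}\b", case=False, regex=True)
    data.loc[mask, 'standardized_column'] = word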

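Here we wrap each keyword in the word-boundary anchor \b inside the regex to avoid partial-word matches, so pierce does not match mpierce. Python's plain `StringA in StringB` test would give a false match in that case, because 'pierce' in 'mpierce' is True, which is not what we want to match.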
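The difference can be checked directly with the standard `re` module. This is only an illustration: the string 'mpierce die' is a made-up value that does not appear in the sample data.

```python
import re

description = 'mpierce die'  # hypothetical value, not in the sample data

# Plain substring test: also matches inside a longer word.
substring_hit = 'pierce' in description

# Word-boundary regex: 'pierce' must stand alone as a word.
boundary_hit = re.search(r"\bpierce\b", description, flags=re.IGNORECASE) is not None

# The same pattern still matches when the keyword is a separate word.
whole_word_hit = re.search(r"\bpierce\b", 'john pierce die', flags=re.IGNORECASE) is not None

print(substring_hit, boundary_hit, whole_word_hit)  # True False True
```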

Result:

print(data)

                     tool_description           standardized_column
0                   bond assy fixture             bond assy fixture
1                          pierce die                        pierce
2        cad geometrical non-template  cad geometrical non-template
3               707 bond assy fixture             bond assy fixture
4                     john pierce die                        pierce
5    123 cad geometrical non-template  cad geometrical non-template
6            jjashd bond assy fixture             bond assy fixture
7                    10481 pierce die                        pierce
8  81235 cad geometrical non-template  cad geometrical non-template
