Suppose I have a DataFrame that is not standardized, containing keywords like the following:
import pandas as pd

data = pd.DataFrame({'tool_description': ['bond assy fixture', 'pierce die', 'cad geometrical non-template',
                                          '707 bond assy fixture', 'john pierce die', '123 cad geometrical non-template',
                                          'jjashd bond assy fixture', '10481 pierce die', '81235 cad geometrical non-template']})
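DataFrame:
tool_description |
---|
bond assy fixture |
pierce die |
cad geometrical non-template |
707 bond assy fixture |
john pierce die |
123 cad geometrical non-template |
jjashd bond assy fixture |
10481 pierce die |
81235 cad geometrical non-template |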
You can write a function that tries to find a matching substring and returns 'nan' if none is found:
def replace(s):
    keywords = ['bond assy fixture', 'pierce', 'cad geometrical non-template']
    try:
        # Return the first keyword that occurs as a substring of s
        return next(i for i in keywords if i in s)
    except StopIteration:
        # No keyword matched
        return 'nan'
You can then use apply to make this replacement:
>>> data['standardized_column'] = data.tool_description.apply(replace)
>>> data
                     tool_description           standardized_column
0                   bond assy fixture             bond assy fixture
1                          pierce die                        pierce
2        cad geometrical non-template  cad geometrical non-template
3               707 bond assy fixture             bond assy fixture
4                     john pierce die                        pierce
5    123 cad geometrical non-template  cad geometrical non-template
6            jjashd bond assy fixture             bond assy fixture
7                    10481 pierce die                        pierce
8  81235 cad geometrical non-template  cad geometrical non-template
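If you need something more sophisticated than a simple substring check, you can also use a regular expression inside the replace function in place of the "i in s" test.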
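As a rough sketch of that idea, the substring test can be swapped for a compiled pattern per keyword. The names replace_regex, KEYWORDS and PATTERNS are made up for this illustration, and the small sample frame (including the 'mpierce die' row) is an assumption added to show a non-match:

```python
import re

import numpy as np
import pandas as pd

KEYWORDS = ['bond assy fixture', 'pierce', 'cad geometrical non-template']

# One compiled pattern per keyword; re.escape guards against regex metacharacters,
# and \b on both sides requires the keyword to appear as whole words.
PATTERNS = [(kw, re.compile(rf"\b{re.escape(kw)}\b", re.IGNORECASE)) for kw in KEYWORDS]


def replace_regex(s):
    """Return the first keyword found in s as whole words, else NaN."""
    for kw, pattern in PATTERNS:
        if pattern.search(s):
            return kw
    return np.nan


# Hypothetical sample data; 'mpierce die' is included only to show a non-match.
data = pd.DataFrame({'tool_description': ['707 bond assy fixture', 'john pierce die', 'mpierce die']})
data['standardized_column'] = data.tool_description.apply(replace_regex)
```

This sketch returns np.nan rather than the string 'nan', so that unmatched rows can be found with isna(). That is a choice made here, not something the answer above prescribes.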
You are already close to a solution; it only needs a little polishing, as follows:
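import numpy as np
keywords = ['bond assy fixture', 'pierce', 'cad geometrical non-template']
data['standardized_column'] = pd.Series(np.nan, index=data.index, dtype=object)  # init column to NaN (object dtype so strings can be assigned)
for word in keywords:
    mask = data.tool_description.str.contains(rf"\b{word}\b", case=False, regex=True)  # whole-word, case-insensitive match
    data.loc[mask, 'standardized_column'] = word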
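Here we wrap each keyword in the word-boundary anchor \b inside the regex to avoid partial-word matches, so that pierce does not match inside a longer word such as mpierce. Python's StringA in StringB test would produce a false match here, because 'pierce' in 'mpierce' is True, which is not what we want to match.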
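To make the difference concrete, here is a small standalone comparison of the two tests. The sample strings are made up for illustration:

```python
import re

# Plain substring test: also matches inside a longer word.
substring_hit = 'pierce' in 'mpierce die'

# Word-boundary regex: requires 'pierce' to stand alone as a word.
pattern = re.compile(r"\bpierce\b", re.IGNORECASE)
boundary_hit_partial = pattern.search('mpierce die') is not None
boundary_hit_whole = pattern.search('john Pierce die') is not None

print(substring_hit, boundary_hit_partial, boundary_hit_whole)  # True False True
```

Note that \b only works as a word boundary when it is written in a raw string (r"..." or rf"..."). In an ordinary string literal, "\b" is a backspace character.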
Result:
print(data)
                     tool_description           standardized_column
0                   bond assy fixture             bond assy fixture
1                          pierce die                        pierce
2        cad geometrical non-template  cad geometrical non-template
3               707 bond assy fixture             bond assy fixture
4                     john pierce die                        pierce
5    123 cad geometrical non-template  cad geometrical non-template
6            jjashd bond assy fixture             bond assy fixture
7                    10481 pierce die                        pierce
8  81235 cad geometrical non-template  cad geometrical non-template
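If the keyword list grows, one possible variation (not what the answer above does) is to combine all keywords into a single alternation pattern and let str.extract pull out the match in one pass. This is only a sketch; the shortened sample frame and the 'mpierce die' row are assumptions for illustration:

```python
import re

import pandas as pd

keywords = ['bond assy fixture', 'pierce', 'cad geometrical non-template']

# Hypothetical sample data; 'mpierce die' is included only to show a non-match.
data = pd.DataFrame({'tool_description': ['bond assy fixture', 'john pierce die',
                                          '81235 cad geometrical non-template', 'mpierce die']})

# Build one pattern with a single capture group: \b(kw1|kw2|kw3)\b
pattern = r"\b(" + "|".join(re.escape(k) for k in keywords) + r")\b"

# expand=False returns a Series; rows with no match become NaN.
data['standardized_column'] = data.tool_description.str.extract(
    pattern, flags=re.IGNORECASE, expand=False
)
```

One difference from the loop: str.extract returns the text as it appears in the description, so with case-insensitive matching a value such as 'Pierce' would keep its capital letter. Appending .str.lower() is one way to normalize it if the keywords are all lowercase.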