df:
Id Product_description Vol
100 Oil 32cmX36cmx25cm 1s 1
101 Oil 32cmX36cmx30cm 1M 1
102 bag 45cmx3f 1s 1
103 bag 45cmx32f 2L 2
104 BAG 45cmx32f 3 L 3
105 BELT 135L 50S 50
106 BELT 194L 50S NaN
107 BELT 194L 50 L#144 NaN
108 BELT L NaN
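I need to do two things here:
- Replace the NaN values in the Vol column (the value is the last number that is followed by a unit such as "L" or "S").
- Add a Unit column based on the Vol and Product_description columns.
Expected output: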
Id Product_description Vol Unit(Check Vol and get Unit from Product_description)
100 Oil 32cmX36cmx25cm 1s 1 s
101 Oil 32cmX36cmx30cm 1M 1 M
102 bag 45cmx3f 1s 1 s
103 bag 45cmx32f 2L 2 L
104 BAG 45cmx32f 3 L 3 L
105 BELT 135L 50S 50 S
106 BELT 194L 50S 50 S
107 BELT 194L 50 L#144 50 L
108 BELT L NaN NaN
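You can use a regex with named capture groups: extractall to get all matches, groupby + last to keep only the last match per row, then join the result back to the original DataFrame: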
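(df
 .drop(columns='Vol')
 .join(df['Product_description']
       # capture every "<digits><optional whitespace><letters>" pair
       .str.extractall(r'(?P<Vol>\d+)\s*(?P<Unit>[a-zA-Z]+)')
       # keep only the last match for each original row
       .groupby(level=0).last()
      )
)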
Output:
Id Product_description Vol Unit
0 100 Oil 32cmX36cmx25cm 1s 1 s
1 101 Oil 32cmX36cmx30cm 1M 1 M
2 102 bag 45cmx3f 1s 1 s
3 103 bag 45cmx32f 2L 2 L
4 104 BAG 45cmx32f 3 L 3 L
5 105 BELT 135L 50S 50 S
6 106 BELT 194L 50S 50 S
7 107 BELT 194L 50 L#144 50 L
8 108 BELT L NaN NaN
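For anyone who wants to reproduce this, here is a minimal self-contained sketch of the same steps. The DataFrame is rebuilt by hand from the sample shown in the question, and the names `pattern`, `matches`, `last` and `out` are only illustrative.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question
df = pd.DataFrame({
    "Id": [100, 101, 102, 103, 104, 105, 106, 107, 108],
    "Product_description": [
        "Oil 32cmX36cmx25cm 1s",
        "Oil 32cmX36cmx30cm 1M",
        "bag 45cmx3f 1s",
        "bag 45cmx32f 2L",
        "BAG 45cmx32f 3 L",
        "BELT 135L 50S",
        "BELT 194L 50S",
        "BELT 194L 50 L#144",
        "BELT L",
    ],
    "Vol": [1, 1, 1, 2, 3, 50, np.nan, np.nan, np.nan],
})

# digits, optional whitespace, then letters; raw string keeps the backslashes
pattern = r"(?P<Vol>\d+)\s*(?P<Unit>[a-zA-Z]+)"

# One row per match, indexed by (original row, match number)
matches = df["Product_description"].str.extractall(pattern)

# Collapse to the last match of each original row
last = matches.groupby(level=0).last()

# Rows without any match (e.g. "BELT L") get NaN after the join
out = df.drop(columns="Vol").join(last)
print(out)
```

Note that the extracted Vol values are strings, since they come from a regex capture; convert them with `pd.to_numeric` if numbers are needed.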
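Note: the above ignores the original "Vol" column and simply extracts the data again from scratch. If you don't want to overwrite existing values that might differ (even though they don't match the string), you can do this instead: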
df2 = (df['Product_description']
       .str.extractall(r'(?P<Vol>\d+)\s*(?P<Unit>[a-zA-Z]+)')
       .groupby(level=0).last()
      )
# existing non-NaN values in df win; only the gaps are filled from df2
df.combine_first(df2)
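A small sketch of the difference, under a few assumptions of mine: the Vol of 7 for Id 100 is a made-up value (not from the question) chosen so that it disagrees with the description, the extracted Vol is converted with `pd.to_numeric` so the column does not end up mixing floats and strings, and the columns are reselected explicitly because `combine_first` may return them in a different order.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Id": [100, 106, 107, 108],
    "Product_description": [
        "Oil 32cmX36cmx25cm 1s",
        "BELT 194L 50S",
        "BELT 194L 50 L#144",
        "BELT L",
    ],
    # 7 is a hypothetical value that differs from the "1s" in the text
    "Vol": [7, np.nan, np.nan, np.nan],
})

df2 = (df["Product_description"]
       .str.extractall(r"(?P<Vol>\d+)\s*(?P<Unit>[a-zA-Z]+)")
       .groupby(level=0).last())

# extracted values are strings; make Vol numeric before combining
df2["Vol"] = pd.to_numeric(df2["Vol"])

# values already present in df are kept, NaN gaps are filled from df2
out = df.combine_first(df2)[["Id", "Product_description", "Vol", "Unit"]]
print(out)
```

Here the first row keeps its original Vol of 7 while still getting Unit "s" from the description, and the NaN rows are filled with 50 where a match exists.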