Pandas DataFrame: extracting numeric values (including decimals) from strings



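I have a DataFrame with a column of strings, and I want to extract the numbers from those strings. However, some values are in metres and some are in kilometres. How can I detect whether the number is followed by "m" or "km", normalize the units, and then extract the numbers into a new column?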

details                 numbers
Distance                350m
Longest straight        860m
Top speed               305km
Full throttle           61 per cent

Expected output:

details                 numbers
Distance                350
Longest straight        860
Top speed               305000
Full throttle           61

Use:

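# Mask of rows where the digits are directly followed by "km"
m = df['numbers'].str.contains(r'\d+km')
# Extract the first run of digits and cast it to int
df['numbers'] = df['numbers'].str.extract(r'(\d+)', expand=False).astype(int)
# Convert the kilometre rows to metres
df.loc[m, 'numbers'] *= 1000
print(df)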
            details  numbers
0          Distance      350
1  Longest straight      860
2         Top speed   305000
3     Full throttle       61

Explanation:

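  1. Get a boolean mask of the km values with contains
  2. Extract the integer part with extract and cast it to int
  3. Multiply the km values by 1000 to convert them to metres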
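
The three steps above can be sketched as a self-contained script. The sample frame is rebuilt from the question's data, and the intermediate names km_mask and values are illustrative choices, not part of the original answer:

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "details": ["Distance", "Longest straight", "Top speed", "Full throttle"],
    "numbers": ["350m", "860m", "305km", "61 per cent"],
})

# Step 1: boolean mask, True where digits are directly followed by "km".
km_mask = df["numbers"].str.contains(r"\d+km")

# Step 2: pull out the first run of digits in each string and cast to int.
values = df["numbers"].str.extract(r"(\d+)", expand=False).astype(int)

# Step 3: scale only the kilometre rows so that everything is in metres.
values.loc[km_mask] *= 1000

df["numbers"] = values
print(df)
```

Note that the mask has to be computed before the column is overwritten, because the unit suffix is lost once the digits are extracted.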

Edit: To extract floats, change the regular expression passed to extract (following this solution) and cast to float at the end:

print (df)
            details      numbers
0          Distance        1.7km
1  Longest straight       860.8m
2         Top speed        305km
3     Full throttle  61 per cent
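# Mask of rows where the digits are directly followed by "km"
m = df['numbers'].str.contains(r'\d+km')
# Match a decimal number first, otherwise fall back to a plain integer
df['numbers'] = df['numbers'].str.extract(r'(\d*\.\d+|\d+)', expand=False).astype(float)
# Convert the kilometre rows to metres
df.loc[m, 'numbers'] *= 1000
print(df)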
            details   numbers
0          Distance    1700.0
1  Longest straight     860.8
2         Top speed  305000.0
3     Full throttle      61.0
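
If more units need to be handled, one possible variation (not the approach used in the answer above) is to capture the number and its unit in a single extract call and look the multiplier up in a table. This is only a sketch: the helper name to_metres, the UNIT_TO_METRES table, and the choice to treat an unrecognized unit as a factor of 1 are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical lookup table: factor that converts each unit to metres.
UNIT_TO_METRES = {"km": 1000.0, "m": 1.0}


def to_metres(col: pd.Series) -> pd.Series:
    """Extract the first number in each string and scale it by its unit."""
    # Named groups become the columns "value" and "unit" of the result.
    # The unit is optional and must end at a word boundary, so the "p" in
    # "61 per cent" or the "m" in "metres" is not mistaken for a unit.
    parts = col.str.extract(r"(?P<value>\d*\.\d+|\d+)\s*(?P<unit>km|m)?\b")
    # Rows without a recognized unit get a factor of 1 (an assumption).
    factor = parts["unit"].map(UNIT_TO_METRES).fillna(1.0)
    return pd.to_numeric(parts["value"]) * factor


numbers = pd.Series(["1.7km", "860.8m", "305km", "61 per cent"])
result = to_metres(numbers)
print(result)
```

Adding another unit then only requires a new entry in the table and in the alternation of the pattern.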
