我正在尝试查找数据帧中任何列中出现单词"gas"、diesel或"ev"的实例(不区分大小写(。如果在这些列中找到这些单词的任何版本,我想在题为"fuel"的新列中输入燃料类型的缩写。
excerpt of my dataframe
SUMN SOUN MATN
Light duty vehicle Diesel Tire wear Rubber
Heavy duty diesel Non-catalyst Diesel
Light duty truck catalyst Gasoline
Medium duty vehicle EV brake wear brakes
What I'm hoping to output
SUMN SOUN MATN FUEL
Light duty vehicle Diesel Tire wear Rubber DSL
Heavy duty diesel Non-catalyst Diesel DSL
Light duty truck catalyst Gasoline GAS
Medium duty vehicle EV brake wear brakes ELEC
我该如何做到这一点?
我一开始就可以在一列中查找一种类型的字符串,但却被如何越过这一点所困扰。
df['FUEL'] = df['SUMN'].str.contains('diesel', case=False)
这里有一种方法,使用apply
和str.contains
来检查每个单词的所有列。最后,我们将单词映射到正确的单词,例如ev -> ELECT
。
注意,我在正则表达式中使用了?i
,这使得它不区分大小写:
words = ['gas', 'diesel', 'ev']
mapping = {'gas':'GAS', 'diesel':'DSL', 'ev':'ELEC'}
for word in words:
m = df.apply(lambda x: x.str.contains(f'(?i)({word})')).any(axis=1)
df.loc[m, 'FUEL'] = mapping[word]
输出
SUMN SOUN MATN FUEL
0 Light duty vehicle Diesel Tire wear Rubber DSL
1 Heavy duty diesel Non-catalyst Diesel DSL
2 Light duty truck catalyst Gasoline GAS
3 Medium duty vehicle EV brake wear brakes ELEC
肯定有一个更优化的解决方案,但希望这能让你走上正确的道路。。。基本上循环通过每一行,循环通过列和潜在的燃料串,并决定使用哪个缩写:
d={'diesel':'DSL','gasoline':'GAS','ev':'ELEC'}
df['all'] = df.apply(''.join, axis=1)
for i,row in df.iterrows():
df.at[i,'FUEL'] = d[[key for key in d.keys() if key in row['all'].lower()][0]]
del df['all']
输出:
SUMN SOUN MATN FUEL
0 Light duty vehicle Diesel Tire wear Rubber DSL
1 Heavy duty diesel Non-catalyst Diesel DSL
2 Light duty truck catalyst Gasoline GAS
3 Medium duty vehicle EV brake wear brakes ELEC
这假设每排中只有一种燃料类型
编辑:灵感来自另一个解决方案:
import re
d={'diesel':'DSL','gasoline':'GAS','ev':'ELEC'}
df['FUEL'] = df.apply(lambda x: d[re.search('gasoline|diesel|ev',''.join(x).lower()).group()], axis=1)
相同输出:(