在列中搜索一组特定的文本,如果找到文本,则在新列panda中输入新的文本字符串



我正在尝试查找数据帧中任何列中出现单词"gas"、diesel或"ev"的实例(不区分大小写(。如果在这些列中找到这些单词的任何版本,我想在题为"fuel"的新列中输入燃料类型的缩写。

excerpt of my dataframe
SUMN                 SOUN               MATN   
Light duty vehicle   Diesel Tire wear   Rubber
Heavy duty diesel    Non-catalyst       Diesel
Light duty truck     catalyst           Gasoline
Medium duty vehicle  EV brake wear      brakes
What I'm hoping to output
SUMN                 SOUN               MATN      FUEL
Light duty vehicle   Diesel Tire wear   Rubber    DSL
Heavy duty diesel    Non-catalyst       Diesel    DSL
Light duty truck     catalyst           Gasoline  GAS
Medium duty vehicle  EV brake wear      brakes    ELEC

我该如何做到这一点?

我一开始就可以在一列中查找一种类型的字符串,但却被如何越过这一点所困扰。

df['FUEL'] = df['SUMN'].str.contains('diesel', case=False)

这里有一种方法,使用applystr.contains来检查每个单词的所有列。最后,我们将单词映射到正确的单词,例如ev -> ELECT

注意,我在正则表达式中使用了?i,这使得它不区分大小写:

words = ['gas', 'diesel', 'ev']
mapping = {'gas':'GAS', 'diesel':'DSL', 'ev':'ELEC'}
for word in words:
m = df.apply(lambda x: x.str.contains(f'(?i)({word})')).any(axis=1)
df.loc[m, 'FUEL'] = mapping[word]

输出

SUMN              SOUN      MATN  FUEL
0   Light duty vehicle  Diesel Tire wear    Rubber   DSL
1    Heavy duty diesel      Non-catalyst    Diesel   DSL
2     Light duty truck          catalyst  Gasoline   GAS
3  Medium duty vehicle     EV brake wear    brakes  ELEC

肯定有一个更优化的解决方案,但希望这能让你走上正确的道路。。。基本上循环通过每一行,循环通过列和潜在的燃料串,并决定使用哪个缩写:

d={'diesel':'DSL','gasoline':'GAS','ev':'ELEC'}
df['all'] = df.apply(''.join, axis=1)
for i,row in df.iterrows():
df.at[i,'FUEL'] = d[[key for key in d.keys() if key in row['all'].lower()][0]]
del df['all']

输出:

SUMN              SOUN      MATN  FUEL
0   Light duty vehicle  Diesel Tire wear    Rubber   DSL
1    Heavy duty diesel      Non-catalyst    Diesel   DSL
2     Light duty truck          catalyst  Gasoline   GAS
3  Medium duty vehicle     EV brake wear    brakes  ELEC

这假设每排中只有一种燃料类型

编辑:灵感来自另一个解决方案:

import re
d={'diesel':'DSL','gasoline':'GAS','ev':'ELEC'}
df['FUEL'] = df.apply(lambda x: d[re.search('gasoline|diesel|ev',''.join(x).lower()).group()], axis=1)

相同输出:(

相关内容

最新更新