Compare string columns of two Python pandas DataFrames to find common strings and add them to a new column

I have the following two pandas DataFrames:

df1:             df2:
item_name        item_cleaned
abc xyz          Def
xuy DEF          Ghi
s GHI lsoe       Abc
p ABc ois
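
For anyone who wants to reproduce the problem, the two sample frames above could be built roughly like this. This is only a minimal sketch: the real df1 has more columns, which are omitted here.

```python
import pandas as pd

# Sample data copied from the tables above; only the relevant columns are included.
df1 = pd.DataFrame({"item_name": ["abc xyz", "xuy DEF", "s GHI lsoe", "p ABc ois"]})
df2 = pd.DataFrame({"item_cleaned": ["Def", "Ghi", "Abc"]})
```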

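I need to write a function that compares df2.item_cleaned with df1.item_name and checks whether any string in df2.item_cleaned occurs in df1.item_name (case-insensitively).

Where a string is found (regardless of case), I want to create a new column df1.item_final and put the matching df2.item_cleaned value into it for that row.

The output should look like this: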

df1:                                 df2:
item_name        item_final          item_cleaned
abc xyz          Abc                 Def
xuy DEF          Def                 Ghi
s GHI lsoe       Ghi                 Abc
p ABc ois        Abc

For reference, the df1 I need to clean has 12 columns and about 400,000 rows.

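Build a mapping obj_map whose keys are the lowercased item_cleaned values and whose values are the original item_cleaned values.
Use a regular expression with the re.IGNORECASE flag to extract the item_cleaned term from item_name.
Then lowercase the extracted part and map it through obj_map to get item_final.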
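
To see the idea in isolation, here is a minimal sketch of the same three steps on plain strings, using only the standard library. The names `lower_to_cleaned` and `canonical` are made up for illustration and are not part of the answer's code.

```python
import re

cleaned = ["Def", "Ghi", "Abc"]

# Step 1: lowercase key -> original spelling
lower_to_cleaned = {s.lower(): s for s in cleaned}

# Step 2: one case-insensitive alternation over the escaped terms
pattern = re.compile("|".join(re.escape(s) for s in cleaned), flags=re.IGNORECASE)

def canonical(text):
    """Return the cleaned spelling of the first term found in text, or None."""
    match = pattern.search(text)
    # Step 3: lowercase the matched text and look it up in the mapping
    return lower_to_cleaned[match.group(0).lower()] if match else None

results = [canonical(t) for t in ["abc xyz", "xuy DEF", "s GHI lsoe", "p ABc ois", "no match"]]
```

The pandas version applies the same lookup to a whole column at once instead of one string at a time.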

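import re
import pandas as pd

# unique, non-null cleaned names from df2
item_cleaned = df2['item_cleaned'].dropna().unique()
# lowercase key -> original spelling
obj_map = pd.Series(dict(zip(map(str.lower, item_cleaned), item_cleaned)))
# escape the special characters and join the terms into one alternation group
re_pat = '(%s)' % '|'.join([re.escape(i) for i in item_cleaned])
# expand=False returns a Series for the single capture group
df1['item_final'] = df1['item_name'].str.extract(re_pat, flags=re.IGNORECASE, expand=False)
df1['item_final'] = df1['item_final'].str.lower().map(obj_map)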

obj_map

def    Def
ghi    Ghi
abc    Abc
dtype: object

df1

    item_name item_final
0     abc xyz        Abc
1     xuy DEF        Def
2  s GHI lsoe        Ghi
3   p ABc ois        Abc
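
If it helps, the steps could be wrapped in a reusable function along the following lines. This is a sketch, not the answer's exact code: the function name `add_item_final` and the `whole_words` option are assumptions added for illustration. The option wraps the pattern in `\b` word boundaries so that a term is not matched inside a longer word, which only behaves as expected when the terms begin and end with word characters. Longer terms are placed first in the alternation so that they are preferred over shorter terms that are their prefixes.

```python
import re
import pandas as pd

def add_item_final(df1, df2, whole_words=False):
    """Return a copy of df1 with an item_final column taken from df2.item_cleaned."""
    cleaned = df2["item_cleaned"].dropna().astype(str).unique()
    # lowercase key -> original spelling
    lower_to_cleaned = {s.lower(): s for s in cleaned}
    # Escape regex metacharacters; put longer alternatives first
    alternatives = sorted((re.escape(s) for s in cleaned), key=len, reverse=True)
    body = "|".join(alternatives)
    pattern = rf"\b({body})\b" if whole_words else f"({body})"
    # expand=False gives a Series (NaN where nothing matched)
    extracted = df1["item_name"].str.extract(pattern, flags=re.IGNORECASE, expand=False)
    out = df1.copy()
    out["item_final"] = extracted.str.lower().map(lower_to_cleaned)
    return out

df1 = pd.DataFrame({"item_name": ["abc xyz", "xuy DEF", "s GHI lsoe", "p ABc ois"]})
df2 = pd.DataFrame({"item_cleaned": ["Def", "Ghi", "Abc"]})
result = add_item_final(df1, df2)
```

Rows with no matching term end up with NaN in item_final, and only the first match in each item_name is used.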
