I have been struggling with this problem all week. I have two dataframes, as shown below:
DF1:
Account| ID | Name
--------------------------------------
B36363 | 2019001 | John
G47281 | 2019002;2018101 | Alice;Emma
H46291 | 2019001 | John
DF2:
Account | Col_B | Col_C
-----------------------------
B36363-0 | text_b1 | text_c1
01_G47281 | text_b2 | text_c2
X_H46291 | text_b3 | text_c3
II_G47281 | text_b4 | text_C4
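I want to merge these dataframes on Account wherever df2.Account contains df1.Account (unlike a normal merge/join, the keys do not match exactly!).
Expected output: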
DF3:
Account | Col_B | Col_C | ID | Name
--------------------------------------------------------------
B36363-0 | text_b1 | text_c1 | 2019001 | John
01_G47281 | text_b2 | text_c2 | 2019002;2018101 | Alice;Emma
X_H46291 | text_b3 | text_c3 | 2019001 | John
II_G47281 | text_b4 | text_C4 | 2019002;2018101 | Alice;Emma
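I don't have any sample code because I don't know how to approach this. A normal merge/join works fine, but not when I want to match on "contains". Thanks a lot in advance.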
You can try str.extract with join():
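# Map each df1 account to its [ID, Name] list
d = df1.set_index('Account').agg(list, axis=1).to_dict()
# Regex alternation of all df1 accounts: '(B36363|G47281|H46291)'
p = '({})'.format('|'.join(df1.Account))
# Extract the matching account from df2.Account and look up its [ID, Name]
m = pd.DataFrame(df2.Account.str.extract(p, expand=False).map(d).fillna('').tolist(),
                 columns=['ID', 'Name'], index=df2.index)
df2.join(m)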
Account Col_B Col_C ID Name
1 B36363-0 text_b1 text_c1 2019001 John
2 01_G47281 text_b2 text_c2 2019002;2018101 Alice;Emma
3 X_H46291 text_b3 text_c3 2019001 John
4 II_G47281 text_b4 text_C4 2019002;2018101 Alice;Emma
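One caveat with building the pattern by joining the raw account strings: if an account ever contains a regex metacharacter (such as `.` or `+`), or if one account is a prefix of another, the alternation may match the wrong thing. The sketch below is one way to guard against that, assuming the sample data from the question; the `_key` helper column and the longest-first ordering are my own additions, not part of the answer above.

```python
import re
import pandas as pd

# Sample data copied from the question.
df1 = pd.DataFrame({
    "Account": ["B36363", "G47281", "H46291"],
    "ID": ["2019001", "2019002;2018101", "2019001"],
    "Name": ["John", "Alice;Emma", "John"],
})
df2 = pd.DataFrame({
    "Account": ["B36363-0", "01_G47281", "X_H46291", "II_G47281"],
    "Col_B": ["text_b1", "text_b2", "text_b3", "text_b4"],
    "Col_C": ["text_c1", "text_c2", "text_c3", "text_C4"],
})

# Escape each account so regex metacharacters are matched literally, and put
# longer accounts first so a full account wins over a shorter prefix of it.
accounts = sorted(df1["Account"], key=len, reverse=True)
pattern = "(" + "|".join(re.escape(a) for a in accounts) + ")"

# Extract the df1 account embedded in each df2 account (NaN when none matches).
key = df2["Account"].str.extract(pattern, expand=False)

# A left merge keeps df2's row order and keeps unmatched rows (NaN in ID/Name).
# "_key" is a hypothetical temporary column name used only for the merge.
df3 = (
    df2.assign(_key=key)
    .merge(df1.rename(columns={"Account": "_key"}), on="_key", how="left")
    .drop(columns="_key")
)
print(df3)
```

Because the merge is `how="left"`, any df2 row whose account contains none of the df1 accounts stays in the result with missing ID and Name, which makes unmatched rows easy to spot.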
Using my fuzzy_merge function:
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
# Parentheses are needed so the chained calls can span several lines
df3 = (
    fuzzy_merge(df2, df1, 'Account', 'Account', threshold=80)
    .merge(df1, left_on='matches', right_on='Account', suffixes=('', '_2'))
    .drop(columns=['matches', 'Account_2'])
)
Output:
Account Col_B Col_C ID Name
0 B36363-0 text_b1 text_c1 2019001 John
1 01_G47281 text_b2 text_c2 2019002;2018101 Alice;Emma
2 II_G47281 text_b4 text_C4 2019002;2018101 Alice;Emma
3 X_H46291 text_b3 text_c3 2019001 John
The fuzzy_merge function from the linked answer:
def fuzzy_merge(df_1, df_2, key1, key2, threshold=90, limit=2):
    """
    df_1 is the left table to join
    df_2 is the right table to join
    key1 is the key column of the left table
    key2 is the key column of the right table
    threshold is how close the matches should be to return a match
    limit is the amount of matches will get returned, these are sorted high to low
    """
    s = df_2[key2].tolist()

    m = df_1[key1].apply(lambda x: process.extract(x, s, limit=limit))
    df_1['matches'] = m

    m2 = df_1['matches'].apply(lambda x: ', '.join([i[0] for i in x if i[1] >= threshold]))
    df_1['matches'] = m2

    return df_1
Try str.extract on df2.Account, set the result as the index of df2, and join:
pat1 = '('+'|'.join(df1.Account)+')'
s = df2.Account.str.extract(pat1, expand=False)
df2.set_index(s).join(df1.set_index('Account')).reset_index(drop=True)
Out[644]:
Account Col_B Col_C ID Name
0 B36363-0 text_b1 text_c1 2019001 John
1 01_G47281 text_b2 text_c2 2019002;2018101 Alice;Emma
2 II_G47281 text_b4 text_C4 2019002;2018101 Alice;Emma
3 X_H46291 text_b3 text_c3 2019001 John
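Another way is to use merge:
# Parentheses let the chain span several lines; drop(columns=...) replaces the
# positional axis argument, which recent pandas versions no longer accept
(df2.assign(Account2=df2.Account.str.extract(pat1, expand=False))
    .merge(df1, left_on='Account2', right_on='Account', suffixes=('', 'y'))
    .drop(columns=['Account2', 'Accounty']))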
Out[645]:
Account Col_B Col_C ID Name
0 B36363-0 text_b1 text_c1 2019001 John
1 01_G47281 text_b2 text_c2 2019002;2018101 Alice;Emma
2 II_G47281 text_b4 text_C4 2019002;2018101 Alice;Emma
3 X_H46291 text_b3 text_c3 2019001 John
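If you would rather avoid regular expressions altogether, the "contains" condition can also be expressed literally with a cross join followed by a substring filter. This is only a sketch using the sample data from the question, not something proposed in the answers above: it assumes pandas 1.2 or newer (for `how="cross"`), and the `_df1` suffix is an arbitrary name I picked.

```python
import pandas as pd

# Sample data copied from the question.
df1 = pd.DataFrame({
    "Account": ["B36363", "G47281", "H46291"],
    "ID": ["2019001", "2019002;2018101", "2019001"],
    "Name": ["John", "Alice;Emma", "John"],
})
df2 = pd.DataFrame({
    "Account": ["B36363-0", "01_G47281", "X_H46291", "II_G47281"],
    "Col_B": ["text_b1", "text_b2", "text_b3", "text_b4"],
    "Col_C": ["text_c1", "text_c2", "text_c3", "text_C4"],
})

# Pair every df2 row with every df1 row; df1's Account becomes "Account_df1".
pairs = df2.merge(df1, how="cross", suffixes=("", "_df1"))

# Keep only the pairs where the df1 account is a substring of the df2 account.
mask = [short in long for long, short in zip(pairs["Account"], pairs["Account_df1"])]

df3 = pairs.loc[mask].drop(columns="Account_df1").reset_index(drop=True)
print(df3)
```

The intermediate table has len(df1) * len(df2) rows, so this is fine for small frames like the ones here but will probably be too slow or memory-hungry for large ones, where the str.extract approach scales better.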