How to merge data with the pandas merge function when the merge key is not unique



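I have a large amount of data in two files that I want to merge, but the merge key 'ChannelPartnerID' does not have a unique value in every row of the datasets. I want it to merge with the first matching row in the second dataset and assign NaN values to the remaining rows. Here is a sample of the data:
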
import pandas as pd 
t1 = {'ChannelPartnerID': [18263,50429,98593,44804,81015,15273,51484,87695,33906,70117],
'response': [1, 0, 1, 0, 0, 1, 1, 0, 0, 0],
'n_comp': [2, 1, 0, 4, 4, 2, 1, 3, 3, 5],
'loyalty': [1, 1, 1, 1, 1, 1, 1, 0, 1, 1],
'portal': [1, 1, 0, 1, 1, 1, 0, 1, 0, 1],
'rewards': [0, 1, 0, 1, 1, 1, 0, 0, 1, 1],
'nps': [7, 3, 9, 2, 2, 5, 6, 8, 2, 8],
'n_yrs': [8, 3, 6, 5, 2, 7, 6, 3, 3, 6]}
t2 = {'ChannelPartnerID': [18263,10000,18263,33906,10000,81015,10001,81015,10001,33906],
'Month': [5, 10, 8, 10, 1, 4, 3, 10, 12, 7],
'Year': [2019, 2019, 2019, 2019, 2018, 2018, 2019, 2018, 2019, 2019],
'Brand': ['B4', 'B5', 'B1', 'B3', 'B2', 'B2', 'B2', 'B5', 'B2', 'B1'],
'Sales': [21793, 7155, 29630, 1530, 3965, 34608, 39256, 14612, 2902, 6122]}
t1 = pd.DataFrame(t1)
t1

Output:

ChannelPartnerID    response    n_comp  loyalty portal  rewards nps n_yrs
0   18263   1   2   1   1   0   7   8
1   50429   0   1   1   1   1   3   3
2   98593   1   0   1   0   0   9   6
3   44804   0   4   1   1   1   2   5
4   81015   0   4   1   1   1   2   2
5   15273   1   2   1   1   1   5   7
6   51484   1   1   1   0   0   6   6
7   87695   0   3   0   1   0   8   3
8   33906   0   3   1   0   1   2   3
9   70117   0   5   1   1   1   8   6

t2 = pd.DataFrame(t2)
t2

Output:

ChannelPartnerID    Month   Year    Brand   Sales
0   18263   5   2019    B4  21793
1   10000   10  2019    B5  7155
2   18263   8   2019    B1  29630
3   33906   10  2019    B3  1530
4   10000   1   2018    B2  3965
5   81015   4   2018    B2  34608
6   10001   3   2019    B2  39256
7   81015   10  2018    B5  14612
8   10001   12  2019    B2  2902
9   33906   7   2019    B1  6122

pd.merge(t1, t2, on='ChannelPartnerID', how='left')

Output:

ChannelPartnerID    response    n_comp  loyalty portal  rewards nps n_yrs   Month   Year    Brand   Sales
0   18263   1   2   1   1   0   7   8   5.0 2019.0  B4  21793.0
1   18263   1   2   1   1   0   7   8   8.0 2019.0  B1  29630.0
2   50429   0   1   1   1   1   3   3   NaN NaN NaN NaN
3   98593   1   0   1   0   0   9   6   NaN NaN NaN NaN
4   44804   0   4   1   1   1   2   5   NaN NaN NaN NaN
5   81015   0   4   1   1   1   2   2   4.0 2018.0  B2  34608.0
6   81015   0   4   1   1   1   2   2   10.0    2018.0  B5  14612.0
7   15273   1   2   1   1   1   5   7   NaN NaN NaN NaN
8   51484   1   1   1   0   0   6   6   NaN NaN NaN NaN
9   87695   0   3   0   1   0   8   3   NaN NaN NaN NaN
10  33906   0   3   1   0   1   2   3   10.0    2019.0  B3  1530.0
11  33906   0   3   1   0   1   2   3   7.0 2019.0  B1  6122.0
12  70117   0   5   1   1   1   8   6   NaN NaN NaN NaN

What I want:

ChannelPartnerID    response    n_comp  loyalty portal  rewards nps n_yrs   Month   Year    Brand   Sales
0   18263   1   2   1   1   0   7   8   5.0 2019.0  B4  21793.0
1   18263   NaN NaN NaN NaN NaN NaN NaN 8.0 2019.0  B1  29630.0
2   50429   0   1   1   1   1   3   3   NaN NaN NaN NaN
3   98593   1   0   1   0   0   9   6   NaN NaN NaN NaN
4   44804   0   4   1   1   1   2   5   NaN NaN NaN NaN
5   81015   0   4   1   1   1   2   2   4.0 2018.0  B2  34608.0
6   81015   NaN NaN NaN NaN NaN NaN NaN 10.0 2018.0 B5  14612.0
7   15273   1   2   1   1   1   5   7   NaN NaN NaN NaN
8   51484   1   1   1   0   0   6   6   NaN NaN NaN NaN
9   87695   0   3   0   1   0   8   3   NaN NaN NaN NaN
10  33906   0   3   1   0   1   2   3   10.0 2019.0 B3  1530.0
11  33906   NaN NaN NaN NaN NaN NaN NaN 7.0 2019.0  B1  6122.0
12  70117   0   5   1   1   1   8   6   NaN NaN NaN NaN

You can use the following approach:

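# First t2 row for each ChannelPartnerID (the original index is kept)
t2_first = t2.groupby('ChannelPartnerID', as_index=False).nth(0)
# Remaining t2 rows, restricted to keys that also appear in t1
t2_rest = t2[~t2.index.isin(t2_first.index)]
t2_rest = t2_rest[t2_rest['ChannelPartnerID'].isin(t1['ChannelPartnerID'])]
merge_1 = t1.merge(t2_first, on='ChannelPartnerID', how='left')
# DataFrame.append was removed in pandas 2.0, so use pd.concat;
# a stable sort keeps each first match ahead of its remaining rows
result = pd.concat([merge_1, t2_rest]).sort_values('ChannelPartnerID', kind='stable')
result = result.reset_index(drop=True)
result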
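
An alternative to the approach above, if the goal is to keep the row order of t1 exactly as in the desired output, might be to do the plain left merge and then blank out the t1 columns on the repeated rows. The following is only a sketch and assumes ChannelPartnerID is unique within t1 (true for the sample data); the names merged, dup and t1_cols are introduced here purely for illustration.

```python
import pandas as pd

t1 = pd.DataFrame({
    'ChannelPartnerID': [18263, 50429, 98593, 44804, 81015, 15273, 51484, 87695, 33906, 70117],
    'response': [1, 0, 1, 0, 0, 1, 1, 0, 0, 0],
    'n_comp': [2, 1, 0, 4, 4, 2, 1, 3, 3, 5],
    'loyalty': [1, 1, 1, 1, 1, 1, 1, 0, 1, 1],
    'portal': [1, 1, 0, 1, 1, 1, 0, 1, 0, 1],
    'rewards': [0, 1, 0, 1, 1, 1, 0, 0, 1, 1],
    'nps': [7, 3, 9, 2, 2, 5, 6, 8, 2, 8],
    'n_yrs': [8, 3, 6, 5, 2, 7, 6, 3, 3, 6]})
t2 = pd.DataFrame({
    'ChannelPartnerID': [18263, 10000, 18263, 33906, 10000, 81015, 10001, 81015, 10001, 33906],
    'Month': [5, 10, 8, 10, 1, 4, 3, 10, 12, 7],
    'Year': [2019, 2019, 2019, 2019, 2018, 2018, 2019, 2018, 2019, 2019],
    'Brand': ['B4', 'B5', 'B1', 'B3', 'B2', 'B2', 'B2', 'B5', 'B2', 'B1'],
    'Sales': [21793, 7155, 29630, 1530, 3965, 34608, 39256, 14612, 2902, 6122]})

# Ordinary left merge: one output row per (t1 row, matching t2 row) pair
merged = t1.merge(t2, on='ChannelPartnerID', how='left')

# True for the second and later rows of each key
# (assumption: keys are unique in t1, so repeats come only from t2)
dup = merged.duplicated('ChannelPartnerID')

# Replace the t1 columns with NaN on those repeated rows,
# leaving the key and the t2 columns untouched
t1_cols = [c for c in t1.columns if c != 'ChannelPartnerID']
for col in t1_cols:
    merged[col] = merged[col].mask(dup)

print(merged)
```

Because NaN cannot be stored in an integer column, the masked t1 columns come back as floats.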

However, I am not sure why you want this table structure. If what you want is a table containing only the first match, merge_1 may be enough.
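
If only the first match is needed, one possible sketch is to drop the repeated keys from t2 before merging. The example below uses a trimmed copy of the sample data (t1 reduced to two columns for brevity), and the names t2_first and first_only are illustrative. The validate argument is optional; it asks pandas to raise an error if the keys turn out not to be unique on either side.

```python
import pandas as pd

# Trimmed copy of the sample data: only two t1 columns are kept for brevity
t1 = pd.DataFrame({
    'ChannelPartnerID': [18263, 50429, 98593, 44804, 81015, 15273, 51484, 87695, 33906, 70117],
    'response': [1, 0, 1, 0, 0, 1, 1, 0, 0, 0]})
t2 = pd.DataFrame({
    'ChannelPartnerID': [18263, 10000, 18263, 33906, 10000, 81015, 10001, 81015, 10001, 33906],
    'Month': [5, 10, 8, 10, 1, 4, 3, 10, 12, 7],
    'Year': [2019, 2019, 2019, 2019, 2018, 2018, 2019, 2018, 2019, 2019],
    'Brand': ['B4', 'B5', 'B1', 'B3', 'B2', 'B2', 'B2', 'B5', 'B2', 'B1'],
    'Sales': [21793, 7155, 29630, 1530, 3965, 34608, 39256, 14612, 2902, 6122]})

# Keep only the first t2 row for each ChannelPartnerID
t2_first = t2.drop_duplicates('ChannelPartnerID', keep='first')

# The key is now unique on both sides, so the merge returns one row per t1 row;
# validate='one_to_one' makes pandas check that uniqueness
first_only = t1.merge(t2_first, on='ChannelPartnerID', how='left',
                      validate='one_to_one')

print(first_only)
```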
