我有两个数据帧,大致如下:
data1 = pd.DataFrame({'transaction_id': {0: abc, 1: bcd, 2: efg},
'store_number': {0: '1048', 1: '1048', 2: '1048'},
'activity_code': {0: 'deposit-check',
1: 'deposit-check',
2: 'deposit-check'},
'amount': {0: 10, 1: 11, 2: 12}})
data2 = pd.DataFrame({'transaction_id': {0: pqr, 1: qrs, 2: rst},
'store_number': {0: '1048', 1: '1048', 2: '1048'},
'activity_code': {0: 'deposit-check',
1: 'deposit-check',
2: 'deposit-check'},
'amount': {0: 100, 1: 200, 2: 300}})
具有更多行。
我想从每个数据集中取多个子集,并对每个数据集中的总量进行比较。例如,从data1
和data2
:中取出2行
data1_subset1 = pd.DataFrame({'transaction_id': {0: abc, 1: bcd},
'store_number': {0: '1048', 1: '1048'},
'activity_code': {0: 'deposit-check',
1: 'deposit-check'},
'amount': {0: 10, 1: 11}})
data1_subset2 = pd.DataFrame({'transaction_id': {0: abc, 2: efg},
'store_number': {0: '1048', 2: '1048'},
'activity_code': {0: 'deposit-check',
2: 'deposit-check'},
'amount': {0: 10, 2: 12}})
依此类推,直到我有了data1的所有可能的2行组合。
data2_subset1 = pd.DataFrame({'transaction_id': {0: pqr, 1: qrs},
'store_number': {0: '1048', 1: '1048'},
'activity_code': {0: 'deposit-check',
1: 'deposit-check'},
'amount': {0: 100, 1: 200}})
data2_subset2 = pd.DataFrame({'transaction_id': {0: pqr, 2: rst},
'store_number': {0: '1048', 2: '1048'},
'activity_code': {0: 'deposit-check',
2: 'deposit-check'},
'amount': {0: 100, 2: 300}})
依此类推,直到我有了data2的所有可能的2行组合。
现在,对于这些子集中的每一个子集,比如data1_subset1
与data2_subset1
,我想使用内部联接来比较store_number
和activity_code
是否匹配,然后检查总amount
与data1_subset1
与data2_subset1
之间的差异。
此外,我还想将此扩展到所有可能的尺寸组合。在上面的例子中,我们比较了所有2行的组合。但我想将其扩展到2行组合对3行组合,2行对4,3行对5,依此类推,直到所有的可能性都得到检查。
在Python / Pandas
中有没有一种有效的方法来做到这一点。我脑海中的第一种方法只是使用索引的嵌套循环。
使用itertools.combinations
:
from itertools import combinations
for comb in combinations(data1.index, r=2):
print(f'combination {comb}')
print(data1.loc[list(comb)])
作为一个函数:
def subset(df, r=2):
for comb in combinations(df.index, r=r):
yield df.loc[list(comb)]
for df in subset(data1, r=2):
print(df)
输出:
combination (0, 1)
transaction_id store_number activity_code amount
0 abc 1048 deposit-check 10
1 bcd 1048 deposit-check 11
combination (0, 2)
transaction_id store_number activity_code amount
0 abc 1048 deposit-check 10
2 efg 1048 deposit-check 12
combination (1, 2)
transaction_id store_number activity_code amount
1 bcd 1048 deposit-check 11
2 efg 1048 deposit-check 12
如果希望组合中有更多的行,请将r=2
参数更改为所需的行数。