我有一个主数据帧(main_df(,类似于:
A B X Y Id1
0 cat cat1 catabc 0.1 uuid01
1 cat cat1 catxyz 0.4 uuid02
2 cat cat2 catpqr 0.5 uuid01
3 dog dog1 dogxyz 0.3 uuid03
4 dog dog2 dogpqr 0.2 uuid02
5 dog dog2 dogabc 0.8 uuid01
还有另一个有助于映射的数据帧(map_df(:
A B X Y Id2
0 cat cat1 catxyz 0.4 nx01
1 cat cat1 NaN NaN nx02
2 cat NaN NaN NaN nx03
3 dog dog1 dogxyz NaN nx04
4 dog NaN NaN NaN nx05
列的映射优先级为:['A', 'B', 'X', 'Y']
。
因此,如果一行map_df
的所有单元格与main_df
的任何一行匹配,则应将Id2
的相应元素添加到main_df
。如果不存在Y
,则映射应当继续进行['A', 'B', 'X']
;如果X
也不存在,则应继续执行['A', 'B']
,依此类推。
我能够通过数据帧合并实现部分结果。例如:
main_df = main_df.merge(map_df, how='left', on=['A', 'B', 'X', 'Y']))
但我无法在这里找出基于优先级的角度。
作为映射的结果,数据帧(result_df(应该类似于:
A B X Y Id1 Id2
0 cat cat1 catabc 0.1 uuid01 nx02
1 cat cat1 catxyz 0.4 uuid02 nx01
2 cat cat2 catpqr 0.5 uuid01 nx03
3 dog dog1 dogxyz 0.3 uuid03 nx04
4 dog dog2 dogpqr 0.2 uuid02 nx05
5 dog dog2 dogabc 0.8 uuid01 nx05
这将解决问题。不过不是很漂亮。
这是一种迭代算法,它先向前遍历数据,然后再向后遍历数据。
正向传递以逐渐降低的优先级解决合并,从而从数据结构中消除匹配。它使用inner merge
策略。
向后传递将输出从最低优先级更新为最高优先级。我注意到最后的update
调用有点慢。
merge_on = ['A', 'B', 'X', 'Y']
tot_cols = len(merge_on)
operation_main_df = main_df.copy()
outputs = []
# forward pass on progressively smaller sets
for first_idx in range(len(merge_on)):
merged_id2 = operation_main_df.merge(map_df, how='inner',
on=merge_on[0:tot_cols-first_idx],right_index=True,suffixes={"","_y"})
# the relevant output has the right number of NaN columns, the rest is garbage.
outputs.append(merged_id2[merged_id2.isna().sum(axis=1)==first_idx])
# backward updating pass
reverse_it = iter(outputs[::-1])
out = next(reverse_it)[['A','B','X','Y','Id1','Id2']].copy()
for el in reverse_it:
out.update(el)
输出:
A B X Y Id1 Id2
0 cat cat1 catabc 0.1 uuid01 nx02
1 cat cat1 catxyz 0.4 uuid02 nx01
2 cat cat2 catpqr 0.5 uuid01 nx03
3 dog dog1 dogxyz 0.3 uuid03 nx04
4 dog dog2 dogpqr 0.2 uuid02 nx05
5 dog dog2 dogabc 0.8 uuid01 nx05
编辑:在我的机器上加速10倍
# change
out = next(reverse_it)[['A','B','X','Y','Id1','Id2']]
# into
out = next(reverse_it)[['A','B','X','Y','Id1','Id2']].copy()
使用SQL引擎如何处理复杂查询的概念。生成一个偏笛卡尔乘积,然后将结果向下过滤到所需的集合。
在这种情况下,计算出的匹配(重量(
result_df = (main_df
# minimum match - just A, effectivaly a partial catersian producr
.merge(map_df, on=["A"], suffixes=("","_m"))
# calculate weight of match. NaN is neutral, different penalise, same reward
.assign(
B_w=lambda dfa: np.where(dfa["B_m"].isna(), 0, np.where(dfa["B"]==dfa["B_m"],1,-1)*1000),
X_w=lambda dfa: np.where(dfa["X_m"].isna(), 0, np.where(dfa["X"]==dfa["X_m"],1,-1)*100),
Y_w=lambda dfa: np.where(dfa["Y_m"].isna(), 0, np.where(dfa["Y"]==dfa["Y_m"],1,-1)*10),
w=lambda dfa: dfa.loc[:,["B_w","X_w","Y_w"]].sum(axis=1)
)
# biggest weight is one we want
.sort_values(["A","B","X","Y","w"], ascending=[True,True,True,True,False])
.groupby(["A","B","X","Y"]).first()
# # cleanup and don't show workings...
.reset_index()
.loc[:,["A","B","X","Y","Id1","Id2"]]
)
在这种情况下,它是可用的最长密钥。
main_df = pd.read_csv(io.StringIO(""" A B X Y Id1
0 cat cat1 catabc 0.1 uuid01
1 cat cat1 catxyz 0.4 uuid02
2 cat cat2 catpqr 0.5 uuid01
3 dog dog1 dogxyz 0.3 uuid03
4 dog dog2 dogpqr 0.2 uuid02
5 dog dog2 dogabc 0.8 uuid01"""), sep="s+")
map_df = pd.read_csv(io.StringIO(""" A B X Y Id2
0 cat cat1 catxyz 0.4 nx01
1 cat cat1 NaN NaN nx02
2 cat NaN NaN NaN nx03
3 dog dog1 dogxyz NaN nx04
4 dog NaN NaN NaN nx05"""), sep="s+")
# result_df =
result_df = (main_df
# minimum match - just A, effectively a partial catersian product
.merge(map_df, on=["A"], suffixes=("","_m"))
# simpler to use empty string for NaN
.fillna("")
# synthetic column of rest of key and length
.assign(c=lambda dfa: dfa["B_m"]+dfa["X_m"]+dfa["Y_m"].astype(str),
l=lambda dfa: dfa["c"].str.len())
# longest synthetic key is one we want
.sort_values(["A","B","X","Y","l"], ascending=[True,True,True,True,False])
.groupby(["A","B","X","Y"]).first()
# cleanup and don't show workings...
.reset_index()
.loc[:,["A","B","X","Y","Id1","Id2"]]
)
输出(使用加权法(
A B X Y Id1 Id2
0 cat cat1 catabc 0.1 uuid01 nx02
1 cat cat1 catxyz 0.4 uuid02 nx01
2 cat cat2 catpqr 0.5 uuid01 nx03
3 dog dog1 dogxyz 0.3 uuid03 nx04
4 dog dog2 dogabc 0.8 uuid01 nx05
5 dog dog2 dogpqr 0.2 uuid02 nx05