Sampling a pandas DataFrame to match a second DataFrame within error bounds



Suppose I have two DataFrames, df1 and df2, as follows:

import pandas as pd
import numpy as np
df1 = pd.DataFrame([[0,1,100],[1,1.1,120],[2,0.8,102]],columns=['id','a','b'])
df2 = pd.DataFrame([[0,0.5,110],[1,1.05,94],[2,0.96,145],[3,0.86,112],[4,1.3,97]],
                   columns=['id','a','b'])
print(df1)
   id    a    b
0   0  1.0  100
1   1  1.1  120
2   2  0.8  102
print(df2)
   id     a    b
0   0  0.50  110
1   1  1.05   94
2   2  0.96  145
3   3  0.86  112
4   4  1.30   97

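Now suppose I choose tolerances da and db. For each row in df1, I want to randomly select one row from df2 such that abs(a1-a2)<da and abs(b1-b2)<db. What I am currently doing is very brute-force: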

da = 0.2
db = 25
df2_list = []
nbad = 0  # number of df1 rows with no candidate in df2
for rid, row in df1.iterrows():
    ca = row['a']
    cb = row['b']
    # rows of df2 within both tolerances of the current df1 row
    c_df2 = df2[(np.abs(df2['a'] - ca) < da) & (np.abs(df2['b'] - cb) < db)]
    if len(c_df2) == 0:
        nbad += 1
        continue
    c_df2 = c_df2.sample()
    df2_list.append(c_df2['id'].values[0])

matched_df = df2[df2['id'].isin(df2_list)]
print(matched_df)
   id     a    b
1   1  1.05   94
3   3  0.86  112
4   4  1.30   97

However, for my actual use case, where the DataFrames are large, this is very slow. Is there a faster way to achieve this result?

Here is one solution:

da = 0.2
db = 25
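# cross join: pair every row of df1 with every row of df2
res = pd.merge(df1.assign(dummy = 1), df2.assign(dummy = 1), on = "dummy").drop("dummy", axis = 1)
# keep only the pairs that fall within both tolerances
res = res[(np.abs(res.a_x - res.a_y) < da) & (np.abs(res.b_x - res.b_y) < db)]
# pick one random df2 candidate for each df1 row
res = res.groupby("id_x").apply(lambda x: x.sample(1))[["id_y", "a_y", "b_y"]]
res.index = res.index.droplevel(1)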
print(res)

The output is:

      id_y   a_y  b_y
id_x                 
0        1  1.05   94
1        4  1.30   97
2        3  0.86  112
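
A possible variation on the cross-join idea, offered only as a sketch: recent pandas versions support merge(how="cross") (pandas 1.2 or later) and DataFrameGroupBy.sample (pandas 1.1 or later), which avoid the dummy column and the apply call. Because a cross join materializes len(df1) * len(df2) rows, the sketch processes df1 in chunks to keep memory bounded. The helper name sample_matches, the chunk_size and random_state parameters, and the _1/_2 column suffixes are assumptions introduced here for illustration; they are not part of the original answer.

```python
import numpy as np
import pandas as pd


def sample_matches(df1, df2, da, db, chunk_size=1000, random_state=None):
    """Hypothetical helper: for each df1 row, pick one random df2 row
    with |a1 - a2| < da and |b1 - b2| < db. Assumes both frames have
    columns 'id', 'a', 'b' and that df1['id'] is unique."""
    rng = np.random.RandomState(random_state)
    picks = []
    # Cross-join df1 in chunks so the intermediate frame stays bounded.
    for start in range(0, len(df1), chunk_size):
        chunk = df1.iloc[start:start + chunk_size]
        pairs = chunk.merge(df2, how="cross", suffixes=("_1", "_2"))
        close = ((pairs["a_1"] - pairs["a_2"]).abs() < da) & (
            (pairs["b_1"] - pairs["b_2"]).abs() < db
        )
        pairs = pairs[close]
        if pairs.empty:
            continue
        # One random candidate per df1 row; rows without candidates drop out.
        picks.append(pairs.groupby("id_1").sample(n=1, random_state=rng))
    cols = ["id_1", "id_2", "a_2", "b_2"]
    if not picks:
        return pd.DataFrame(columns=cols)
    return pd.concat(picks, ignore_index=True)[cols]
```

Calling sample_matches(df1, df2, da, db) with the frames from the question should return at most one row per df1 id; any df1 row with no candidate within the tolerances is simply absent from the result, which plays the role of the nbad counter in the loop version.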
