贪心法根据相交值的总数迭代配对两个数据帧的列



我有两个不同的数据帧,如下所示。

df1 = pd.DataFrame([[2., 1., 1., 3.],
[2., 1., 1., 1.],
[2., 1., 1., 0.],
[2., 1., 1., 3.]])
df2 =  pd.DataFrame([[2., 0., 1., 1.],
[0., 1., 1., 1.],
[0., 0., 1., 1.],
[2., 2., 1., 1.]] )

我的问题可以通过以下步骤解决。

Step1:我们需要计算df1中所有列值与df2的交集(∩)

df1_c0_intersection_df2 = [[2∩2, 2∩0, 2∩1, 2∩1],
[2∩0, 2∩1, 2∩1, 2∩1],
[2∩0, 2∩0, 2∩1, 2∩1],
[2∩2, 2∩2, 2∩1, 2∩1]]

Step2:如果交集为null,则为False,否则为True,如下所示

df1_c0_intersection_df2_bool = [[True, False, False, False],
[False, False, False, False],
[False, False, False, False],
[True, True, False, False]]

Step3:然后计算每列中True的总数-列表如下所示:

df1_c0_intersection_df2_bool_count = [2, 1, 0, 0]

从上面的列表中可以看出,df1中的第一列与df2中的第一列是基于最高的交叉计数得分进行匹配的。

Step4: df1中的第一列和df2中的第一列将被选择退出。然后将重复相同的列2,依此类推。

df1_c1_intersection_df2 = [[1∩0, 1∩1, 1∩1],
[1∩1, 1∩1, 1∩1],
[1∩2, 1∩1, 1∩1]]
df1_c1_intersection_df2_bool = [[False, True, True],
[True, True, True],
[False, True, True]]
df1_c1_intersection_df2_bool_count = [1, 3, 3]

那么我们将选择其中的任何一个,因为有两个相同的分数。等等......最后,它将返回列的布尔计数的所有最大值的列表。例如,上面的final_list = [max(df1_c0_intersection_df2_bool_count), max(df1_c1_intersection_df2_bool_count)] = [2,3].

我尝试了如下方法。

ints_list = []
for i in range(0, len(df1.T)):
for j in range(0, len(df2.T)):
res = np.intersect1d(df1.iloc[:, :i].values, df2.iloc[:, :j].values)
ints_list.append(res)

为了帮助您理解我所说的内容,下面是步骤1到步骤3的一半解决方案。

DataFrame设置/初始化

import pandas as pd
import numpy as np
df1 = pd.DataFrame([[2., 1., 1., 3.],
[2., 1., 1., 1.],
[2., 1., 1., 0.],
[2., 1., 1., 3.]])
df2 =  pd.DataFrame([[2., 0., 1., 1.],
[0., 1., 1., 1.],
[0., 0., 1., 1.],
[2., 2., 1., 1.]] )
print (df1)
print (df2)

df1:

0    1    2    3
0  2.0  1.0  1.0  3.0
1  2.0  1.0  1.0  1.0
2  2.0  1.0  1.0  0.0
3  2.0  1.0  1.0  3.0

df2:

0    1    2    3
0  2.0  0.0  1.0  1.0
1  0.0  1.0  1.0  1.0
2  0.0  0.0  1.0  1.0
3  2.0  2.0  1.0  1.0

步骤1 &2:创建交集布尔DataFrame h1> 为我在迭代相同的dataframe,所以我可以同时做这两件事。
df1_df2_l = {} #will store the key:val pair for df_list
df1_df2_b = {} #will store the key:val pair for df_bool
df1_df2_bool_count = []
for col1 in df1.columns:
bool_counts = [] #to store bool counts for each df1 column
c1 = df1[col1].tolist() #create the list of values for the df1 column
for col2 in df2.columns:
c2 = df2[col2].tolist() #create the list of values for the df2 column
# do the intersection check per your requirement
# If found, store the match, else NaN
z_list = [x if x==y else np.NaN for x,y in zip(c1,c2)] 
# do the intersection boolean check per your requirement
# If found, store True else False
z_bool = [x==y for x,y in zip(c1,c2)]
#Now capture this information into the dictionary as key:val pair
df1_df2_l[str(col1)+'_'+str(col2)] = z_list
df1_df2_b[str(col1)+'_'+str(col2)] = z_bool
#store bool_count for each column match with df2
bool_counts.append(sum(z_bool))
#store the bool counts for df1 column as a row in bool_count list
df1_df2_bool_count.append(bool_counts)

#Now that you have this information, convert it into a dataframe
df1_df2_list = pd.DataFrame(df1_df2_l)
df1_df2_bool = pd.DataFrame(df1_df2_b)

输出如下所示:

坳的名字:

0_0 = first col in df1 and first col in df2

1_3 = df1中的第二col和df2中的第四col

step1 Results:

0_0  0_1  0_2  0_3  1_0  1_1  1_2  ...  2_1  2_2  2_3  3_0  3_1  3_2  3_3
0  2.0  NaN  NaN  NaN  NaN  NaN  1.0  ...  NaN  1.0  1.0  NaN  NaN  NaN  NaN
1  NaN  NaN  NaN  NaN  NaN  1.0  1.0  ...  1.0  1.0  1.0  NaN  1.0  1.0  1.0
2  NaN  NaN  NaN  NaN  NaN  NaN  1.0  ...  NaN  1.0  1.0  0.0  0.0  NaN  NaN
3  2.0  2.0  NaN  NaN  NaN  NaN  1.0  ...  NaN  1.0  1.0  NaN  NaN  NaN  NaN

step2 Results:

0_0    0_1    0_2    0_3    1_0  ...   2_3    3_0    3_1    3_2    3_3
0   True  False  False  False  False  ...  True  False  False  False  False
1  False  False  False  False  False  ...  True  False   True   True   True
2  False  False  False  False  False  ...  True   True   True  False  False
3   True   True  False  False  False  ...  True  False  False  False  False

步骤3:df_bool

中的值计数
#print the bool counts
print (df1_df2_bool_count)
#You dont need to do the below part as we calculated the bool count already
#df1_df2_bool_count = [sum(df1_df2_bool[cols].tolist()) for cols in df1_df2_bool.columns]
#print (df1_df2_bool_count)

它的输出将是:

#new ask to store data per column
[[2, 1, 0, 0], [0, 1, 4, 4], [0, 1, 4, 4], [1, 2, 1, 1]]
#older result without separating for each column
#[2, 1, 0, 0, 0, 1, 4, 4, 0, 1, 4, 4, 1, 2, 1, 1]

你是否期待这样的解决方案:

df1[0] matches best with df2[0] #since 2 is highest of 2, 1, 0, 0 (2 is in 1st (index 0) position)
df1[1] matches best with df2[2] #since 4 is highest of 0, 1, 4, 4 (4 is in 3rd (index 2) position)
df1[2] matches best with df2[3] #since 4 is highest of 0, 1, 4, 4 (4 is in 3rd (index 2) and 4th (index 3) position. We already picked 3rd, so we pick 4th (index 3)
df1[3] matches best with df2[1] #since 2 is highest of 1, 2, 1, 1 (2 is in 2nd (index 1) position)

这些是我们可以找到的df1和df2中每个列的最佳匹配。

最新更新