我有两个不同的数据帧,如下所示。
df1 = pd.DataFrame([[2., 1., 1., 3.],
[2., 1., 1., 1.],
[2., 1., 1., 0.],
[2., 1., 1., 3.]])
df2 = pd.DataFrame([[2., 0., 1., 1.],
[0., 1., 1., 1.],
[0., 0., 1., 1.],
[2., 2., 1., 1.]] )
我的问题可以通过以下步骤解决。
Step1:我们需要计算df1中所有列值与df2的交集(∩)
df1_c0_intersection_df2 = [[2∩2, 2∩0, 2∩1, 2∩1],
[2∩0, 2∩1, 2∩1, 2∩1],
[2∩0, 2∩0, 2∩1, 2∩1],
[2∩2, 2∩2, 2∩1, 2∩1]]
Step2:如果交集为null,则为False,否则为True,如下所示
df1_c0_intersection_df2_bool = [[True, False, False, False],
[False, False, False, False],
[False, False, False, False],
[True, True, False, False]]
Step3:然后计算每列中True的总数-列表如下所示:
df1_c0_intersection_df2_bool_count = [2, 1, 0, 0]
从上面的列表中可以看出,df1中的第一列与df2中的第一列是基于最高的交叉计数得分进行匹配的。
Step4: df1中的第一列和df2中的第一列将被选择退出。然后将重复相同的列2,依此类推。
df1_c1_intersection_df2 = [[1∩0, 1∩1, 1∩1],
[1∩1, 1∩1, 1∩1],
[1∩2, 1∩1, 1∩1]]
df1_c1_intersection_df2_bool = [[False, True, True],
[True, True, True],
[False, True, True]]
df1_c1_intersection_df2_bool_count = [1, 3, 3]
那么我们将选择其中的任何一个,因为有两个相同的分数。等等......最后,它将返回列的布尔计数的所有最大值的列表。例如,上面的final_list = [max(df1_c0_intersection_df2_bool_count), max(df1_c1_intersection_df2_bool_count)] = [2,3].
我尝试了如下方法。
ints_list = []
for i in range(0, len(df1.T)):
for j in range(0, len(df2.T)):
res = np.intersect1d(df1.iloc[:, :i].values, df2.iloc[:, :j].values)
ints_list.append(res)
为了帮助您理解我所说的内容,下面是步骤1到步骤3的一半解决方案。
DataFrame设置/初始化
import pandas as pd
import numpy as np
df1 = pd.DataFrame([[2., 1., 1., 3.],
[2., 1., 1., 1.],
[2., 1., 1., 0.],
[2., 1., 1., 3.]])
df2 = pd.DataFrame([[2., 0., 1., 1.],
[0., 1., 1., 1.],
[0., 0., 1., 1.],
[2., 2., 1., 1.]] )
print (df1)
print (df2)
df1:
0 1 2 3
0 2.0 1.0 1.0 3.0
1 2.0 1.0 1.0 1.0
2 2.0 1.0 1.0 0.0
3 2.0 1.0 1.0 3.0
df2:
0 1 2 3
0 2.0 0.0 1.0 1.0
1 0.0 1.0 1.0 1.0
2 0.0 0.0 1.0 1.0
3 2.0 2.0 1.0 1.0
步骤1 &2:创建交集布尔DataFrame h1> 为我在迭代相同的dataframe,所以我可以同时做这两件事。
df1_df2_l = {} #will store the key:val pair for df_list
df1_df2_b = {} #will store the key:val pair for df_bool
df1_df2_bool_count = []
for col1 in df1.columns:
bool_counts = [] #to store bool counts for each df1 column
c1 = df1[col1].tolist() #create the list of values for the df1 column
for col2 in df2.columns:
c2 = df2[col2].tolist() #create the list of values for the df2 column
# do the intersection check per your requirement
# If found, store the match, else NaN
z_list = [x if x==y else np.NaN for x,y in zip(c1,c2)]
# do the intersection boolean check per your requirement
# If found, store True else False
z_bool = [x==y for x,y in zip(c1,c2)]
#Now capture this information into the dictionary as key:val pair
df1_df2_l[str(col1)+'_'+str(col2)] = z_list
df1_df2_b[str(col1)+'_'+str(col2)] = z_bool
#store bool_count for each column match with df2
bool_counts.append(sum(z_bool))
#store the bool counts for df1 column as a row in bool_count list
df1_df2_bool_count.append(bool_counts)
#Now that you have this information, convert it into a dataframe
df1_df2_list = pd.DataFrame(df1_df2_l)
df1_df2_bool = pd.DataFrame(df1_df2_b)
df1_df2_l = {} #will store the key:val pair for df_list
df1_df2_b = {} #will store the key:val pair for df_bool
df1_df2_bool_count = []
for col1 in df1.columns:
bool_counts = [] #to store bool counts for each df1 column
c1 = df1[col1].tolist() #create the list of values for the df1 column
for col2 in df2.columns:
c2 = df2[col2].tolist() #create the list of values for the df2 column
# do the intersection check per your requirement
# If found, store the match, else NaN
z_list = [x if x==y else np.NaN for x,y in zip(c1,c2)]
# do the intersection boolean check per your requirement
# If found, store True else False
z_bool = [x==y for x,y in zip(c1,c2)]
#Now capture this information into the dictionary as key:val pair
df1_df2_l[str(col1)+'_'+str(col2)] = z_list
df1_df2_b[str(col1)+'_'+str(col2)] = z_bool
#store bool_count for each column match with df2
bool_counts.append(sum(z_bool))
#store the bool counts for df1 column as a row in bool_count list
df1_df2_bool_count.append(bool_counts)
#Now that you have this information, convert it into a dataframe
df1_df2_list = pd.DataFrame(df1_df2_l)
df1_df2_bool = pd.DataFrame(df1_df2_b)
输出如下所示:
坳的名字:
0_0 = first col in df1 and first col in df2
1_3 = df1中的第二col和df2中的第四col
step1 Results:
0_0 0_1 0_2 0_3 1_0 1_1 1_2 ... 2_1 2_2 2_3 3_0 3_1 3_2 3_3
0 2.0 NaN NaN NaN NaN NaN 1.0 ... NaN 1.0 1.0 NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN 1.0 1.0 ... 1.0 1.0 1.0 NaN 1.0 1.0 1.0
2 NaN NaN NaN NaN NaN NaN 1.0 ... NaN 1.0 1.0 0.0 0.0 NaN NaN
3 2.0 2.0 NaN NaN NaN NaN 1.0 ... NaN 1.0 1.0 NaN NaN NaN NaN
step2 Results:
0_0 0_1 0_2 0_3 1_0 ... 2_3 3_0 3_1 3_2 3_3
0 True False False False False ... True False False False False
1 False False False False False ... True False True True True
2 False False False False False ... True True True False False
3 True True False False False ... True False False False False
步骤3:df_bool
中的值计数#print the bool counts
print (df1_df2_bool_count)
#You dont need to do the below part as we calculated the bool count already
#df1_df2_bool_count = [sum(df1_df2_bool[cols].tolist()) for cols in df1_df2_bool.columns]
#print (df1_df2_bool_count)
它的输出将是:
#new ask to store data per column
[[2, 1, 0, 0], [0, 1, 4, 4], [0, 1, 4, 4], [1, 2, 1, 1]]
#older result without separating for each column
#[2, 1, 0, 0, 0, 1, 4, 4, 0, 1, 4, 4, 1, 2, 1, 1]
你是否期待这样的解决方案:
df1[0] matches best with df2[0] #since 2 is highest of 2, 1, 0, 0 (2 is in 1st (index 0) position)
df1[1] matches best with df2[2] #since 4 is highest of 0, 1, 4, 4 (4 is in 3rd (index 2) position)
df1[2] matches best with df2[3] #since 4 is highest of 0, 1, 4, 4 (4 is in 3rd (index 2) and 4th (index 3) position. We already picked 3rd, so we pick 4th (index 3)
df1[3] matches best with df2[1] #since 2 is highest of 1, 2, 1, 1 (2 is in 2nd (index 1) position)
这些是我们可以找到的df1和df2中每个列的最佳匹配。