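What is an efficient way to check values across multiple rows with the same key in two different Excel files/DataFrames?

I have two Excel files. Both contain information about the same data objects. Each data object is identified by an object number (column ON) of type str.

Example: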

Table 1                                Table 2
ON      colA  colB  colToUpdate        ON      colImportant
1.2.3   abc   123                      1.2.3   inf
2.9.6   ert   987                      1.2.3   mat
3.5.0   nms   021                      2.9.6   mat
                                       2.9.6   tr
                                       2.9.6   ch
                                       3.5.0   tr

and

myValues={inf, ch}

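Task:

I need to check whether any of the colImportant values stored for a data object in Table 2 is in myValues. If so, that data object (the row with the same object number) should get the value "Ok" in colToUpdate in df1.

Expected: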

new Table 1
ON      colA  colB  colToUpdate        
1.2.3    abc   123     Ok                
2.9.6    ert   987     Ok               
3.5.0    nms   021     NaN

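I thought about loading both into separate DataFrames (Table 1 in df1 and Table 2 in df2) and, each time I update the next row in df1, searching df2 for the same object number. But that would scan all of df2 every time (there are about 30,000 data objects, which means 30,000 rows in df1; df2 has about 75,000 rows, because a data object can be stored several times with a different value in colImportant, as shown above).

Another idea is to create a tempCol in df1 that holds all of the object's colImportant values from df2, joined with a separator such as , (but how? I would need to combine multiple rows of df2 into one, rather than merge the DataFrames on 'ON'). Then, when I want to update rows in df1 based on some condition, I would have to check the split values. When done, I can drop tempCol. It should look like this: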

Table 1                                
ON      colA  colB  colToUpdate tempCol       
1.2.3    abc   123               inf,mat       
2.9.6    ert   987               mat,tr,ch      
3.5.0    nms   021               tr
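
The "combine multiple rows into one" step for tempCol could be sketched as follows. This is only a minimal illustration, assuming the example data above has already been loaded into df1 and df2 (the DataFrames are rebuilt by hand here so the snippet runs on its own); it uses groupby with a string join and then a lookup by ON instead of a merge.

```python
import pandas as pd

# Example data from the tables above (assumed to be loaded from the Excel files).
df1 = pd.DataFrame({
    "ON": ["1.2.3", "2.9.6", "3.5.0"],
    "colA": ["abc", "ert", "nms"],
    "colB": ["123", "987", "021"],
})
df2 = pd.DataFrame({
    "ON": ["1.2.3", "1.2.3", "2.9.6", "2.9.6", "2.9.6", "3.5.0"],
    "colImportant": ["inf", "mat", "mat", "tr", "ch", "tr"],
})

# Collapse all colImportant values of one object number into a single
# comma-separated string (one entry per ON).
joined = df2.groupby("ON")["colImportant"].agg(",".join)

# Look up the joined string for every ON in df1; no merge is needed.
df1["tempCol"] = df1["ON"].map(joined)
print(df1)
```

Object numbers that do not occur in df2 would end up with NaN in tempCol, and the column can be removed afterwards with df1.drop(columns="tempCol").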

Here is my approach:

import numpy as np

# Per object number: 'OK' if any colImportant value is in myValues, else NaN
tmp_df = df2.groupby('ON').colImportant.apply(
    lambda x: 'OK' if x.isin(myValues).any() else np.nan)
# Left-join the per-object flag onto df1 by object number
df1 = df1.merge(tmp_df.reset_index(), on='ON', how='left')

Output:

+----+-------+--------+----------------+
|    | ON    | colA   | colImportant   |
|----+-------+--------+----------------|
|  0 | 1.2.3 | abc    | OK             |
|  1 | 2.9.6 | ert    | OK             |
|  2 | 3.5.0 | nms    | nan            |
+----+-------+--------+----------------+

It is not perfect, but I think you can work it out from here.
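
One possible variation on the same idea skips the groupby and writes straight into colToUpdate. This is a sketch rather than a definitive solution: it assumes the example data from the question (rebuilt by hand here so it runs standalone) and that myValues is a set of strings. It first collects the object numbers with at least one matching colImportant value, then flags those rows in df1.

```python
import pandas as pd

# Example data from the question (assumed to be loaded from the Excel files).
df1 = pd.DataFrame({
    "ON": ["1.2.3", "2.9.6", "3.5.0"],
    "colA": ["abc", "ert", "nms"],
    "colB": ["123", "987", "021"],
})
df2 = pd.DataFrame({
    "ON": ["1.2.3", "1.2.3", "2.9.6", "2.9.6", "2.9.6", "3.5.0"],
    "colImportant": ["inf", "mat", "mat", "tr", "ch", "tr"],
})
myValues = {"inf", "ch"}

# Object numbers that have at least one colImportant value in myValues.
hits = df2.loc[df2["colImportant"].isin(myValues), "ON"].unique()

# True -> "Ok"; False is not a key in the dict, so those rows become NaN.
df1["colToUpdate"] = df1["ON"].isin(hits).map({True: "Ok"})
print(df1)
```

Both isin calls are vectorized, so df2 is scanned once rather than once per row of df1.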
