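How to check whether the values in three columns of a DataFrame are all less than or equal to the corresponding values in a standard DataFrame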



I have a standard DataFrame template, df1:

import pandas as pd

a = [['100', '20', '20'], ['150', '60', '80'], ['200', '100', '100'], ['500', '50', '100']]
df1 = pd.DataFrame(a, columns=['A', 'B', 'C'])
df1
Out[50]: 
     A    B    C
0  100   20   20
1  150   60   80
2  200  100  100
3  500   50  100

and my dataset, df2:

b = [['180.5', '60.3', '20.2'], ['40.2', '2.2', '33.9'], ['205', '100', '100'], ['455', '20', '88'],
     ['100', '60', '130'], ['10', '10', '10'], ['300.64', '66.4', '29.4']]
df2 = pd.DataFrame(b, columns=['A1', 'B1', 'C1'])
df2
       A1    B1    C1
0   180.5  60.3  20.2
1    40.2   2.2  33.9
2     205   100   100
3     455    20    88
4     100    60   130
5      10    10    10
6  300.64  66.4  29.4

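I want to compare the A1, B1, C1 values of each row of df2 against all rows of df1, and append to df2 only the set of standard values for which all 3 conditions are met (A1<=A, B1<=B, C1<=C). If the conditions are met by multiple rows of df1, I only need the first set of standard values.

This can be thought of as checking whether all 3 dimensions fit within any of the available standard size templates.

Expected output for my dataset df2:

       A1    B1    C1      A      B      C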
0   180.5  60.3  20.2  200.0  100.0  100.0
1    40.2   2.2  33.9  150.0   60.0   80.0
2     205   100   100    NaN    NaN    NaN
3     455    20    88  500.0   50.0  100.0
4     100    60   130    NaN    NaN    NaN
5      10    10    10  100.0   20.0   20.0
6  300.64  66.4  29.4    NaN    NaN    NaN

I couldn't find any suitable solution. I tried using list comprehensions, but couldn't get any closer.

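NumPy broadcasting

# the input values are strings, so convert them to numbers first
df1 = df1.astype(float)
df2 = df2.astype(float)
# m[i, j] is True when row i of df2 fits within row j of df1 on all three columns
m = (df2.values[:, None] <= df1.values).all(-1)
# take the first matching df1 row for each df2 row, keeping only rows that have a match
df2 = df2.join(df1.iloc[m.argmax(1)].set_axis(df2.index)[m.any(1)])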

Explanation

We can use broadcasting to build a boolean mask by comparing every row of df2 with every row of df1, and then reduce this mask with all along axis=-1:

>>> m
array([[False, False,  True, False],
       [False,  True,  True,  True],
       [False, False, False, False],
       [False, False, False,  True],
       [False, False, False, False],
       [ True,  True,  True,  True],
       [False, False, False, False]])
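
To make the shapes in this step concrete, here is a minimal NumPy-only sketch. The names standards, items, pairwise and fits are illustrative and not part of the answer; the values are the ones from the question, already converted to floats.

```python
import numpy as np

# Standard templates (4 rows) and measured items (7 rows), 3 columns each.
standards = np.array([[100, 20, 20], [150, 60, 80],
                      [200, 100, 100], [500, 50, 100]], dtype=float)
items = np.array([[180.5, 60.3, 20.2], [40.2, 2.2, 33.9], [205, 100, 100],
                  [455, 20, 88], [100, 60, 130], [10, 10, 10],
                  [300.64, 66.4, 29.4]], dtype=float)

# items[:, None] has shape (7, 1, 3); comparing it with an array of
# shape (4, 3) broadcasts to (7, 4, 3): one comparison per item,
# per template, per column.
pairwise = items[:, None] <= standards
print(pairwise.shape)

# Collapse the column axis: an item fits a template only if all three
# comparisons are True. The result has shape (7, 4).
fits = pairwise.all(axis=-1)
print(fits.shape)
```

Note that the intermediate array holds len(items) * len(standards) * 3 booleans, so for very large inputs memory use may become a concern.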

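Now use argmax along axis=1 on the mask m to find, for each row, the index of the first maximum, that is, the first True, where all the conditions A1<=A, B1<=B, C1<=C are satisfied: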

>>> m.argmax(1)
array([2, 1, 0, 3, 0, 0, 0])
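
One thing worth noticing here: argmax also returns 0 for a row that is False everywhere, so the result alone cannot distinguish "first template matches" from "no template matches". A small sketch of that behavior, using a made-up two-row mask (mask, first, has_match and first_or_missing are illustrative names):

```python
import numpy as np

# Row 0 has its first True at position 2; row 1 has no True at all.
mask = np.array([[False, False, True, False],
                 [False, False, False, False]])

# argmax returns the position of the first maximum; for an all-False
# row the maximum is False, which first occurs at position 0.
first = mask.argmax(axis=1)

# any() tells the two cases apart.
has_match = mask.any(axis=1)

# Optionally mark rows without a match with -1.
first_or_missing = np.where(has_match, first, -1)
print(first, has_match, first_or_missing)
```

This is why the one-liner combines m.argmax(1) with m.any(1).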

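With the help of iloc, select the rows of df1 that correspond to these indices, and set the index of the selected rows to match the index of df2: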

>>> df1.iloc[m.argmax(1)].set_axis(df2.index)
       A      B      C
0  200.0  100.0  100.0
1  150.0   60.0   80.0
2  100.0   20.0   20.0
3  500.0   50.0  100.0
4  100.0   20.0   20.0
5  100.0   20.0   20.0
6  100.0   20.0   20.0

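Now filter the selected rows with m.any(1), so that only the rows that actually satisfy the condition A1<=A, B1<=B, C1<=C remain, and join the filtered rows to df2: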

>>> df2
       A1     B1     C1      A      B      C
0  180.50   60.3   20.2  200.0  100.0  100.0
1   40.20    2.2   33.9  150.0   60.0   80.0
2  205.00  100.0  100.0    NaN    NaN    NaN
3  455.00   20.0   88.0  500.0   50.0  100.0
4  100.00   60.0  130.0    NaN    NaN    NaN
5   10.00   10.0   10.0  100.0   20.0   20.0
6  300.64   66.4   29.4    NaN    NaN    NaN
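
For reuse, the steps above could be wrapped in a function. This is only a sketch: attach_first_fit is a hypothetical helper name, and it assumes both frames have the same number of columns in matching order and values that can be converted to float.

```python
import pandas as pd

def attach_first_fit(items: pd.DataFrame, standards: pd.DataFrame) -> pd.DataFrame:
    """Join to each row of `items` the first row of `standards` that is
    >= the item in every column (NaN where no such row exists)."""
    items = items.astype(float)
    standards = standards.astype(float)
    # fits[i, j] is True when item i fits within standard j on all columns.
    fits = (items.to_numpy()[:, None] <= standards.to_numpy()).all(axis=-1)
    # First matching standard per item, re-labelled with the items' index.
    picked = standards.iloc[fits.argmax(axis=1)].set_axis(items.index)
    # Drop the rows where argmax returned 0 only because nothing matched.
    picked = picked[fits.any(axis=1)]
    return items.join(picked)

# Data from the question (kept as strings, as in the original post).
df1 = pd.DataFrame([['100', '20', '20'], ['150', '60', '80'],
                    ['200', '100', '100'], ['500', '50', '100']],
                   columns=['A', 'B', 'C'])
df2 = pd.DataFrame([['180.5', '60.3', '20.2'], ['40.2', '2.2', '33.9'],
                    ['205', '100', '100'], ['455', '20', '88'],
                    ['100', '60', '130'], ['10', '10', '10'],
                    ['300.64', '66.4', '29.4']],
                   columns=['A1', 'B1', 'C1'])

result = attach_first_fit(df2, df1)
print(result)
```

Unlike the one-liner, this does not overwrite df2, so it can be run repeatedly without the joined columns colliding.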

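First do a cross join of all rows with how='cross', then filter by the conditions, drop duplicates by the index value so that only the first match is kept, and finally add the result to df2 with DataFrame.join: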

# input data are strings, so convert them to numbers
df1 = df1.astype(float)
df2 = df2.astype(float)
# how='cross' needs pandas 1.2 or later
df = pd.merge(df2.reset_index(), df1, how='cross')
df = (df[(df.A1<=df.A) & (df.B1<=df.B) & (df.C1<=df.C)]
        .drop_duplicates('index')
        .set_index('index')[df1.columns])
print (df)
           A      B      C
index                     
0      200.0  100.0  100.0
1      150.0   60.0   80.0
3      500.0   50.0  100.0
5      100.0   20.0   20.0
df = df2.join(df)
print (df)
       A1     B1     C1      A      B      C
0  180.50   60.3   20.2  200.0  100.0  100.0
1   40.20    2.2   33.9  150.0   60.0   80.0
2  205.00  100.0  100.0    NaN    NaN    NaN
3  455.00   20.0   88.0  500.0   50.0  100.0
4  100.00   60.0  130.0    NaN    NaN    NaN
5   10.00   10.0   10.0  100.0   20.0   20.0
6  300.64   66.4   29.4    NaN    NaN    NaN
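
If how='cross' is not available (it was added in pandas 1.2), the same idea can be sketched with a constant helper key. The column names _key and std_pos are made up for this example; the explicit sort on std_pos is there so that "first match" does not depend on the row order the merge happens to produce.

```python
import pandas as pd

df1 = pd.DataFrame([[100, 20, 20], [150, 60, 80],
                    [200, 100, 100], [500, 50, 100]],
                   columns=['A', 'B', 'C'], dtype=float)
df2 = pd.DataFrame([[180.5, 60.3, 20.2], [40.2, 2.2, 33.9], [205, 100, 100],
                    [455, 20, 88], [100, 60, 130], [10, 10, 10],
                    [300.64, 66.4, 29.4]],
                   columns=['A1', 'B1', 'C1'], dtype=float)

# Emulate a cross join: every row gets the same key on both sides.
left = df2.reset_index().assign(_key=1)
right = df1.rename_axis('std_pos').reset_index().assign(_key=1)
pairs = left.merge(right, on='_key').drop(columns='_key')

# Order by item, then by template position, so the first surviving
# row per item is the first matching template.
pairs = pairs.sort_values(['index', 'std_pos'])
ok = pairs[(pairs.A1 <= pairs.A) & (pairs.B1 <= pairs.B) & (pairs.C1 <= pairs.C)]
first = ok.drop_duplicates('index').set_index('index')[df1.columns]

result = df2.join(first)
print(result)
```

Either way, the cross join materializes len(df2) * len(df1) rows, so the approach is best suited to a small template table.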
