基于另一个数据帧中的多个列条件创建列



假设我有两个数据帧 - 条件和数据。

import pandas as pd
conditions = pd.DataFrame({'class': [1,2,3,4,4,5,5,4,4,5,5,5],
'primary_lower': [0,0,0,160,160,160,160,160,160,160,160,800],
'primary_upper':[9999,9999,9999,480,480,480,480,480,480,480,480,4000],
'secondary_lower':[0,0,0,3500,6100,3500,6100,0,4800,0,4800,10],
'secondary_upper':[9999,9999,9999,4700,9999,4700,9999,4699,6000,4699,6000,3000],
'group':['A','A','A','B','B','B','B','C','C','C','C','C']})
data = pd.DataFrame({'class':[1,1,4,4,5,5,2],
'primary':[2000,9100,1100,170,300,210,1000],
'secondary':[1232,3400,2400,380,3600,4800,8600]})

我想在"data"表中生成一个新列(组(,根据"条件"表中提供的条件,该列(组(为每一行分配一个组。

条件表的结构使每个组中的行由"OR"连接,列由"AND"连接。例如,要分配组"B":

(类 = 4 和 160<=主要<=480 和 3500<=次要<=4700(

(类 = 4 和 160<=主要<=480 和 6100<=次要<=9999(

(类 = 5 和 160<=主要<=480 和 3500<=次要<=4700(

(类 = 5 和 160<=主要<=480 和 6100<=次要<=9999(

任何与任何条件都不匹配的行都将被分配为组"其他"。因此,最终数据帧将如下所示:

+-------+---------+-----------+-------+
| class | primary | secondary | group |
+-------+---------+-----------+-------+
|     1 |    2000 |      1232 | A     |
|     1 |    9100 |      3400 | A     |
|     4 |    1100 |      2400 | Other |
|     4 |     170 |       380 | C     |
|     5 |     300 |      3600 | B     |
|     5 |     210 |      4800 | C     |
|     2 |    1000 |      8600 | A     |
+-------+---------+-----------+-------+

您可以迭代一个GroupBy对象,并在每个组中采用掩码的并集:

for key, grp in conditions.groupby('group'):
cols = ['class', 'primary_lower', 'primary_upper',
'secondary_lower', 'secondary_upper']
masks = (data['class'].eq(cls) & 
data['primary'].between(prim_lower, prim_upper) & 
data['secondary'].between(sec_lower, sec_upper) 
for cls, prim_lower, prim_upper, sec_lower, sec_upper in 
grp[cols].itertuples(index=False))
data.loc[pd.concat(masks, axis=1).any(1), 'group'] = key
data['group'] = data['group'].fillna('Other')

结果:

print(data)
class  primary  secondary  group
0      1     2000       1232      A
1      1     9100       3400      A
2      4     1100       2400  Other
3      4      170        380      C
4      5      300       3600      C
5      5      210       4800      C
6      2     1000       8600      A

请注意index=4的结果与所需的输出不同,因为有多个条件满足数据。

最新更新