Pandas:根据字典中的值搜索输入数据



我有2个数据源-配置和输入数据

配置

+------------+-------+
| ContractID | PB ID |
+------------+-------+
| H9500      | 002   |
+------------+-------+
| H9500      | 008   |
+------------+-------+
| H3544      | 800   |
+------------+-------+
| H3544      | 801   |
+------------+-------+
| H3544      | 802   |
+------------+-------+

输入文件

+-------+------------+-------+
| Index | ContractID | PB ID |
+-------+------------+-------+
| 1     | H9500      | 456   |
+-------+------------+-------+
| 2     | H9500      | 008   |
+-------+------------+-------+
| 3     | H9500      | 002   |
+-------+------------+-------+
| 4     | H3544      | 853   |
+-------+------------+-------+
| 5     | H3544      | 802   |
+-------+------------+-------+
| 6     | H4599      | 465   |
+-------+------------+-------+

我正试图通过输入文件循环并检查合同ID是否有效。如果Contract ID = H9500PB ID= 456是无效的,因为该组合不存在于配置文件中。Contract ID = H9500PB ID= 008是有效的,因为它们的组合存在于配置文件中。

我使用下面的逻辑

input_df=pd.DataFrame({'Index': [1,2,3,4,5,6], 
'ContractID' : ['H9500','H9500','H9500','H3544','H3544','H4599'],
'PBID': ['456','008','002','853','802','465']})
config_df=pd.DataFrame({'ContractID':['H9500','H9500','H3544','H3544','H3544'],
'PBID':['002','008','800','801','802']})
config_dict={k: list(v) for k,v in config_df.groupby("ContractID")["PBID"]}
def test_6_3(s):
if config_dict.get(s["ContractID"]) and s["PBID"] not in config_dict.values():
return "Invalid PB Contract"
else:
return "Valid"

input_df['test_6_3'] = input_df.apply(test_6_3, axis=1)
input_df

但是我没有得到预期的结果

+-------+------------+-------+----------+
| Index | ContractID | PB ID | test_6_3 |
+-------+------------+-------+----------+
| 1     | H9500      | 456   | Invalid  |
+-------+------------+-------+----------+
| 2     | H9500      | 008   | Valid    |
+-------+------------+-------+----------+
| 3     | H9500      | 002   | Valid    |
+-------+------------+-------+----------+
| 4     | H3544      | 853   | Invalid  |
+-------+------------+-------+----------+
| 5     | H3544      | 802   | Valid    |
+-------+------------+-------+----------+
| 6     | H4599      | 465   | Invalid  |
+-------+------------+-------+----------+

已更新(基于OP的更正输入):

这应该做你的问题所要求的:

df = input_df.join(config_df.assign(test_6_3='Valid').set_index(['ContractID', 'PBID']), on = ['ContractID', 'PBID']).fillna('Invalid')

输入:

Index ContractID PBID
0      1      H9500  456
1      2      H9500  008
2      3      H9500  002
3      4      H3544  853
4      5      H3544  802
5      6      H4599  465
ContractID PBID
0      H9500  002
1      H9500  008
2      H3544  800
3      H3544  801
4      H3544  802

输出:

Index ContractID PBID test_6_3
0      1      H9500  456  Invalid
1      2      H9500  008    Valid
2      3      H9500  002    Valid
3      4      H3544  853  Invalid
4      5      H3544  802    Valid
5      6      H4599  465  Invalid

更新# 2:

要在OP的问题中坚持apply()方法并使其工作,您可以修改test_6_3()如下:

def test_6_3(s):
if s["ContractID"] in config_dict and s["PBID"] in config_dict[s["ContractID"]]:
return "Valid"
else:
return "Invalid"

输出:

Index ContractID PBID test_6_3
0      1      H9500  456  Invalid
1      2      H9500  008    Valid
2      3      H9500  002    Valid
3      4      H3544  853  Invalid
4      5      H3544  802    Valid
5      6      H4599  465  Invalid
names = ['ContractID', 'PBID']
config = dfc[names].apply(tuple, axis=1).to_list()
config

输出:

[('H9500', 2), ('H9500', 8), ('H3544', 800), ('H3544', 801), ('H3544', 802)]

代码:

(
df.assign(test_6_3=df[names].apply(lambda x: tuple(x) in config, axis=1))
.assign(test_6_3=lambda x: x.test_6_3.map({True: 'Valid', False: 'Invalid'}))
)

输出:

Index ContractID  PBID test_6_3
0      1      H9500   456  Invalid
1      2      H9500     8    Valid
2      3      H9500     2    Valid
3      4      H3544   853  Invalid
4      5      H3544   802    Valid
5      6      H4599   465  Invalid

相关内容

  • 没有找到相关文章

最新更新