我有2个数据源-配置和输入数据
配置
+------------+-------+
| ContractID | PB ID |
+------------+-------+
| H9500 | 002 |
+------------+-------+
| H9500 | 008 |
+------------+-------+
| H3544 | 800 |
+------------+-------+
| H3544 | 801 |
+------------+-------+
| H3544 | 802 |
+------------+-------+
输入文件
+-------+------------+-------+
| Index | ContractID | PB ID |
+-------+------------+-------+
| 1 | H9500 | 456 |
+-------+------------+-------+
| 2 | H9500 | 008 |
+-------+------------+-------+
| 3 | H9500 | 002 |
+-------+------------+-------+
| 4 | H3544 | 853 |
+-------+------------+-------+
| 5 | H3544 | 802 |
+-------+------------+-------+
| 6 | H4599 | 465 |
+-------+------------+-------+
我正试图通过输入文件循环并检查合同ID是否有效。如果Contract ID = H9500
和PB ID= 456
是无效的,因为该组合不存在于配置文件中。Contract ID = H9500
和PB ID= 008
是有效的,因为它们的组合存在于配置文件中。
我使用下面的逻辑
input_df=pd.DataFrame({'Index': [1,2,3,4,5,6],
'ContractID' : ['H9500','H9500','H9500','H3544','H3544','H4599'],
'PBID': ['456','008','002','853','802','465']})
config_df=pd.DataFrame({'ContractID':['H9500','H9500','H3544','H3544','H3544'],
'PBID':['002','008','800','801','802']})
config_dict={k: list(v) for k,v in config_df.groupby("ContractID")["PBID"]}
def test_6_3(s):
if config_dict.get(s["ContractID"]) and s["PBID"] not in config_dict.values():
return "Invalid PB Contract"
else:
return "Valid"
input_df['test_6_3'] = input_df.apply(test_6_3, axis=1)
input_df
但是我没有得到预期的结果
+-------+------------+-------+----------+
| Index | ContractID | PB ID | test_6_3 |
+-------+------------+-------+----------+
| 1 | H9500 | 456 | Invalid |
+-------+------------+-------+----------+
| 2 | H9500 | 008 | Valid |
+-------+------------+-------+----------+
| 3 | H9500 | 002 | Valid |
+-------+------------+-------+----------+
| 4 | H3544 | 853 | Invalid |
+-------+------------+-------+----------+
| 5 | H3544 | 802 | Valid |
+-------+------------+-------+----------+
| 6 | H4599 | 465 | Invalid |
+-------+------------+-------+----------+
已更新(基于OP的更正输入):
这应该做你的问题所要求的:
df = input_df.join(config_df.assign(test_6_3='Valid').set_index(['ContractID', 'PBID']), on = ['ContractID', 'PBID']).fillna('Invalid')
输入:
Index ContractID PBID
0 1 H9500 456
1 2 H9500 008
2 3 H9500 002
3 4 H3544 853
4 5 H3544 802
5 6 H4599 465
ContractID PBID
0 H9500 002
1 H9500 008
2 H3544 800
3 H3544 801
4 H3544 802
输出:
Index ContractID PBID test_6_3
0 1 H9500 456 Invalid
1 2 H9500 008 Valid
2 3 H9500 002 Valid
3 4 H3544 853 Invalid
4 5 H3544 802 Valid
5 6 H4599 465 Invalid
更新# 2:
要在OP的问题中坚持apply()
方法并使其工作,您可以修改test_6_3()
如下:
def test_6_3(s):
if s["ContractID"] in config_dict and s["PBID"] in config_dict[s["ContractID"]]:
return "Valid"
else:
return "Invalid"
输出:
Index ContractID PBID test_6_3
0 1 H9500 456 Invalid
1 2 H9500 008 Valid
2 3 H9500 002 Valid
3 4 H3544 853 Invalid
4 5 H3544 802 Valid
5 6 H4599 465 Invalid
names = ['ContractID', 'PBID']
config = dfc[names].apply(tuple, axis=1).to_list()
config
输出:
[('H9500', 2), ('H9500', 8), ('H3544', 800), ('H3544', 801), ('H3544', 802)]
代码:
(
df.assign(test_6_3=df[names].apply(lambda x: tuple(x) in config, axis=1))
.assign(test_6_3=lambda x: x.test_6_3.map({True: 'Valid', False: 'Invalid'}))
)
输出:
Index ContractID PBID test_6_3
0 1 H9500 456 Invalid
1 2 H9500 8 Valid
2 3 H9500 2 Valid
3 4 H3544 853 Invalid
4 5 H3544 802 Valid
5 6 H4599 465 Invalid