I have two DataFrames. The first is location:
import pandas as pd
location = pd.DataFrame({'city': ['RIYADH','SEOUL','BUSAN','TOKYO','OSAKA'],
                         'country': ['Saudi Arabia','South Korea','South Korea','Japan','Japan']})
The other is customer:
customer = pd.DataFrame({'id': [1001,2002,3003,4004,5005,6006,7007,8008,9009],
                         'city': ['tokio','Sorth KOREA','riadh','JAPANN','tokyo','osako','Arab Saudi','SEOUL','buSN']})
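I want to replace each misspelled value in the city column of the customer DataFrame with the closest correct city or country from the location DataFrame. The output should look like this: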
id location
1001 TOKYO
2002 South Korea
3003 RIYADH
4004 Japan
5005 TOKYO
6006 OSAKA
7007 Saudi Arabia
8008 SEOUL
9009 BUSAN
A possible solution, based on RapidFuzz:
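from rapidfuzz import process, utils

# One "CITY*Country" candidate string per row of location
choices = location['city'] + '*' + location['country']
# RapidFuzz 3.x no longer normalizes case by default, so pass a processor explicitly
out = customer.assign(
    aux=customer['city'].map(lambda x: process.extractOne(
        x, choices, processor=utils.default_process)[0]))
# Split the best candidate back into its city and country parts
out[['aux1', 'aux2']] = out['aux'].str.split('*', expand=True)
# Keep whichever part, city or country, is closer to the original value
out['city'] = out.apply(lambda x: process.extractOne(
    x['city'], x.loc['aux1':'aux2'], processor=utils.default_process)[0], axis=1)
out = out.drop(columns=['aux', 'aux1', 'aux2'])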
Output:
id city
0 1001 TOKYO
1 2002 South Korea
2 3003 RIYADH
3 4004 Japan
4 5005 TOKYO
5 6006 OSAKA
6 7007 Saudi Arabia
7 8008 SEOUL
8 9009 BUSAN
Edit
This attempts to address the OP's comment below, returning NaN whenever the best match scores below a threshold:
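import numpy as np
from rapidfuzz import process, utils

def get_match(x, y, score):
    # Return the best match from y, or NaN when its score is below the threshold
    match = process.extractOne(x, y, processor=utils.default_process)
    return np.nan if match[1] < score else match[0]

choices = location['city'] + '*' + location['country']
out = customer.assign(
    aux=customer['city'].map(lambda x: process.extractOne(
        x, choices, processor=utils.default_process)[0]))
out[['aux1', 'aux2']] = out['aux'].str.split('*', expand=True)
# Accept a correction only when its similarity score is at least 92 (out of 100)
out['city'] = out.apply(lambda x: get_match(
    x['city'], x.loc['aux1':'aux2'], 92), axis=1)
out = out.drop(columns=['aux', 'aux1', 'aux2'])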
Output:
id city
0 1001 NaN
1 2002 NaN
2 3003 NaN
3 4004 NaN
4 5005 TOKYO
5 6006 NaN
6 7007 NaN
7 8008 SEOUL
8 9009 NaN
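With a threshold of 92, only the values that already match apart from letter case are kept, so a lower threshold may be more useful in practice. To get a feel for how a cutoff changes the result without installing anything, here is a minimal sketch that swaps RapidFuzz for the standard library's difflib. The helper name closest_name and the default cutoff of 0.6 are assumptions made for illustration, and difflib's ratio runs from 0.0 to 1.0 rather than RapidFuzz's 0 to 100, so the numbers are not interchangeable.

```python
from difflib import get_close_matches

# Reference values taken from the location DataFrame in the question
CITIES = ['RIYADH', 'SEOUL', 'BUSAN', 'TOKYO', 'OSAKA']
COUNTRIES = ['Saudi Arabia', 'South Korea', 'Japan']


def closest_name(value, candidates, cutoff=0.6):
    """Return the candidate most similar to value, or None below the cutoff.

    Hypothetical helper: the comparison is case-insensitive and uses
    difflib's similarity ratio (0.0 to 1.0), not a RapidFuzz score.
    """
    # Map lowercase spellings back to the canonical spelling
    by_lower = {c.lower(): c for c in candidates}
    hits = get_close_matches(value.lower(), list(by_lower), n=1, cutoff=cutoff)
    return by_lower[hits[0]] if hits else None


if __name__ == '__main__':
    for raw in ['tokio', 'tokyo', 'SEOUL', 'xyz']:
        print(raw, '->', closest_name(raw, CITIES + COUNTRIES))
```

Something like customer['city'].map(lambda v: closest_name(v, CITIES + COUNTRIES)) would apply it to the whole column. Raising the cutoff makes the helper return None for more of the misspelled values, which mirrors the NaN rows above.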