Fix misspelled column values based on a column in another DataFrame



I have two DataFrames. The first is location:

import pandas as pd

location = pd.DataFrame({
    'city': ['RIYADH', 'SEOUL', 'BUSAN', 'TOKYO', 'OSAKA'],
    'country': ['Saudi Arabia', 'South Korea', 'South Korea', 'Japan', 'Japan']})

The other is customer:

customer = pd.DataFrame({
    'id': [1001, 2002, 3003, 4004, 5005, 6006, 7007, 8008, 9009],
    'city': ['tokio', 'Sorth KOREA', 'riadh', 'JAPANN', 'tokyo', 'osako', 'Arab Saudi', 'SEOUL', 'buSN']})

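I want to replace the misspelled values in the customer DataFrame's city column with the matching city or country from the location DataFrame. The output should look like this: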

id    location
1001  TOKYO
2002  South Korea
3003  RIYADH
4004  Japan
5005  TOKYO
6006  OSAKA
7007  Saudi Arabia
8008  SEOUL
9009  BUSAN

One possible solution, based on RapidFuzz:

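from rapidfuzz import process
# Note: RapidFuzz 3.x no longer lowercases input by default; pass
# processor=rapidfuzz.utils.default_process for case-insensitive matching.

# Step 1: match each customer value against 'CITY*Country' strings to find
# the most plausible row of location.
out = (customer.assign(
    aux = customer['city']
    .map(lambda x:
         process.extractOne(x, location['city']+'*'+location['country'])[0])))
# Step 2: split the matched string back into its city and country parts.
out[['aux1', 'aux2']] = out['aux'].str.split(r'*', expand=True)
# Step 3: keep whichever of the two parts is closer to the original value.
out['city'] = out.apply(lambda x:
    process.extractOne(x['city'], x.loc['aux1':'aux2'])[0], axis=1)
out = out.drop(columns=['aux', 'aux1', 'aux2'])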

Output:

     id          city
0  1001         TOKYO
1  2002   South Korea
2  3003        RIYADH
3  4004         Japan
4  5005         TOKYO
5  6006         OSAKA
6  7007  Saudi Arabia
7  8008         SEOUL
8  9009         BUSAN

Edit

This is an attempt to address the OP's comment below:

import numpy as np
from rapidfuzz import process

def get_match(x, y, score):
    # Return the best match in y, or NaN if its score is below the threshold.
    match = process.extractOne(x, y)
    return np.nan if match[1] < score else match[0]

out = (customer.assign(
    aux=customer['city']
    .map(lambda x:
         process.extractOne(x, location['city']+'*'+location['country'])[0])))
out[['aux1', 'aux2']] = out['aux'].str.split(r'*', expand=True)
# Accept a correction only when its similarity score is at least 92.
out['city'] = out.apply(lambda x: get_match(
    x['city'], x.loc['aux1':'aux2'], 92), axis=1)
out = out.drop(columns=['aux', 'aux1', 'aux2'])

Output:

     id   city
0  1001    NaN
1  2002    NaN
2  3003    NaN
3  4004    NaN
4  5005  TOKYO
5  6006    NaN
6  7007    NaN
7  8008  SEOUL
8  9009    NaN
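The thresholding idea (accept the best match only if it is similar enough, otherwise leave the value unmatched) can also be sketched without third-party dependencies. The example below swaps RapidFuzz for the standard library's difflib.SequenceMatcher, so its scores are on a 0 to 1 scale and are not comparable to the 92 used above. The function name match_or_none, the candidates list, and the 0.9 default cutoff are assumptions made for illustration.

```python
from difflib import SequenceMatcher

# Valid spellings taken from the location DataFrame in the question.
candidates = ['RIYADH', 'SEOUL', 'BUSAN', 'TOKYO', 'OSAKA',
              'Saudi Arabia', 'South Korea', 'Japan']

def match_or_none(value, choices, cutoff=0.9):
    """Return the most similar choice, or None if its ratio is below cutoff."""
    def ratio(choice):
        # Compare case-insensitively; ratio() returns a float in [0, 1].
        return SequenceMatcher(None, value.lower(), choice.lower()).ratio()

    best = max(choices, key=ratio)
    return best if ratio(best) >= cutoff else None

for raw in ['tokyo', 'SEOUL', 'tokio']:
    print(raw, '->', match_or_none(raw, candidates))
```

Lowering the cutoff trades precision for coverage: with cutoff=0.75, for instance, 'tokio' is accepted as 'TOKYO', while the stricter default leaves it unmatched.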
