A quick way to compare two columns that contain differently written country names

Solutions in either Pandas or Spark are welcome; I am mainly interested in the logic.

My dataframes:

df_1 =
col_1    col_2    country
65783    75838    UNITED STATES
57637    83758    UNITED KINGDOM
73456    25356    KOREA, REP. OF
48577    23589    GHANA
48575    24389    SURINAME
df_2 =
col_1    col_2    country
65783    75838    United States of America
57637    83758    England
73456    25356    South Korea
48577    23589    Ghana
48575    24389    England

General code for comparing dataframes like these (it works):

import pandas as pd

def matching(df_1, df_2):
    # Join on the key columns, then keep rows whose country strings differ
    df_new = df_2.merge(df_1, on=['col_1', 'col_2'], suffixes=(None, '_actual')).query('country != country_actual')
    return df_new

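Obviously, only the last row is a real mismatch. But the names are written according to different conventions, and I actually have hundreds of countries, so how can I bring them to a common form and compare them sensibly? I know how to change values one by one, but that is not feasible for hundreds of them.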

I don't know whether there is a simple way, but the country_converter library can help you. It does not recognize England, but you can fix the misses manually:

import country_converter as coco
some_names = ['United States of America', 'UNITED KINGDOM', 'South Korea', 'Ghana', 'SURINAME',
'KOREA, REP. OF', 'UNITED STATES', 'GHANA']
standard_names = coco.convert(names=some_names, to='name_short')
print(standard_names)
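To connect this back to the question, here is a minimal sketch of the logic: standardize the country column in both dataframes, then compare the standardized values instead of the raw strings. The names matching_standardized, standardize, MANUAL_FIXES, demo_convert and _DEMO_LOOKUP are made up for illustration, and mapping England to United Kingdom is an assumption about what the asker wants. demo_convert is only a hand-written stand-in so the example runs without extra dependencies.

```python
import pandas as pd

# Hypothetical manual overrides for names a converter does not resolve.
# Assumption: England should count as the United Kingdom here.
MANUAL_FIXES = {"England": "United Kingdom"}


def standardize(names, convert):
    """Apply the manual overrides, then the supplied converter, to a list of names."""
    patched = [MANUAL_FIXES.get(name, name) for name in names]
    return convert(patched)


def matching_standardized(df_1, df_2, convert):
    """Return rows whose standardized country differs between df_2 and df_1."""
    left = df_2.assign(country_std=standardize(df_2["country"].tolist(), convert))
    right = df_1.assign(country_std=standardize(df_1["country"].tolist(), convert))
    merged = left.merge(right, on=["col_1", "col_2"], suffixes=(None, "_actual"))
    return merged.query("country_std != country_std_actual")


# Stand-in lookup for this demo only; it covers just the sample names.
_DEMO_LOOKUP = {
    "united states": "United States",
    "united states of america": "United States",
    "united kingdom": "United Kingdom",
    "korea, rep. of": "South Korea",
    "south korea": "South Korea",
    "ghana": "Ghana",
    "suriname": "Suriname",
}


def demo_convert(names):
    """Case-insensitive lookup; unknown names become 'not found'."""
    return [_DEMO_LOOKUP.get(name.lower(), "not found") for name in names]


keys = {"col_1": [65783, 57637, 73456, 48577, 48575],
        "col_2": [75838, 83758, 25356, 23589, 24389]}
df_1 = pd.DataFrame({**keys, "country": [
    "UNITED STATES", "UNITED KINGDOM", "KOREA, REP. OF", "GHANA", "SURINAME"]})
df_2 = pd.DataFrame({**keys, "country": [
    "United States of America", "England", "South Korea", "Ghana", "England"]})

mismatches = matching_standardized(df_1, df_2, demo_convert)
print(mismatches[["col_1", "col_2", "country", "country_actual"]])
```

With the library installed, something like lambda names: coco.convert(names=names, to='name_short') could probably be passed as convert in place of demo_convert. I have not checked how it resolves every name, so inspect the standardized column before trusting the comparison.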

Have you tried fuzzy matching? I ran into a similar problem and came up with this: https://github.com/hansalemaos/a_pandas_ex_fuzz/blob/main/__init__.py

It works for me.
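For a rough idea of what fuzzy matching means here, below is a minimal sketch. It is not the code from the linked repository; it uses difflib from the Python standard library as a substitute technique. The function name closest_name and the 0.6 cutoff are assumptions chosen for illustration.

```python
import difflib


def closest_name(name, candidates, cutoff=0.6):
    """Return the candidate most similar to name, or None if none reaches the cutoff.

    The comparison is case-insensitive; the original spelling of the
    candidate is returned.
    """
    by_lower = {candidate.lower(): candidate for candidate in candidates}
    hits = difflib.get_close_matches(name.lower(), list(by_lower), n=1, cutoff=cutoff)
    return by_lower[hits[0]] if hits else None


reference_names = ["United States of America", "England", "South Korea", "Ghana"]

for raw in ["UNITED STATES", "GHANA", "KOREA, REP. OF", "SURINAME"]:
    print(raw, "->", closest_name(raw, reference_names))
```

Similarity based on shared characters handles differences in case and small spelling variations, but names that share little text (for example an abbreviated official name versus a common short name) may fall below the cutoff. Those would still need a manual mapping or a dedicated converter.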
