Converting countries to continents when each row has multiple countries



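I have a column of countries in which each row lists several countries. I want to convert each country to its continent. I have used country_converter before, but when I tried it in this case I got an error because each row contains more than one country.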

How can I solve this?

!pip install country_converter --upgrade
import pandas as pd
import country_converter as coco
import pycountry_convert as pc
df = pd.DataFrame()
df['country']=['United States, Canada, England', 'United Kingdom, Spain, South Korea', 'Spain', 'France, Sweden']
# CONVERT COUNTRY TO ISO COUNTRY
cc = coco.CountryConverter()
# Create a list of country names for the dataframe
country = []
for name in df['country']:
    country.append(name)

# Converting country names to ISO 3
iso_alpha = cc.convert(names = country, to='ISO3')
# CONVERT ISO COUNTRY TO CONTINENT
def country_to_continent(country_name):
    country_alpha2 = pc.country_name_to_country_alpha2(country_name)
    country_continent_code = pc.country_alpha2_to_continent_code(country_alpha2)
    country_continent_name = pc.convert_continent_code_to_continent_name(country_continent_code)
    return country_continent_name
# converting to continents
contenent=[]
for iso in iso_alpha:
    try:
        country_name = iso
        contenent.append(country_to_continent(country_name))
    except:
        contenent.append('other')
# add continents to original dataframe
df['Contenent']=contenent

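Assuming I understand correctly, you want the result written back into the DataFrame, so that each row holds the continents matching the countries in that row.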

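If so, you need to take each row, split its string so that every country can be processed on its own, and then join the results row by row before putting them back into the DataFrame.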
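The split, convert, and re-join flow can be sketched without any third-party library. The `CONTINENT_BY_COUNTRY` table and the `row_to_continents` helper are hypothetical stand-ins for a real country lookup and are here only to illustrate the flow:

```python
# Hypothetical lookup table standing in for a real country-to-continent converter.
CONTINENT_BY_COUNTRY = {
    "United States": "North America",
    "Canada": "North America",
    "United Kingdom": "Europe",
    "Spain": "Europe",
    "France": "Europe",
    "Sweden": "Europe",
    "South Korea": "Asia",
}


def row_to_continents(row, sep=", "):
    """Split one row into countries, map each one, and join the results."""
    continents = []
    for name in row.split(sep):
        # Names missing from the table (e.g. "England") fall back to "other".
        continents.append(CONTINENT_BY_COUNTRY.get(name, "other"))
    return sep.join(continents)


rows = ["United States, Canada, England", "Spain", "France, Sweden"]
converted = [row_to_continents(r) for r in rows]
print(converted)
```

Each output string has one continent per input country, so the list can be assigned straight back to a DataFrame column of the same length.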

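Notes:

  • "England" is not recognized as a country, so it is labelled "other". If you run the code in an IDE, the execution window will show a warning. I did not try to fix this.
  • CountryConverter's convert returns a plain string instead of a list when there is only one country, so the return type has to be checked.
  • I moved the function definition to the top, so the main code is at the bottom.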
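The return-type check from the notes can be factored into a small helper. `as_list` is a hypothetical name, not part of country_converter; this is a sketch that assumes the converter hands back either a single string or a list of strings:

```python
def as_list(result):
    """Return `result` as a list, wrapping a bare string in a one-item list."""
    if isinstance(result, str):
        return [result]
    return list(result)


# A single-country row yields a string; a multi-country row yields a list.
print(as_list("ESP"))
print(as_list(["FRA", "SWE"]))
```

Normalizing the result this way lets the rest of the loop treat every row the same, without a separate branch for single-country rows.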

Here is the code that worked for me:

import pandas as pd
import country_converter as coco
import pycountry_convert as pc
# CONVERT ISO COUNTRY TO CONTINENT
def country_to_continent(country_name):
    country_alpha2 = pc.country_name_to_country_alpha2(country_name)
    country_continent_code = pc.country_alpha2_to_continent_code(country_alpha2)
    country_continent_name = pc.convert_continent_code_to_continent_name(country_continent_code)
    return country_continent_name

# ------ MAIN -------
df = pd.DataFrame()
df['country']=['United States, Canada, England', 'United Kingdom, Spain, South Korea', 'Spain', 'France, Sweden']
# CONVERT COUNTRY TO ISO COUNTRY
cc = coco.CountryConverter()
# Build the converted value for each row of the dataframe
cont_list=[]
for arow in df['country']:
    # Split the row into individual country names
    country = []
    arowarr = arow.split(", ")
    for aname in arowarr:
        country.append(aname)
    #print(f'org:{arow} split:{country}')
    # Converting country names to ISO 3
    iso_alpha = cc.convert(names = country, to='ISO3')
    #print(f'iso_alpha:{iso_alpha} type:{type(iso_alpha)}')
    # converting to continents
    contenent=[]
    if isinstance(iso_alpha, str):
        # Single country: convert returned a string, not a list
        try:
            #print(f'   iso_alpha:{iso_alpha}')
            contenent.append(country_to_continent(iso_alpha))
        except Exception:
            contenent.append('other')
    else:
        for iso in iso_alpha:
            try:
                #print(f'   iso:{iso}')
                contenent.append(country_to_continent(iso))
            except Exception:
                contenent.append('other')
    # convert array back to string
    str_cont = ', '.join(contenent)
    #print(f'str_cont:{str_cont}')
    cont_list.append(str_cont)
# add continents to original dataframe
df['Contenent']=cont_list
print(f"DF Contenent: \n{df['Contenent']}")

With the help of @Ignatius Reilly, I was able to figure this out.

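I am still learning Python, so splitting the string first was the easiest approach for me to understand. Because all the countries are separated by commas, it was straightforward.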

country_split=[]
for x in df['country']:
    country_split.append(x.split(','))
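One detail worth noting: splitting on a bare comma keeps the space that follows each comma. Whether a given converter tolerates that leading space depends on the library, so stripping each piece is a cheap safeguard. A minimal sketch (the variable names are illustrative only):

```python
row = "United Kingdom, Spain, South Korea"

# Splitting on "," alone leaves a leading space on every name after the first.
raw = row.split(",")

# Stripping each piece gives clean names regardless of spacing around commas.
cleaned = [part.strip() for part in raw]

print(raw)
print(cleaned)
```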

Then I realized I could change the cc.convert target from 'ISO3' to 'Continent', which really simplifies the code.

The output contains duplicate continents, for example [America, America], so I used .map(pd.unique) to remove the duplicate values.
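For comparison, the same de-duplication can be done in plain Python. This is an alternative sketch, not the approach used in this post; `unique_in_order` is a hypothetical helper that relies on `dict` preserving insertion order (Python 3.7+):

```python
def unique_in_order(items):
    """Drop duplicates while keeping the first occurrence of each item."""
    return list(dict.fromkeys(items))


print(unique_in_order(["America", "America"]))
print(unique_in_order(["Europe", "Europe", "Asia"]))
```

It returns an ordinary list rather than a NumPy array, which can be easier to join back into a string.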

The final code is:

!pip install country_converter --upgrade
import pandas as pd
import country_converter as coco
df = pd.DataFrame()
df['country']=['United States, Canada', 'United Kingdom, Spain, South Korea', 'Spain', 'France, Sweden']
# Create a list of country names from the dataframe
country_split=[]
for x in df['country']:
    country_split.append(x.split(','))
# Converting country names to continents
cc = coco.CountryConverter()
iso_alpha_list = [cc.convert(names=name, to='Continent') for name in country_split]
# convert returns a plain string for a single-country row, so wrap it in a list
iso_alpha_list = [[c] if isinstance(c, str) else c for c in iso_alpha_list]
df['continent_split']= iso_alpha_list
# Remove duplicate continents within each row
df['continent']=df['continent_split'].map(pd.unique)
