Extracting country information from description using geograpy



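Problem: I want to extract country information from user descriptions. So far I have been trying the geograpy package. I like how it behaves when the input is ambiguous, for example for Evesham or Rochdale. However, the package interprets some strings, such as "Zaragoza, Spain", as two separate mentions, even though the user is clearly saying that the location is in Spain. I also don't know why amsterdam does not produce Holland as output. How can I improve the output? Am I missing anything important? Is there a better package for this?

Data: An example of my data is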

user_location
2  Socialist Republic of Alachua
3                Hérault, France
4                 Gwalior, India
5                Zaragoza,España
7                     amsterdam 
8                        Evesham
9                       Rochdale

I would like to get something like this:

user_location country
2  Socialist Republic of Alachua ['USSR', 'United States']
3                Hérault, France ['France']
4                 Gwalior, India ['India'] 
5                Zaragoza,España ['Spain']
7                     amsterdam  ['Holland']
8                        Evesham ['United Kingdom']
9                       Rochdale ['United Kingdom', 'United States']

Reprex:

import pandas as pd
import geograpy  # the geograpy3 package is imported under the module name "geograpy"
df = pd.DataFrame.from_dict({'user_location': {2: 'Socialist Republic of Alachua', 3: 'Hérault, France', 4: 'Gwalior, India', 5: 'Zaragoza,España', 7: 'amsterdam ', 8: 'Evesham', 9: 'Rochdale'}})
df['country'] = df['user_location'].apply(lambda x: geograpy.get_place_context(text=x).countries if pd.notnull(x) else x)
print(df)
#>                    user_location                                            country
#> 2  Socialist Republic of Alachua  [USSR, Union of Soviet Socialist Republics, Al...
#> 3                Hérault, France                                  [France, Hérault]
#> 4                 Gwalior, India   [British Indian Ocean Territory, Gwalior, India]
#> 5                Zaragoza,España             [Zaragoza, España, Spain, El Salvador]
#> 7                     amsterdam                                                  []
#> 8                        Evesham                          [Evesham, United Kingdom]
#> 9                       Rochdale          [Rochdale, United Kingdom, United States]

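Created on 2020-06-02 by the reprexpy package

geograpy3 was no longer behaving correctly for country lookups, because it did not check whether pycountry had returned None. As a committer, I have just fixed this. I added your example, slightly modified to avoid the pandas import, as a unit test case: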

# Method of a unittest.TestCase subclass; needs "import geograpy" at module level.
def testStackoverflow62152428(self):
    '''
    see https://stackoverflow.com/questions/62152428/extracting-country-information-from-description-using-geograpy?noredirect=1#comment112899776_62152428
    '''
    examples={2: 'Socialist Republic of Alachua', 3: 'Hérault, France', 4: 'Gwalior, India', 5: 'Zaragoza,España', 7: 'amsterdam ', 8: 'Evesham', 9: 'Rochdale'}
    for index,text in examples.items():
        places=geograpy.get_geoPlace_context(text=text)
        print("example %d: %s" % (index,places.countries))

The result is now:

example 2: ['United States']
example 3: ['France']
example 4: ['British Indian Ocean Territory', 'India']
example 5: ['Spain', 'El Salvador']
example 7: []
example 8: ['United Kingdom']
example 9: ['United Kingdom', 'United States']
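
The fix described above amounts to guarding a lookup that may return None. The following is a minimal sketch of that pattern only, not the actual geograpy3 code: `KNOWN_COUNTRIES`, `lookup_country` and `countries_from_candidates` are hypothetical names, and the small table stands in for the pycountry lookup.

```python
from typing import Dict, List, Optional

# Hypothetical stand-in for a pycountry-style lookup table (illustration only).
KNOWN_COUNTRIES: Dict[str, str] = {
    "france": "France",
    "india": "India",
    "spain": "Spain",
    "united kingdom": "United Kingdom",
}


def lookup_country(name: str) -> Optional[str]:
    """Return the canonical country name, or None if the name is unknown."""
    return KNOWN_COUNTRIES.get(name.strip().lower())


def countries_from_candidates(candidates: List[str]) -> List[str]:
    """Keep only candidates that resolve to a country, skipping None results."""
    countries: List[str] = []
    for candidate in candidates:
        country = lookup_country(candidate)
        # Without this check, unresolved place names would leak into the result.
        if country is not None and country not in countries:
            countries.append(country)
    return countries


print(countries_from_candidates(["Hérault", "France"]))  # ['France']
```

With the check in place, a region name such as Hérault no longer shows up in the list of countries.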

There is indeed room for improvement, e.g. for example 5. I have opened an issue at https://github.com/somnathrakshit/geograpy3/issues/7 - please stay tuned...
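
In the meantime, one possible workaround for inputs such as `Zaragoza,España` is to trust the part after the last comma whenever it names a country. The sketch below is a heuristic resting on that assumption and is not part of geograpy3: `COUNTRY_ALIASES`, `country_from_suffix` and `refine` are hypothetical names, and the alias table only covers the examples from the question.

```python
from typing import Dict, List, Optional

# Hypothetical alias table; a real one would need every country and spelling.
COUNTRY_ALIASES: Dict[str, str] = {
    "españa": "Spain",
    "spain": "Spain",
    "france": "France",
    "india": "India",
}


def country_from_suffix(text: str) -> Optional[str]:
    """Return the country named after the last comma, if there is one."""
    if "," not in text:
        return None
    suffix = text.rsplit(",", 1)[1].strip().lower()
    return COUNTRY_ALIASES.get(suffix)


def refine(text: str, detected: List[str]) -> List[str]:
    """Prefer an explicit ', <country>' suffix over the detected list."""
    explicit = country_from_suffix(text)
    return [explicit] if explicit is not None else detected


# "detected" would come from places.countries in the unit test above.
print(refine("Zaragoza,España", ["Spain", "El Salvador"]))  # ['Spain']
print(refine("Rochdale", ["United Kingdom", "United States"]))  # unchanged
```

Ambiguous inputs without a comma, such as Rochdale, keep every candidate, which preserves the behavior the question liked.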
