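I tried a simple demo to check whether geograpy can do what I need: finding the country name and ISO code in non-normalized addresses (which is basically what geograpy is meant for!).
The problem is that, in the tests I ran, geograpy finds several countries for each address, including the right one in most cases, but I cannot find any parameter that tells me which country is the most "correct" one.
The list of fake addresses I used, which should reflect the kind of input to be analyzed in reality, is this: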
- John Doe 115 Huntington Terrace Newark, New York 07112 Stati Uniti
- John Doe 160 Huntington Terrace Newark, New York 07112 United States of America
- John Doe 30 Huntington Terrace Newark, New York 07112 USA
- John Doe 22 Huntington Terrace Newark, New York 07112 US
- Mario Bianchi, Via Nazionale 256, 00148 Roma (RM) Italia
- Mario Bianchi, Via Nazionale 256, 00148 Roma (RM) Italy
This is the simple code I wrote:
import geograpy

ind = ["John Doe 115 Huntington Terrace Newark, New York 07112 Stati Uniti",
       "John Doe 160 Huntington Terrace Newark, New York 07112 United States of America",
       "John Doe 30 Huntington Terrace Newark, New York 07112 USA",
       "John Doe 22 Huntington Terrace Newark, New York 07112 US",
       "Mario Bianchi, Via Nazionale 256, 00148 Roma (RM) Italia",
       "Mario Bianchi, Via Nazionale 256, 00148 Roma (RM) Italy"]
locator = geograpy.locator.Locator()
for address in ind:
    places = geograpy.get_place_context(text=address)
    print(address)
    #print(places)
    for country in places.countries:
        print("Country:"+country+", IsoCode:"+locator.getCountry(name=country).iso)
    print()
This is the output:
John Doe 115 Huntington Terrace Newark, New York 07112 Stati Uniti
Country:United Kingdom, IsoCode:GB
Country:Jamaica, IsoCode:JM
Country:United States, IsoCode:US
John Doe 160 Huntington Terrace Newark, New York 07112 United States of America
Country:United States, IsoCode:US
Country:United Kingdom, IsoCode:GB
Country:Netherlands, IsoCode:NL
Country:Jamaica, IsoCode:JM
Country:Argentina, IsoCode:AR
John Doe 30 Huntington Terrace Newark, New York 07112 USA
Country:United Kingdom, IsoCode:GB
Country:Jamaica, IsoCode:JM
Country:United States, IsoCode:US
John Doe 22 Huntington Terrace Newark, New York 07112 US
Country:United Kingdom, IsoCode:GB
Country:Jamaica, IsoCode:JM
Country:United States, IsoCode:US
Mario Bianchi, Via Nazionale 256, 00148 Roma (RM) Italia
Country:Australia, IsoCode:AU
Country:Sweden, IsoCode:SE
Country:United States, IsoCode:US
Mario Bianchi, Via Nazionale 256, 00148 Roma (RM) Italy
Country:Italy, IsoCode:IT
Country:Australia, IsoCode:AU
Country:Sweden, IsoCode:SE
Country:United States, IsoCode:US
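First of all, the biggest problem is that for the Italian address ending in "Italia" (number 4) the correct country (Italia/Italy) is not found at all, and I have no idea where the three countries that were found come from.
In most cases it finds wrong countries in addition to the correct one, and I have no indicator such as a confidence percentage, a distance, or anything else I could use to decide whether a country that was found is an acceptable answer and which of several results is probably the "best" one.
I apologize in advance: I have not had time to study geograpy3 in depth and I do not know whether this is a silly question, but I found nothing about confidence/probability/distance in the documentation.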
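I am answering as a committer of geograpy3.
You seem to be using the legacy interface of geograpy version 1 repeatedly in a first step and only then using the locator. For your use case, the improved locator interface might be the more reasonable choice. That interface can use extra information such as population or GDP per capita to find the "most likely" country when disambiguating.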
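To illustrate the idea of disambiguating by population, here is a minimal sketch. It is not geograpy3's actual implementation: the `Candidate` class, the `most_likely` helper and the population figures are assumptions made up for this example, and the figures are only rough placeholders.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    """Hypothetical record for one place that matches a name."""
    name: str
    country_iso: str
    population: int  # rough placeholder figure, not taken from geograpy's database


def most_likely(candidates: List[Candidate]) -> Optional[Candidate]:
    """Return the candidate with the largest population, or None if the list is empty."""
    if not candidates:
        return None
    return max(candidates, key=lambda c: c.population)


# Several places are called "Newark"; the populations are approximate placeholders.
newarks = [
    Candidate("Newark (New Jersey)", "US", 300_000),
    Candidate("Newark-on-Trent", "GB", 30_000),
    Candidate("Newark (Delaware)", "US", 30_000),
]

best = most_likely(newarks)
print(best.name, best.country_iso)
```

Picking the most populous match is only a heuristic. It favours large cities, so it will be wrong for addresses in small towns that share a name with a bigger place.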
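The Stati Uniti/United States and Italia/Italy problem is a language issue. See the long-standing open issue https://github.com/ushahidi/geograpy/issues/23 of geograpy version 1. As of today there seems to be no corresponding issue in geograpy3 yet, so feel free to file one if you need an improvement.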
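Until localized country names are supported, one possible workaround is to map them to English before the address is analyzed. The sketch below assumes a small hand-maintained alias table; `COUNTRY_ALIASES` and `normalize_country_names` are hypothetical names for this example and are not part of geograpy.

```python
import re

# Hypothetical, hand-maintained alias table; extend it for the languages you expect.
COUNTRY_ALIASES = {
    "stati uniti": "United States",
    "italia": "Italy",
}


def normalize_country_names(address: str) -> str:
    """Replace known localized country names with their English form."""
    result = address
    for alias, english in COUNTRY_ALIASES.items():
        # \b keeps "italia" from matching inside a longer word such as "Italian";
        # the match ignores case.
        pattern = re.compile(r"\b" + re.escape(alias) + r"\b", re.IGNORECASE)
        result = pattern.sub(english, result)
    return result


print(normalize_country_names(
    "Mario Bianchi, Via Nazionale 256, 00148 Roma (RM) Italia"))
# prints: Mario Bianchi, Via Nazionale 256, 00148 Roma (RM) Italy
```

The normalized string can then be passed to whichever geograpy call you use. A plain lookup table like this only covers the names you list, so it is a stopgap and not a general translation step.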
I added your example to test_locator.py in the geograpy3 project to show the difference in concept:
def testStackOverflow64379688(self):
    '''
    compare old and new geograpy interface
    '''
    examples=['John Doe 160 Huntington Terrace Newark, New York 07112 United States of America',
              'John Doe 30 Huntington Terrace Newark, New York 07112 USA',
              'John Doe 22 Huntington Terrace Newark, New York 07112 US',
              'Mario Bianchi, Via Nazionale 256, 00148 Roma (RM) Italia',
              'Mario Bianchi, Via Nazionale 256, 00148 Roma (RM) Italy',
              'Newark','Rome']
    for example in examples:
        city=geograpy.locateCity(example,debug=False)
        print(city)
Result:
None
None
None
None
None
Newark (US-NJ(New Jersey) - US(United States))
Rome (IT-62(Latium) - IT(Italy))
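Since the result above resolves the bare names 'Newark' and 'Rome' but returns None for the full addresses, one conceivable workaround is to split an address into short word groups and look each of them up. This is only a sketch under that assumption: `name_fragments` and `locate_in_address` are hypothetical helpers, and the lookup is passed in as a callable so the example runs without geograpy installed.

```python
import re
from typing import Callable, List, Optional


def name_fragments(address: str, max_words: int = 2) -> List[str]:
    """Return word groups of up to max_words words, longest groups first."""
    # Keep only purely alphabetic words; house numbers and postal codes are dropped.
    words = re.findall(r"[^\W\d_]+", address)
    fragments = []
    for size in range(max_words, 0, -1):
        for start in range(len(words) - size + 1):
            fragments.append(" ".join(words[start:start + size]))
    return fragments


def locate_in_address(address: str,
                      lookup: Callable[[str], Optional[str]]) -> List[str]:
    """Look up every fragment and collect the distinct results that are not None."""
    found = []
    for fragment in name_fragments(address):
        hit = lookup(fragment)
        if hit is not None and hit not in found:
            found.append(hit)
    return found


# Stand-in lookup table so the sketch is self-contained.
KNOWN = {"Newark": "Newark (US-NJ)", "Rome": "Rome (IT-62)"}
print(locate_in_address(
    "John Doe 22 Huntington Terrace Newark, New York 07112 US", KNOWN.get))
# prints: ['Newark (US-NJ)']
```

In practice the `lookup` argument could be a function such as `geograpy.locateCity` from the test above. Be aware that looking up every fragment can produce extra matches (for example street names or personal names that are also city names), so the results would still need to be disambiguated.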