How to parse unstructured UK addresses into address components in Python



I found this library, which parses unstructured US addresses into address components in Python: https://usaddress.readthedocs.io/en/latest/

Is there a similar library for UK addresses?

Guildford Cathedral Enterprises Limited The Cathedral Church Of The Holy Spirit, Stag Hill The Chase GU2 7UP

I have a pretty bad solution.

address_types = {
    "address_line_1": ["street_number", "route", "subpremise", "street_address"],
    "address_line_2": ["neighborhood", "sublocality", "sublocality_level_1", "sublocality_level_2",
                       "sublocality_level_3", "sublocality_level_4", "sublocality_level_5"],
    "town": ["locality", "postal_town"],
    "county": ["administrative_area_level_2", "administrative_area_level_3"],
    "postcode": ["postal_code"],
}

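The lists are the component types that Google's geolocation API returns in its raw address components, roughly mapped to the dictionary keys they correspond to.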

If it's a business, you can use a place search to find the place ID and then look up the address:

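# gmaps is a googlemaps.Client instance; company and raw_add are the input strings.
results = gmaps.find_place(f"{company} + {raw_add}", "textquery",
                           fields=["name", "place_id", "types", "formatted_address"])
google_address = results["candidates"][0]["formatted_address"]
place_id = results["candidates"][0]["place_id"]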

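This only gives you the raw address (i.e. a string), but you now have the place ID and the company name. You can use fuzzywuzzy to compare the returned name and address against your own name and raw_address, with a set threshold, to confirm you have the right place.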
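The threshold check described above could look roughly like the sketch below. To keep it dependency-free it swaps fuzzywuzzy for the standard library's difflib.SequenceMatcher, which also yields a similarity ratio; the helper names and the threshold of 80 are assumptions for illustration, not values from the original answer.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> int:
    """Return a 0-100 similarity score, ignoring case and extra whitespace."""
    norm_a = " ".join(a.lower().split())
    norm_b = " ".join(b.lower().split())
    return round(100 * SequenceMatcher(None, norm_a, norm_b).ratio())


def is_confident_match(company: str, raw_add: str, candidate: dict,
                       threshold: int = 80) -> bool:
    """Accept a candidate only if both its name and its address clear the threshold.

    `candidate` is one entry of results["candidates"]; the threshold is a
    hypothetical starting point that should be tuned on real data.
    """
    name_score = similarity(company, candidate["name"])
    address_score = similarity(raw_add, candidate["formatted_address"])
    return name_score >= threshold and address_score >= threshold
```

Requiring both scores to pass is the stricter choice; with messy user input it may be more practical to require only one of them, or to use a lower threshold for the address.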

Or additionally add a check on the place types:

https://developers.google.com/maps/documentation/places/web-service/supported_types

See Table 1.
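A minimal sketch of such a type check, assuming each candidate carries a "types" list as requested in the find_place call. The function name and the sample candidate are made up for illustration; the set of expected types is something you would choose yourself from Table 1.

```python
def has_expected_type(candidate: dict, expected_types: set) -> bool:
    """Return True if the candidate shares at least one type with expected_types."""
    return bool(expected_types & set(candidate.get("types", [])))


# Made-up candidate for illustration only.
candidate = {
    "name": "Foo Bar Cathedral",
    "types": ["church", "place_of_worship", "point_of_interest", "establishment"],
}
looks_right = has_expected_type(candidate, {"church", "place_of_worship"})
```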

Once you've confirmed the match, you can do:

place = gmaps.place(place_id, fields=["address_component"])

Yes, that's two API calls, boo.

place["result"]["address_components"] is structured like this:

[
    {"long_name": "1 foo bar lane", "short_name": "1 foo bar ln", "types": ["street_address", ...]},
    {"long_name": "foo barton", "short_name": "foo barton", "types": ["postal_town"]},
    {"long_name": "FO0 8AR", "short_name": "FO0 8AR", "types": ["postal_code"]},
]

Then you can match the components to your own fields however you see fit.
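One possible way to do that matching, sketched under the assumption that each component has a "long_name" and a "types" list as in Google's responses. The function name is hypothetical, the mapping mirrors the address_types dictionary from earlier, and the sample components are made up.

```python
ADDRESS_TYPES = {
    "address_line_1": ["street_number", "route", "subpremise", "street_address"],
    "address_line_2": ["neighborhood", "sublocality", "sublocality_level_1", "sublocality_level_2",
                       "sublocality_level_3", "sublocality_level_4", "sublocality_level_5"],
    "town": ["locality", "postal_town"],
    "county": ["administrative_area_level_2", "administrative_area_level_3"],
    "postcode": ["postal_code"],
}


def components_to_fields(components: list) -> dict:
    """Group address components into the fields of ADDRESS_TYPES.

    Components that land in the same field are joined with a space, in the
    order they appear in the response. Unmatched components are ignored.
    """
    collected = {field: [] for field in ADDRESS_TYPES}
    for component in components:
        component_types = set(component["types"])
        for field, wanted in ADDRESS_TYPES.items():
            if component_types & set(wanted):
                collected[field].append(component["long_name"])
                break  # each component feeds at most one field
    return {field: " ".join(parts) for field, parts in collected.items()}


# Made-up components for illustration only.
sample = [
    {"long_name": "1", "short_name": "1", "types": ["street_number"]},
    {"long_name": "Foo Bar Lane", "short_name": "Foo Bar Ln", "types": ["route"]},
    {"long_name": "Foo Barton", "short_name": "Foo Barton", "types": ["postal_town"]},
    {"long_name": "FO0 8AR", "short_name": "FO0 8AR", "types": ["postal_code"]},
]
fields = components_to_fields(sample)
```

Fields with no matching component come back as empty strings, so the result always has the same five keys.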

You could also give https://deepparse.org/ a try, but I found the results didn't suit my dataset, which is the devil (user input).

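Company, Address line 1, Address line 2, Town, County

I almost always split on "," and use a regular expression to confirm the postcode. The first element goes to the company and the last to the postcode (checked against the regex); the remaining fields are then assigned depending on how many elements are left: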

split_add = [part.strip() for part in raw_address.split(",")]
company = split_add[0]
postcode = split_add[-1]
# Default the optional fields so they always exist.
address_line_2 = town = county = ""
left_overs = len(split_add[1:-1])
if left_overs == 3:
    address_line_2 = split_add[1]
    town = split_add[2]
    county = split_add[3]
elif left_overs == 2:
    town = split_add[1]
    county = split_add[2]
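The regex confirmation of the postcode is not shown above, so here is a minimal sketch of what it might look like. The pattern is a simplified approximation of the UK postcode shape (one or two letters, a digit, an optional letter or digit, then a digit and two letters); it is an assumption rather than the author's pattern, and it does not check that the postcode actually exists.

```python
import re

# Simplified UK postcode shape, e.g. "GU2 7UP" or "SW1A 1AA". Not exhaustive.
POSTCODE_RE = re.compile(r"[A-Z]{1,2}[0-9][A-Z0-9]? ?[0-9][A-Z]{2}")


def confirm_postcode(candidate: str):
    """Return the upper-cased, whitespace-normalised postcode, or None if it doesn't fit."""
    cleaned = " ".join(candidate.upper().split())
    return cleaned if POSTCODE_RE.fullmatch(cleaned) else None
```

If confirm_postcode returns None for the last element, that is a hint the address was not comma-separated in the expected order and needs a different treatment.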

If you really must fill in all the fields:

I made dirty_phil, named after someone I once knew.

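from collections import OrderedDict


def dirty_phil(add_dict):
    """
    Fills the blank fields with duplicate data from the other fields.
    Returns: an OrderedDict of the four address fields, where each blank
    field is copied from the nearest non-blank field before it.
    """
    fields_order = ["address_line_1", "address_line_2", "town", "county"]
    last_val = ""
    values = [v for k, v in add_dict.items() if v.strip() and k in fields_order]
    new_dict = OrderedDict({})
    # First pass: create every key in order, packing non-blank values to the front.
    for i, field in enumerate(fields_order):
        try:
            new_dict[field] = values[i]
        except IndexError:
            new_dict[field] = ""
    # Second pass: overwrite each key, forward-filling blanks from the previous field.
    for field in fields_order:
        if not add_dict[field] and last_val:
            new_dict[field] = last_val
        else:
            new_dict[field] = add_dict[field]
        last_val = new_dict[field]
    return new_dict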
