将澳大利亚地址拆分为street_address、郊区、州和邮政编码



我曾从一个网站上抓取地址,但它们的格式不一致,例如:

address = '139 McKinnon Road, PINELANDS, NT, 829'
address_2 = '108 East Point Road, Fannie Bay, NT, 820'
address_3 = '3-11 Hamilton Street, Townsville City, QLD, 4810'

我已经尝试通过空间' '来分割它们,但无法得到所需的结果。

I have try:

if "," in address:
raw_address = address.split(",")
splitted_address = [
adr for adr in raw_address if not adr.islower() and not adr.isupper()
]
splitted_suburb = [adr for adr in raw_address if adr.isupper()]
item["Street_Address"] = splitted_address[0].strip()
item["Suburb"] = splitted_address[1].strip()
item["State"] = splitted_suburb[0].strip()
item["Postcode"] = splitted_address[2].strip()
else:
raw_address = address.split(" ")
splitted_address = [
adr for adr in raw_address if not adr.islower() and not adr.isupper()
]
splitted_suburb = [adr for adr in raw_address if adr.isupper()]
item["Street_Address"] = " ".join(splitted_address[:-1])
item["Suburb"] = splitted_suburb[0]
item["State"] = splitted_suburb[1]
item["Postcode"] = splitted_address[-1]

我想要的输出应该是这样的:

Street_Address,Suburb,State,Postcode
Units 1-14, 29 Wiltshire Lane, DELACOMBE, VIC, 3356

如何将完整地址拆分为这些特定字段?

更新:我已经使用正则表达式模式解析出所需的字段:

regex_str = "(^.*?(?:Lane|Street|Boulevard|Crescent|Place|Road|Highway|Avenue|Drive|Circuit|Parade|Telopea|Nicklin Way|Terrace|Square|Court|Close|Endeavour Way|Esplanade|East|The Centreway|Mall|Quay|Gateway|Low Way|Point|Rd|Morinda|Way|Ave|St|South Steyne|Broadway|HQ|Expressway|Strett|Castlereagh|Meadow Way|Track|Kulkyne Way|Narabang Way|Bank)),? ?(.*?),? ?([A-Z]{3}),? ?(d{,4})$"
matches = re.search(regex_str, full_address)
street, suburb, state, postcode = matches.groups()
item["Street_Address"] = street
item["Suburb"] = suburb
item["State"] = state
item["Postcode"] = postcode

它适用于某些地址,如address_3,但与address_1, address_2这个模式不工作,我得到None类型错误:

File "colliers_sale.py", line 164, in parse_details
street, suburb, state, postcode = matches.groups()
AttributeError: 'NoneType' object has no attribute 'groups'

我该如何解决这个问题?

你可以使用regular expression,但可能需要多个模式,像这样:

import re
match = None
if (match := re.search( r'(.*?d+-d+),? (.+?) ([A-Z ]+) ([A-Z]+) (d+)$', address)):
pass # this match address, address_3, address_4
elif (match := re.search(r'(d+-d+) (.+?), (.+?), ([A-Z]+), (d+)$', address)):
pass # this match address_2
# elif (...another pattern...)
if match:
print( match[1], match[2], match[3], match[4], match[5], sep=' # ')
else:
print( 'nothing match')

尝试're'包。你可以使用像这样的正则表达式

import re 
address = 'Units 1-14, 29 Wiltshire Lane DELACOMBE VIC 3356'
address_2 = '3-11 Hamilton Street, Townsville City, QLD, 4810'
address_3 = '6-10 Mount Street MOUNT DRUITT NSW 2770'
address_4 = '34-36 Fairfield Street FAIRFIELD EAST NSW 2165'
addresses = [address, address_2, address_3, address_4]
for add in addresses:
print(', '.join(re.findall(r"(.*d+-d+)[, ]+(w*s*w+s+w+)[, ]+(w*s*w+)[, ]+(w+)[, ]+(d+)", add)[0]))

re.findall模式部分中的括号将帮助您捕获所需的部分。

最新更新