I scraped addresses from a website, but their formats are inconsistent, for example:
address = '139 McKinnon Road, PINELANDS, NT, 829'
address_2 = '108 East Point Road, Fannie Bay, NT, 820'
address_3 = '3-11 Hamilton Street, Townsville City, QLD, 4810'
I tried splitting them on the space character ' ', but could not get the desired result.
Here is what I tried:
if "," in address:
    raw_address = address.split(",")
    splitted_address = [
        adr for adr in raw_address if not adr.islower() and not adr.isupper()
    ]
    splitted_suburb = [adr for adr in raw_address if adr.isupper()]
    item["Street_Address"] = splitted_address[0].strip()
    item["Suburb"] = splitted_address[1].strip()
    item["State"] = splitted_suburb[0].strip()
    item["Postcode"] = splitted_address[2].strip()
else:
    raw_address = address.split(" ")
    splitted_address = [
        adr for adr in raw_address if not adr.islower() and not adr.isupper()
    ]
    splitted_suburb = [adr for adr in raw_address if adr.isupper()]
    item["Street_Address"] = " ".join(splitted_address[:-1])
    item["Suburb"] = splitted_suburb[0]
    item["State"] = splitted_suburb[1]
    item["Postcode"] = splitted_address[-1]
The output I want should look like this:
Street_Address,Suburb,State,Postcode
Units 1-14, 29 Wiltshire Lane, DELACOMBE, VIC, 3356
How can I split a full address into these specific fields?
Update: I have used a regular expression pattern to parse out the required fields:
regex_str = r"(^.*?(?:Lane|Street|Boulevard|Crescent|Place|Road|Highway|Avenue|Drive|Circuit|Parade|Telopea|Nicklin Way|Terrace|Square|Court|Close|Endeavour Way|Esplanade|East|The Centreway|Mall|Quay|Gateway|Low Way|Point|Rd|Morinda|Way|Ave|St|South Steyne|Broadway|HQ|Expressway|Strett|Castlereagh|Meadow Way|Track|Kulkyne Way|Narabang Way|Bank)),? ?(.*?),? ?([A-Z]{3}),? ?(\d{,4})$"
matches = re.search(regex_str, full_address)
street, suburb, state, postcode = matches.groups()
item["Street_Address"] = street
item["Suburb"] = suburb
item["State"] = state
item["Postcode"] = postcode
It works for some addresses, such as address_3, but the pattern does not work for address_1 and address_2, where I get a NoneType error:
File "colliers_sale.py", line 164, in parse_details
street, suburb, state, postcode = matches.groups()
AttributeError: 'NoneType' object has no attribute 'groups'
How can I fix this?
You can use a regular expression, but you may need more than one pattern, like this:
import re
match = None
if (match := re.search(r'(.*?\d+-\d+),? (.+?) ([A-Z ]+) ([A-Z]+) (\d+)$', address)):
    pass  # this matches address, address_3, address_4
elif (match := re.search(r'(\d+-\d+) (.+?), (.+?), ([A-Z]+), (\d+)$', address)):
    pass  # this matches address_2
# elif (...another pattern...)
if match:
    print(match[1], match[2], match[3], match[4], match[5], sep=' # ')
else:
    print('nothing matched')
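As for the AttributeError in the question: re.search returns None when the pattern does not match, and the state group [A-Z]{3} requires exactly three capital letters, so a two-letter code such as NT cannot match. Below is a minimal sketch, assuming the comma-separated "street, suburb, STATE, postcode" layout of the three addresses in the question. The pattern and the helper name split_address are illustrative and not taken from the original code.

```python
import re

# Assumed layout: "street, suburb, STATE, postcode", separated by commas.
# [A-Z]{2,3} accepts both two-letter (NT) and three-letter (QLD) state codes.
PATTERN = re.compile(r"^(.+),\s*([^,]+),\s*([A-Z]{2,3}),\s*(\d{3,4})$")


def split_address(full_address):
    """Return (street, suburb, state, postcode), or None if nothing matches."""
    match = PATTERN.search(full_address.strip())
    if match is None:  # guard before calling .groups()
        return None
    return tuple(part.strip() for part in match.groups())


for text in [
    "139 McKinnon Road, PINELANDS, NT, 829",
    "108 East Point Road, Fannie Bay, NT, 820",
    "3-11 Hamilton Street, Townsville City, QLD, 4810",
]:
    print(split_address(text))
```

Because the street group is greedy, any extra commas (as in "Units 1-14, 29 Wiltshire Lane") stay inside the street part, and only the last three commas act as field separators.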
Try the 're' package. You can use a regular expression like this:
import re
address = 'Units 1-14, 29 Wiltshire Lane DELACOMBE VIC 3356'
address_2 = '3-11 Hamilton Street, Townsville City, QLD, 4810'
address_3 = '6-10 Mount Street MOUNT DRUITT NSW 2770'
address_4 = '34-36 Fairfield Street FAIRFIELD EAST NSW 2165'
addresses = [address, address_2, address_3, address_4]
for add in addresses:
    print(', '.join(re.findall(r"(.*\d+-\d+)[, ]+(\w*\s*\w+\s+\w+)[, ]+(\w*\s*\w+)[, ]+(\w+)[, ]+(\d+)", add)[0]))
The parentheses in the re.findall pattern are capture groups; they let you pull out the parts you need.
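Capture groups can also be named, so that each captured part maps directly onto a field. The sketch below assumes the suburb is written in capitals, as in the space-separated sample addresses above; a mixed-case suburb such as "Townsville City" would need a different pattern. The pattern and the helper name parse_address are hypothetical, chosen only for illustration.

```python
import re

# Assumption: the suburb is written in capitals, possibly as several words.
NAMED = re.compile(
    r"^(?P<Street_Address>.+?)[, ]+"
    r"(?P<Suburb>[A-Z]+(?: [A-Z]+)*)[, ]+"
    r"(?P<State>[A-Z]{2,3})[, ]+"
    r"(?P<Postcode>\d{3,4})$"
)


def parse_address(text):
    """Return a dict of the four named fields, or None when there is no match."""
    match = NAMED.search(text)
    return match.groupdict() if match else None


item = parse_address("6-10 Mount Street MOUNT DRUITT NSW 2770")
print(item)
```

Match.groupdict() returns the named groups as a dictionary, so the keys can be chosen to line up with the item fields used in the question.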