Optimizing a function that cleans address data



I am cleaning address data entered by users.

data = {'address':['211 S. 10TH AVE APT 4', 
'11095 FRAZIER DR', 
'1020 BLUEBERRY CT SE ,', 
'7614 202 AVE E',
'8013 SO. ALASKA ST.',
'529 GOLDENTEMPLE PL', '123 LOVE BIRD CT'
]}

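What I want to do is:

  1. Strip leading and trailing whitespace, while keeping a single space between the original tokens.
  2. Convert the text to proper (title) case.
  3. Spell out abbreviated street names.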

With my first approach I managed to achieve the first two goals:

import re  # use regex

def nospecial(address_text):
    # remove non-alphanumeric characters but keep spaces
    address_text = re.sub("[^a-zA-Z0-9 ]+", "", address_text)
    # strip leading and trailing white space and change to proper case
    address_text = address_text.strip().title()
    return address_text

I thought a for loop would work for my third goal, so I modified the above to:

def st_suffix():
    return {'Dr': 'Drive',
            'Rd': 'Road', 'Blvd': 'Boulevard',
            'St': 'Street', 'Ste': 'Suite',
            'Apts': 'Apartments', 'Apt': 'Apartment',
            'Ct': 'Court', 'Cir': 'Circle'}

def nospecial(address_text):
    import re  # use regex
    abbv = st_suffix()  # get dict
    address_text = re.sub("[^a-zA-Z0-9 ]+", "", address_text)  # remove non-alphanumeric characters but keep spaces
    address_text = address_text.strip().title()  # strip leading and trailing white space and change to proper case
    for suffix in address_text:  # go through my address text and search for abbreviated keys above and spell out
        rep = abbv[address_text] if address_text in abbv.keys() else address_text[suffix]  # check dict
    return address_text

With this last version I get a TypeError: string indices must be integers. I think I made a mistake on the for loop line, but I'm not sure. Please help. Thanks.

You can use:

import pandas as pd
data = {'address':['211 S. 10TH AVE APT 4', 
'11095 FRAZIER DR', 
'1020 BLUEBERRY CT SE ,', 
'7614 202 AVE E',
'8013 SO. ALASKA ST.',
'529 GOLDENTEMPLE PL', '123 LOVE BIRD CT'
]}
df=pd.DataFrame(data)
d = {r'\bDr\b\.?': 'Drive',
     r'\bRd\b\.?': 'Road', r'\bBlvd\b\.?': 'Boulevard',
     r'\bSt\b\.?': 'Street', r'\bSte\b\.?': 'Suite',
     r'\bApts\b\.?': 'Apartments', r'\bApt\b\.?': 'Apartment',
     r'\bCt\b\.?': 'Court', r'\bCir\b\.?': 'Circle'}
df['address'] = df['address'].str.split().str.join(' ').str.title().replace(d, regex=True)

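Notes:

  • d is a dictionary with regular expressions as keys and replacements as values; \b denotes a word boundary and \.? matches an optional dot character
  • .str.split().str.join(' ') removes leading/trailing whitespace and leaves exactly one space between the non-whitespace chunks of the string
  • .str.title() converts the string to title case
  • .replace(d, regex=True) replaces each regex match with the corresponding value from the d dictionary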
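
To see what the word boundary and the optional dot do on their own, here is a minimal sketch using only the standard `re` module. The helper name `expand_st` and the sample strings other than the first are made up for illustration; the pattern follows the same style as the keys of d.

```python
import re

# Word boundary, the abbreviation, another word boundary,
# then an optional literal dot (same style as the keys of d).
ST_PATTERN = re.compile(r'\bSt\b\.?')

def expand_st(text):
    """Hypothetical helper: replace a standalone 'St' or 'St.' with 'Street'."""
    return ST_PATTERN.sub('Street', text)

# Illustrative inputs: only whole-word 'St' (with or without a dot) matches.
samples = ['8013 So. Alaska St.', '12 Main St', '5 Stone Ave', '9 Ste 4']
results = [expand_st(s) for s in samples]

for before, after in zip(samples, results):
    print(f'{before!r} -> {after!r}')
```

Because of the `\b` after `St`, words such as `Stone` or `Ste` are left alone, which is why the dictionary can hold both `St` and `Ste` (or `Apt` and `Apts`) without one swallowing the other.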

Output:

>>> df['address'].str.split().str.join(' ').str.title().replace(d, regex=True)
0    211 S. 10Th Ave Apartment 4
1            11095 Frazier Drive
2      1020 Blueberry Court Se ,
3                 7614 202 Ave E
4         8013 So. Alaska Street
5            529 Goldentemple Pl
6            123 Love Bird Court
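
As for the TypeError in the question: iterating over a string yields single characters, so indexing the string with one of those characters (a `str`, not an `int`) raises it. If you would rather stay close to the original plain-Python function, one possible sketch is to split the cleaned text into words and look each word up in the dictionary. The names `clean_address` and `ABBREVIATIONS` are chosen here for illustration only. Note that, unlike the pandas version above, this keeps the question's regex and therefore drops punctuation such as dots and commas.

```python
import re

# Abbreviation table from the question (keys already in title case).
ABBREVIATIONS = {
    'Dr': 'Drive', 'Rd': 'Road', 'Blvd': 'Boulevard',
    'St': 'Street', 'Ste': 'Suite',
    'Apts': 'Apartments', 'Apt': 'Apartment',
    'Ct': 'Court', 'Cir': 'Circle',
}

def clean_address(address_text):
    """Drop punctuation, normalise spacing and case, expand abbreviations."""
    # Remove everything except letters, digits and spaces.
    address_text = re.sub(r'[^a-zA-Z0-9 ]+', '', address_text)
    # title() gives proper case; split() with no arguments discards
    # leading/trailing whitespace and collapses runs of spaces.
    words = address_text.title().split()
    # Replace each whole word found in the table; keep the others unchanged.
    return ' '.join(ABBREVIATIONS.get(word, word) for word in words)

print(clean_address('  8013 SO. ALASKA ST.  '))
print(clean_address('211 S. 10TH AVE APT 4'))
```

Looking up whole words with `dict.get` avoids indexing the string at all, and it cannot turn part of a longer word into a street suffix.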
