I am cleaning up user-entered address data.
data = {'address':['211 S. 10TH AVE APT 4',
'11095 FRAZIER DR',
'1020 BLUEBERRY CT SE ,',
'7614 202 AVE E',
'8013 SO. ALASKA ST.',
'529 GOLDENTEMPLE PL', '123 LOVE BIRD CT'
]}
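What I want to do is:
1. Remove leading and trailing whitespace, while keeping a single space between the original tokens.
2. Convert the text to proper (title) case.
3. Spell out abbreviated street names.
With my initial approach, I managed to achieve the first two goals: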
import re  # use regex

def nospecial(address_text):
    # remove non-alphanumeric characters but keep spaces
    address_text = re.sub("[^a-zA-Z0-9 ]+", "", address_text)
    # strip leading and trailing whitespace and change to proper case
    address_text = address_text.strip().title()
    return address_text
I thought a for loop would work for my third goal, so I modified the above to:
import re  # use regex

def st_suffix():
    return {'Dr': 'Drive',
            'Rd': 'Road', 'Blvd': 'Boulevard',
            'St': 'Street', 'Ste': 'Suite',
            'Apts': 'Apartments', 'Apt': 'Apartment',
            'Ct': 'Court', 'Cir': 'Circle'}

def nospecial(address_text):
    abbv = st_suffix()  # get dict
    # remove non-alphanumeric characters but keep spaces
    address_text = re.sub("[^a-zA-Z0-9 ]+", "", address_text)
    # strip leading and trailing whitespace and change to proper case
    address_text = address_text.strip().title()
    for suffix in address:  # go through my address text, search for the abbreviated keys above and spell them out
        rep = abbv[address_text] if address_text in abbv.keys() else address_text[suffix]  # check dict
    return address_text
With this last version, I get a TypeError: string indices must be integers.
I think I made a mistake on the for loop line, but I'm not sure. Please help. Thanks.
You can use the pandas string methods together with a regex-based replace:
import pandas as pd
data = {'address':['211 S. 10TH AVE APT 4',
'11095 FRAZIER DR',
'1020 BLUEBERRY CT SE ,',
'7614 202 AVE E',
'8013 SO. ALASKA ST.',
'529 GOLDENTEMPLE PL', '123 LOVE BIRD CT'
]}
df=pd.DataFrame(data)
d = {r'\bDr\b\.?': 'Drive',
     r'\bRd\b\.?': 'Road', r'\bBlvd\b\.?': 'Boulevard',
     r'\bSt\b\.?': 'Street', r'\bSte\b\.?': 'Suite',
     r'\bApts\b\.?': 'Apartments', r'\bApt\b\.?': 'Apartment',
     r'\bCt\b\.?': 'Court', r'\bCir\b\.?': 'Circle'}
df['address'] = df['address'].str.split().str.join(' ').str.title().replace(d, regex=True)
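Notes:
d is a dictionary with regular expressions as keys and replacements as values; \b denotes a word boundary and \.? matches an optional dot character
.str.split().str.join(' ') - removes leading/trailing whitespace and leaves exactly one space between the non-whitespace chunks of the string
.str.title() - converts the string to title case
.replace(d, regex=True) - replaces each match of a key in d with the corresponding dictionary value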
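To see why the word boundaries matter, here is a small sketch using the standard `re` module. The sample string is made up for illustration and is not taken from the data above.

```python
import re

# Hypothetical sample address, title-cased like the data after .str.title()
text = "12 Strawberry St. Apt 4"

# Without \b the pattern also matches the "St" at the start of "Strawberry".
without_boundary = re.sub(r"St\.?", "Street", text)

# With \b only the standalone word "St" (plus an optional dot) is replaced.
with_boundary = re.sub(r"\bSt\b\.?", "Street", text)

print(without_boundary)  # 12 Streetrawberry Street Apt 4
print(with_boundary)     # 12 Strawberry Street Apt 4
```

The same reasoning applies to the other keys, for example `Dr` inside a name such as "Drake".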
Output:
>>> df['address'].str.split().str.join(' ').str.title().replace(d, regex=True)
0 211 S. 10Th Ave Apartment 4
1 11095 Frazier Drive
2 1020 Blueberry Court Se ,
3 7614 202 Ave E
4 8013 So. Alaska Street
5 529 Goldentemple Pl
6 123 Love Bird Court
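As for the TypeError in the question: it comes from `address_text[suffix]`, because a string can only be indexed with integers (or slices), not with another string. If you prefer to stay with a plain function instead of pandas, one possible fix is to split the cleaned text into words and look each word up in the dictionary. This is only a sketch; the names `clean_address` and `ABBREVIATIONS` are made up for this example, and unlike the pandas version above it also strips punctuation, as the original `nospecial` did.

```python
import re

# Abbreviation table copied from st_suffix() in the question.
ABBREVIATIONS = {
    'Dr': 'Drive', 'Rd': 'Road', 'Blvd': 'Boulevard',
    'St': 'Street', 'Ste': 'Suite',
    'Apts': 'Apartments', 'Apt': 'Apartment',
    'Ct': 'Court', 'Cir': 'Circle',
}

def clean_address(address_text):
    # Drop everything except letters, digits and spaces.
    address_text = re.sub(r"[^a-zA-Z0-9 ]+", "", address_text)
    # title() fixes the case; split() discards leading, trailing
    # and repeated whitespace and yields the individual words.
    words = address_text.title().split()
    # Replace each whole word found in the dict; keep the rest unchanged.
    return " ".join(ABBREVIATIONS.get(word, word) for word in words)

print(clean_address('8013 SO. ALASKA ST.'))     # 8013 So Alaska Street
print(clean_address('1020 BLUEBERRY CT SE ,'))  # 1020 Blueberry Court Se
```

Because the lookup works on whole words, it does not need word boundaries, and it can be applied to a DataFrame column with `df['address'].apply(clean_address)`.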