How do I split a string on spaces, then split the resulting elements on special characters and digits, and then join them back together?



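So, what I want to do is convert some of the words in a string to their respective words from a dictionary and leave everything else as it is. For example, for the input: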

standarisationn("well-2-34 2   @$%23beach bend com")

I would like the output to be:

"well-2-34 2 @$%23bch bnd com"

The code I am using is:

import re

def standarisationn(addr):
    # Treat commas as separators
    a = re.sub(',', ' ', addr)
    lookp_dict = {"allee":"ale","alley":"ale","ally":"ale","aly":"ale",
                  "arcade":"arc",
                  "apartment":"apt","aprtmnt":"apt","aptmnt":"apt",
                  "av":"ave","aven":"ave","avenu":"ave","avenue":"ave","avn":"ave","avnue":"ave",
                  "beach":"bch",
                  "bend":"bnd",
                  "blfs":"blf","bluf":"blf","bluff":"blf","bluffs":"blf",
                  "boul":"blvd","boulevard":"blvd","boulv":"blvd",
                  "bottm":"bot","bottom":"bot",
                  "branch":"br","brnch":"br",
                  "brdge":"brg","bridge":"brg",
                  "bypa":"byp","bypas":"byp","bypass":"byp","byps":"byp",
                  "camp":"cmp",
                  "canyn":"cny","canyon":"cny","cnyn":"cny",
                  "southwest":"sw" ,"northwest":"nw"}
    # Tokens are alphanumeric runs or single non-whitespace characters
    temp = re.findall(r"[A-Za-z0-9]+|\S", a)
    print(temp)
    res = []
    for wrd in temp:
        res.append(lookp_dict.get(wrd, wrd))
    res = ' '.join(res)
    return str(res)

But it gives the wrong output:

'well - 2 - 34 2 @ $ % 23beach bnd com'

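There are too many spaces, and it does not even convert "beach" to "bch". That is the problem. My idea is to first split the string on spaces, then split the resulting elements on special characters and digits, apply the dictionary, join the split pieces back together around the special characters without spaces, and finally join the whole list with spaces. Can anyone suggest how to do this, or a better approach?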
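The split-then-rejoin idea described above could be sketched roughly as follows. This is only an illustration under a few assumptions: the function name `standardise_by_splitting` and the shortened `LOOKUP` table are hypothetical, and "word" is taken to mean a run of ASCII letters.

```python
import re

# Shortened, hypothetical lookup table; the full one is the dictionary in the question.
LOOKUP = {"beach": "bch", "bend": "bnd"}

def standardise_by_splitting(addr):
    out_tokens = []
    # str.split() with no argument splits on runs of whitespace,
    # so repeated spaces never produce empty tokens.
    for token in addr.replace(",", " ").split():
        # The capturing group makes re.split return the letter runs as well
        # as the digits and symbols between them, so nothing is lost.
        pieces = re.split(r"([A-Za-z]+)", token)
        # Look up each piece; non-matching pieces are kept unchanged.
        out_tokens.append("".join(LOOKUP.get(p, p) for p in pieces))
    return " ".join(out_tokens)

print(standardise_by_splitting("well-2-34 2   @$%23beach bend com"))
```

Because the pieces of each token are rejoined with an empty string, no extra spaces appear around the special characters.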

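You can build your regex from the keys of your dictionary, making sure they are not enclosed in another word (i.e. not directly preceded or followed by a letter):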

import re

def standarisationn(addr):
    # Replace commas and runs of whitespace with a single space
    addr = re.sub(r'(,|\s+)', " ", addr)
    lookp_dict = {"allee":"ale","alley":"ale","ally":"ale","aly":"ale",
                  "arcade":"arc",
                  "apartment":"apt","aprtmnt":"apt","aptmnt":"apt",
                  "av":"ave","aven":"ave","avenu":"ave","avenue":"ave","avn":"ave","avnue":"ave",
                  "beach":"bch",
                  "bend":"bnd",
                  "blfs":"blf","bluf":"blf","bluff":"blf","bluffs":"blf",
                  "boul":"blvd","boulevard":"blvd","boulv":"blvd",
                  "bottm":"bot","bottom":"bot",
                  "branch":"br","brnch":"br",
                  "brdge":"brg","bridge":"brg",
                  "bypa":"byp","bypas":"byp","bypass":"byp","byps":"byp",
                  "camp":"cmp",
                  "canyn":"cny","canyon":"cny","cnyn":"cny",
                  "southwest":"sw" ,"northwest":"nw"}
    # Replace each key only where it is not directly preceded or followed by a letter
    for wrd in lookp_dict:
        addr = re.sub(rf'(?:^|(?<=[^a-zA-Z])){wrd}(?=[^a-zA-Z]|$)', lookp_dict[wrd], addr)
    return addr

print(standarisationn("well-2-34 2   @$%23beach bend com"))

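The expression is made of the following parts:

  • ^ matches the beginning of the string
  • (?<=[^a-zA-Z]) is a lookbehind (a zero-width assertion that consumes no characters) checking that the preceding character is not a letter
  • {wrd} is the key of your dictionary
  • (?=[^a-zA-Z]|$) is a lookahead (also zero-width) checking that the following character is not a letter, or that the end of the string has been reached

Output: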

well-2-34 2 @$%23bch bnd com
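To see why the lookarounds are used rather than the more common `\b` word boundary, here is a small sketch with a single key hard-coded. The helper name `replace_beach` is hypothetical and only serves the illustration.

```python
import re

# Same boundary checks as in the answer, applied to the single key "beach".
pattern = re.compile(r"(?:^|(?<=[^a-zA-Z]))beach(?=[^a-zA-Z]|$)")

def replace_beach(text):
    return pattern.sub("bch", text)

print(replace_beach("23beach"))   # a digit precedes the key: replaced
print(replace_beach("beaches"))   # a letter follows the key: left alone
print(replace_beach("seabeach"))  # a letter precedes the key: left alone

# \b treats digits as word characters, so there is no boundary
# between "3" and "b" and nothing is replaced here.
print(re.sub(r"\bbeach\b", "bch", "23beach"))
```

The custom lookarounds count only letters as part of a word, which is what allows "23beach" to become "23bch".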

Edit: you can compile the whole expression and call re.sub only once if you replace the loop with:

    repl_pattern = re.compile(rf"(?:^|(?<=[^a-zA-Z]))({'|'.join(lookp_dict.keys())})(?=([^a-zA-Z]|$))")
    addr = re.sub(repl_pattern, lambda x: lookp_dict[x.group(1)], addr)

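If your dictionary keeps growing, this should be much faster, because a single expression is built from all of your dictionary keys:

  • ({'|'.join(lookp_dict.keys())}) is interpreted as (allee|alley|...)
  • the lambda function in re.sub replaces the matched element with the corresponding value in lookp_dict (see this link for more details)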
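Put together, the single-pass variant might look like the sketch below. Several details are assumptions rather than part of the original answer: the names `LOOKUP`, `PATTERN` and `standardise` are hypothetical, the table is shortened, the keys are passed through `re.escape` in case one ever contains a regex metacharacter, and they are sorted longest first as a precaution.

```python
import re

# Shortened, hypothetical table for illustration only.
LOOKUP = {"av": "ave", "avenue": "ave", "beach": "bch", "bend": "bnd"}

# Escape every key and try longer keys first (both are defensive assumptions).
_keys = sorted(LOOKUP, key=len, reverse=True)
PATTERN = re.compile(
    r"(?:^|(?<=[^a-zA-Z]))("
    + "|".join(map(re.escape, _keys))
    + r")(?=[^a-zA-Z]|$)"
)

def standardise(addr):
    # Collapse any mix of commas and whitespace into one space
    # (slightly stricter than the answer, which handles them separately).
    addr = re.sub(r"[,\s]+", " ", addr)
    # group(1) holds the matched key; look up its replacement.
    return PATTERN.sub(lambda m: LOOKUP[m.group(1)], addr)

print(standardise("well-2-34 2   @$%23beach bend com"))
```

Compiling the pattern once at module level means repeated calls to `standardise` do not rebuild the expression.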
