Need help joining dictionary items and removing newlines, multiple spaces, and special characters



I have a dictionary with 2 URLs and their text. I need to get rid of all the repeated spaces, special characters, and newlines:

{'https://firsturl.com': ['\n', '', ' ', ' ', ' ', '\n ', '', '', '', '', 'Home | Sam ModelInc', ' \n', '\n\n', '\n', ' \n\n ', '', '', '', '', '', '', 'Skip to main content'],
 'https://secondurl.com#main-content': ['\n', '', ' ', ' ', ' ', '\n ', '', '', '', '', 'Home | Going to start inc', ' \n', '\n\n', '\n', ' \n\n ', '', '', '', '', '', '', 'Skip to main content', '', ' ', '\n ', '\n ', ' ', '\n ', '', '\n ', '', '\n ', '', 'Brands', '', 'About Us', '', 'Syndication', '', 'Direct Response']}

Expected output: {'https://firsturl.com': ['Home Sam ModelInc Skip to main content'], 'https://secondurl.com#main-content': ['Home Going to start inc Skip to main content Brands About Us Syndication Direct Response']}

Any help would be appreciated.

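So let's walk through this rather than just throwing some code at you.

The first thing we want to remove is the newlines. We can start with something like this: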

ex_dict = {"a": ["\n\n", "\n"]}
for x in ex_dict:
    # keep only the elements that contain no newline
    new_list = [e for e in ex_dict[x] if "\n" not in e]
    ex_dict[x] = new_list

If you run it, you will see that every element containing a newline is now filtered out.

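Setting aside the empty and whitespace-only strings (the last step takes care of those), we are now left with the following: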

Home | Sam ModelInc
Skip to main content
Home | Going to start inc
Brands
About Us
Syndication
Direct Response
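
As a rough check of the filtering step, here is a minimal sketch run on a shortened list shaped like the question's first entry. The `sample` data and the names `filtered` and `non_blank` are illustrative assumptions, not part of the original code. It shows that empty and space-only strings survive the newline filter, and one possible way to drop them early if you prefer.

```python
# Shortened, illustrative sample shaped like the question's first entry.
sample = {
    "https://firsturl.com": [
        "\n", "", " ", "\n ", "Home | Sam ModelInc", " \n", "\n\n", "", "Skip to main content",
    ]
}

filtered = {}
for url, texts in sample.items():
    # Drop every element that contains a newline anywhere in it.
    filtered[url] = [t for t in texts if "\n" not in t]

# Optional extra step (an assumption, not in the original answer):
# also drop strings that are empty or whitespace-only.
non_blank = {url: [t for t in texts if t.strip()] for url, texts in filtered.items()}

print(filtered)
print(non_blank)
```

Note that the newline test drops the whole element, so a string that mixed real text with a newline would be discarded as well; the question's data does not appear to contain such a case.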

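Based on the expected output, you want to lowercase all the words and remove the non-alphabetic characters.

I did some research on how to do this.

In code, it looks like this: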

import re

regex = re.compile('[^a-zA-Z ]')  # had to tweak the linked solution to include spaces
ex_dict = {"a": ["\n\n", "\n"]}
for x in ex_dict:
    # keep only the elements that contain no newline
    new_list = [e for e in ex_dict[x] if "\n" not in e]
    # >>> regex.sub("", "Home | Sam ModelInc")
    # 'Home  Sam ModelInc'
    new_list = [regex.sub("", e) for e in new_list]
    ex_dict[x] = new_list

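So for the first URL in the question, ignoring the blank strings, our new_list now looks like this: ['Home  Sam ModelInc', 'Skip to main content']. The double space left behind where the | was removed gets collapsed in the last step.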
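
To make the effect of the pattern concrete, here is a small sketch using the same character class as above. The second input string is a made-up example, added only to show that digits, punctuation, and non-ASCII letters are removed too, which may or may not be what you want for your pages.

```python
import re

# Same pattern as above: anything that is not an ASCII letter or a space is removed.
regex = re.compile("[^a-zA-Z ]")

with_pipe = regex.sub("", "Home | Sam ModelInc")
with_digits = regex.sub("", "Café 24/7")  # hypothetical input, not from the question

print(repr(with_pipe))    # 'Home  Sam ModelInc' (two spaces where the | was)
print(repr(with_digits))  # 'Caf ' (the accented letter and the digits are gone)
```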

Next we want to lowercase everything.

import re

regex = re.compile('[^a-zA-Z ]')  # had to tweak the linked solution to include spaces
ex_dict = {"a": ["\n\n", "\n"]}
for x in ex_dict:
    # keep only the elements that contain no newline
    new_list = [e for e in ex_dict[x] if "\n" not in e]
    # >>> regex.sub("", "Home | Sam ModelInc")
    # 'Home  Sam ModelInc'
    new_list = [regex.sub("", e) for e in new_list]
    new_list = [e.lower() for e in new_list]
    ex_dict[x] = new_list

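Finally, we want to join everything together with exactly one space between each word. Joining the list with spaces and then calling split() with no arguments breaks the text on any run of whitespace, so the blank strings and double spaces disappear when the words are joined again.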

import re

regex = re.compile('[^a-zA-Z ]')  # had to tweak the linked solution to include spaces
ex_dict = {"a": ["\n\n", "\n"]}
for x in ex_dict:
    # keep only the elements that contain no newline
    new_list = [e for e in ex_dict[x] if "\n" not in e]
    # >>> regex.sub("", "Home | Sam ModelInc")
    # 'Home  Sam ModelInc'
    new_list = [regex.sub("", e) for e in new_list]
    new_list = [e.lower() for e in new_list]
    # join into one string, then split/join again to collapse repeated spaces
    new_list = [" ".join((" ".join(new_list)).split())]
    ex_dict[x] = new_list
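
Putting the steps together, the walkthrough could be wrapped up roughly as follows. This is only a sketch: the function name `clean_pages`, the constant `NON_LETTERS`, and the shortened `pages` data are assumptions made for illustration, and the function returns a new dictionary instead of modifying the input in place.

```python
import re

# Anything that is not an ASCII letter or a space gets removed.
NON_LETTERS = re.compile("[^a-zA-Z ]")


def clean_pages(pages):
    """Return a new dict mapping each URL to a one-element list of cleaned text."""
    cleaned = {}
    for url, texts in pages.items():
        # 1. Drop elements that contain a newline.
        kept = [t for t in texts if "\n" not in t]
        # 2. Remove non-letter characters and lowercase what is left.
        kept = [NON_LETTERS.sub("", t).lower() for t in kept]
        # 3. Join, then split on whitespace and re-join to collapse extra spaces.
        cleaned[url] = [" ".join(" ".join(kept).split())]
    return cleaned


# Shortened, illustrative version of the question's data.
pages = {
    "https://firsturl.com": [
        "\n", "", " ", "Home | Sam ModelInc", " \n", "Skip to main content",
    ],
    "https://secondurl.com#main-content": [
        "\n", "Home | Going to start inc", "", "Skip to main content", "\n ",
        "Brands", "About Us", "Syndication", "Direct Response",
    ],
}

result = clean_pages(pages)
for url, text in result.items():
    print(url, text)
```

Because of the lowercasing step, the first entry comes out as 'home sam modelinc skip to main content'. If you want to keep the original capitalization, leave out the `.lower()` call.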
