Need help joining dictionary items and removing newlines, multiple spaces, and special characters

I have a dictionary with 2 URLs and their text. I need to get rid of all the repeated spaces, special characters, and newlines:

{'https://firsturl.com': ['\n', '', ' ', ' ', ' ', '\n ', '', '', '', '', 'Home | Sam ModelInc', ' \n', '\n\n', '\n', ' \n\n ', '', '', '', '', '', '', 'Skip to main content'], 'https://secondurl.com#main-content': ['\n', '', ' ', ' ', ' ', '\n ', '', '', '', '', 'Home | Going to start inc', ' \n', '\n\n', '\n', ' \n\n ', '', '', '', '', '', '', 'Skip to main content', '', ' ', '\n ', '\n ', ' ', '\n ', '', '\n ', '', '\n ', '', 'Brands', '', 'About Us', '', 'Syndication', '', 'Direct Response']}

Expected output: {'https://firsturl.com': ['Home Sam ModelInc Skip to main content'], 'https://secondurl.com#main-content': ['Home Going to start inc Skip to main content Brands About Us Syndication Direct Response']}

Any help would be appreciated.

So, let's walk through this rather than just throwing some code at you.

The first elements we want to remove are the newlines. So we can start with something like this:

ex_dict = {"a": ["\n\n", "\n"]}
for x in ex_dict:
    # Keep only the strings that contain no newline character
    new_list = [e for e in ex_dict[x] if "\n" not in e]
    ex_dict[x] = new_list

If you run this, you'll see that we now filter out every string that contains a newline.
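As a quick check of this step, here is a minimal sketch on a shortened list modeled on the question's first URL (the `sample` and `filtered` names and the shortened data are assumptions made for illustration). Note that empty and space-only strings have no newline in them, so they survive this filter; the final join/split step takes care of them.

```python
# Shortened, hypothetical sample modeled on the question's first URL.
sample = {
    "https://firsturl.com": [
        "\n", "", " ", "\n ", "Home | Sam ModelInc", " \n", "Skip to main content",
    ]
}

filtered = {}
for url, items in sample.items():
    # Keep only the strings that do not contain a newline character.
    filtered[url] = [e for e in items if "\n" not in e]

print(filtered)
# {'https://firsturl.com': ['', ' ', 'Home | Sam ModelInc', 'Skip to main content']}
```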

Now we're left with the following:

Home | Sam ModelInc
Skip to main content
Home | Going to start inc
Brands
About Us
Syndication
Direct Response

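Based on the expected output, you want to lowercase all the words and remove the non-alphabetic characters.

I did some research on how to do that.

In code, it looks like this: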

import re
regex = re.compile('[^a-zA-Z ]')  # had to tweak the linked solution to include spaces
ex_dict = {"a": ["\n\n", "\n"]}
for x in ex_dict:
    new_list = [e for e in ex_dict[x] if "\n" not in e]
    # >>> regex.sub("", "Home | Sam ModelInc")
    # 'Home  Sam ModelInc'
    new_list = [regex.sub("", e) for e in new_list]
    ex_dict[x] = new_list

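So now our final new_list looks like this: ['Home  Sam ModelInc', 'Skip to main content'] (removing the "|" leaves a double space behind, which the last step collapses).

Next, we want to lowercase everything.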
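To see what the pattern does to a few of the strings from the question, here is a small sketch (the `titles` and `cleaned` names are illustrative, not part of the original code):

```python
import re

# Anything that is not an ASCII letter or a space gets removed.
regex = re.compile(r"[^a-zA-Z ]")

titles = ["Home | Sam ModelInc", "Home | Going to start inc", "About Us"]
cleaned = [regex.sub("", t) for t in titles]

print(cleaned)
# ['Home  Sam ModelInc', 'Home  Going to start inc', 'About Us']
```

Keep in mind that this character class also drops digits and any non-ASCII letters, so it would need widening if the page text can contain those.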

import re
regex = re.compile('[^a-zA-Z ]')  # had to tweak the linked solution to include spaces
ex_dict = {"a": ["\n\n", "\n"]}
for x in ex_dict:
    new_list = [e for e in ex_dict[x] if "\n" not in e]
    # >>> regex.sub("", "Home | Sam ModelInc")
    # 'Home  Sam ModelInc'
    new_list = [regex.sub("", e) for e in new_list]
    new_list = [e.lower() for e in new_list]
    ex_dict[x] = new_list

Finally, we want to combine everything into one string with only a single space between each word.

import re
regex = re.compile('[^a-zA-Z ]')  # had to tweak the linked solution to include spaces
ex_dict = {"a": ["\n\n", "\n"]}
for x in ex_dict:
    new_list = [e for e in ex_dict[x] if "\n" not in e]
    # >>> regex.sub("", "Home | Sam ModelInc")
    # 'Home  Sam ModelInc'
    new_list = [regex.sub("", e) for e in new_list]
    new_list = [e.lower() for e in new_list]
    # Join everything, then split on whitespace and re-join to collapse repeated spaces
    new_list = [" ".join((" ".join(new_list)).split())]
    ex_dict[x] = new_list
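Putting the steps together, the whole walkthrough could be wrapped in a function along these lines. This is only a sketch: `clean_pages`, `NON_ALPHA`, and the shortened `pages` data are names and values made up for illustration, and the function returns a new dictionary instead of modifying the input in place.

```python
import re

# Matches anything that is not an ASCII letter or a space.
NON_ALPHA = re.compile(r"[^a-zA-Z ]")


def clean_pages(pages):
    """Return a new dict mapping each URL to a one-element list of cleaned text."""
    result = {}
    for url, items in pages.items():
        # 1. Drop every string that contains a newline.
        kept = [e for e in items if "\n" not in e]
        # 2. Remove non-alphabetic characters and lowercase what is left.
        kept = [NON_ALPHA.sub("", e).lower() for e in kept]
        # 3. Join, split on whitespace, and re-join to collapse repeated spaces.
        result[url] = [" ".join(" ".join(kept).split())]
    return result


# Shortened, hypothetical version of the question's data.
pages = {
    "https://firsturl.com": [
        "\n", "", " ", "\n ", "Home | Sam ModelInc", " \n", "\n\n", "",
        "Skip to main content",
    ],
    "https://secondurl.com#main-content": [
        "\n", " ", "Home | Going to start inc", " \n\n ", "Skip to main content",
        "", "Brands", "", "About Us", "", "Syndication", "", "Direct Response",
    ],
}

print(clean_pages(pages))
# {'https://firsturl.com': ['home sam modelinc skip to main content'],
#  'https://secondurl.com#main-content': ['home going to start inc skip to main content brands about us syndication direct response']}
```

If the original capitalization should be preserved, dropping the `.lower()` call in step 2 is enough; the rest of the pipeline is unaffected.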
