Cleaning URLs and saving them to a txt file in Python 3

I am trying to clean and normalize the URLs in a text file.

Here is my current code:

import re
with open("urls.txt", encoding='utf-8') as f:
    content = f.readlines()
content = [x.strip() for x in content]
url_format = "https://www.google"
for item in content:
    if not item.startswith(url_format):
        old_item = item
        new_item = re.sub(r'.*google', url_format, item)
        content.append(new_item)
        content.remove(old_item)
with open('result.txt', mode='wt', encoding='utf-8') as myfile:
    myfile.write('\n'.join(content))

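The problem is that if I print the old and new items inside the loop, every URL appears to have been cleaned. But when I print the list of URLs outside the loop, it still contains uncleaned URLs: some of the bad ones were removed and others were not.

Why are bad URLs still in the list when I remove each bad URL and append the clean one inside the for loop? Should this be solved in a different way?

I also noticed that the code takes a long time to run on a large set of URLs. Should I be using a different tool?

Any help would be appreciated.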

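This happens because you are removing items from the list while iterating over it, which makes the loop skip elements. Instead, you can build a new list with the cleaned values, modify the list in place by index, or simply use a list comprehension: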

content = [item if item.startswith(url_format) else re.sub(r'.*google', url_format, item) for item in content]

Or, using another list:

new_content = []
for item in content:
    if item.startswith(url_format):
        new_content.append(item)
    else:
        new_content.append(re.sub(r'.*google', url_format, item))
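
On the speed question, the same "write to a new place instead of mutating" idea can be applied to the files directly. The following is only a sketch under some assumptions: the names clean_url and clean_file are made up for illustration, the regex is the one from the question, and each cleaned URL is written with a trailing newline (unlike '\n'.join, which leaves none after the last line). It avoids list.remove, which does a linear search on every call, and never holds the whole file in memory.

```python
import re

URL_FORMAT = "https://www.google"
# Compiled once; same pattern as in the question.
GOOGLE_PREFIX = re.compile(r'.*google')

def clean_url(url, url_format=URL_FORMAT):
    """Return url unchanged if it already has the prefix, else rewrite it."""
    if url.startswith(url_format):
        return url
    return GOOGLE_PREFIX.sub(url_format, url)

def clean_file(src_path, dst_path):
    """Read URLs line by line from src_path and write the cleaned ones to dst_path."""
    with open(src_path, encoding='utf-8') as src, \
         open(dst_path, mode='wt', encoding='utf-8') as dst:
        for line in src:
            dst.write(clean_url(line.strip()) + '\n')
```

With the file names from the question, this would be called as clean_file("urls.txt", "result.txt").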

Or, modifying the list by index:

for i, item in enumerate(content):
    if not item.startswith(url_format):
        content[i] = re.sub(r'.*google', url_format, item)
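
To see why the loop in the question leaves bad URLs behind, here is a small reproduction. The three sample URLs and the function name clean_while_iterating are invented for the demonstration; the loop body is the same append/remove pattern as in the question. When the first item is removed, every later item shifts one position to the left, so the loop's internal index steps over the item that moved into the freed slot.

```python
import re

url_format = "https://www.google"

def clean_while_iterating(items):
    """Reproduce the question's pattern: append and remove inside the loop."""
    items = list(items)  # work on a copy so the caller's list is untouched
    for item in items:
        if not item.startswith(url_format):
            items.append(re.sub(r'.*google', url_format, item))
            items.remove(item)  # shifts later items left; the next one is skipped
    return items

result = clean_while_iterating([
    "http://google.com/a",       # bad, gets cleaned
    "http://google.com/b",       # bad, but skipped by the loop
    "https://www.google.com/c",  # already clean
])
print(result)
# ['http://google.com/b', 'https://www.google.com/c', 'https://www.google.com/a']
```

The second URL is never visited, which matches the "some are removed and some are not" behaviour described in the question.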
