How to remove broken URLs from a list



I have a list of 1000+ URLs saved in a .csv file (the URLs are used to download reports). Some of them return a 404 error, and I want to find a way to remove those from the list.

I managed to write code (for Python 3, shown below) that identifies whether a URL is invalid. However, I don't know how to automatically remove these URLs from the list, since there are so many of them. Thanks a lot.

from urllib.request import urlopen
from urllib.error import HTTPError

try:
    urlopen(url)  # url is the address being checked
except HTTPError as err:
    if err.code == 404:
        print('invalid')
    else:
        raise

You can use another collection to hold the 404 URLs (assuming there are fewer 404 URLs than valid ones), then take the set difference, like so:

from urllib.request import urlopen
from urllib.error import HTTPError

exclude_urls = set()
for url in all_urls:
    try:
        urlopen(url)
    except HTTPError as err:
        if err.code == 404:
            exclude_urls.add(url)

valid_urls = set(all_urls) - exclude_urls
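
One caveat with the set difference: it discards the original ordering of the list. If the order from the CSV matters, a minimal sketch that filters the list instead, reusing all_urls and exclude_urls from above:

# Filter the list rather than taking a set difference, so the
# surviving URLs keep their original order
valid_urls = [u for u in all_urls if u not in exclude_urls]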

Consider a list A that holds all the URLs.

A.remove("invalid_url")  # remove() mutates the list in place and returns None, so don't reassign
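
Note that remove() only deletes the first matching entry and raises ValueError if the value is absent, so calling it once per bad URL gets awkward with many of them. A sketch of the same idea for removing several URLs at once (the names and URLs here are hypothetical):

# Hypothetical list and set of 404 URLs collected from the checks above
A = ["http://example.com/a", "http://example.com/bad", "http://example.com/b"]
invalid_urls = {"http://example.com/bad"}

# Rebuild A without the invalid entries instead of calling remove() per URL
A = [url for url in A if url not in invalid_urls]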

You can do it like this:

from urllib.request import urlopen
from urllib.error import HTTPError

def load_data(csv_name):
    ...

def save_data(data, csv_name):
    ...

links = load_data(csv_name)
new_links = set()
for i in links:
    try:
        urlopen(i)
    except HTTPError as err:
        if err.code == 404:
            print('invalid')
            continue  # skip 404 links; other HTTP errors fall through and are kept
    new_links.add(i)

save_data(list(new_links), csv_name)
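
The load_data and save_data bodies are left as stubs above. A minimal sketch of what they might look like with the standard csv module, assuming one URL per row:

import csv

def load_data(csv_name):
    # Read one URL per row, skipping any empty rows
    with open(csv_name, newline='') as f:
        return [row[0] for row in csv.reader(f) if row]

def save_data(data, csv_name):
    # Write each URL back as its own row
    with open(csv_name, 'w', newline='') as f:
        writer = csv.writer(f)
        for url in data:
            writer.writerow([url])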

Try something like this:

import csv
from urllib.request import urlopen
from urllib.error import HTTPError

# 1. Load the CSV file into a list
with open('urls.csv', 'r') as file:
    reader = csv.reader(file)
    urls = [row[0] for row in reader]  # Assuming each row has one URL

# 2. Check each URL for validity
valid_urls = []
for url in urls:
    try:
        urlopen(url)
        valid_urls.append(url)
    except HTTPError as err:
        if err.code == 404:
            print(f'Invalid URL: {url}')
        else:
            raise  # If it's another type of error, raise it so you're aware

# 3. Write the cleaned list back to the CSV file
with open('cleaned_urls.csv', 'w') as file:
    writer = csv.writer(file)
    for url in valid_urls:
        writer.writerow([url])
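
With 1000+ URLs, one unresponsive server can stall the whole loop, and network failures raise URLError rather than HTTPError. A hedged variant of the check that adds a timeout and treats unreachable hosts as invalid (the 10-second value is an assumption, not from the original):

from urllib.request import urlopen
from urllib.error import HTTPError, URLError

def is_valid(url, timeout=10):
    # Returns False for 404s and unreachable hosts; re-raises other HTTP errors
    try:
        urlopen(url, timeout=timeout)
        return True
    except HTTPError as err:  # must come before URLError (its parent class)
        if err.code == 404:
            return False
        raise
    except (URLError, TimeoutError):
        # DNS failure, refused connection, or timed-out request
        return False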
