I have a list of 1000+ URLs saved in a .csv file (the URLs are used to download reports). Some of the URLs return a 404 error, and I'd like to find a way to remove them from the list.
I managed to write the code below (for Python 3) to identify which URLs are invalid. However, I don't know how to remove these URLs from the list automatically, since there are so many of them. Thanks a lot.
from urllib.request import urlopen
from urllib.error import HTTPError

try:
    urlopen("url")
except HTTPError as err:
    if err.code == 404:
        print('invalid')
    else:
        raise
You can collect the 404 URLs in another set (assuming there are fewer 404 URLs than valid ones) and then take the set difference, like so:
from urllib.request import urlopen
from urllib.error import HTTPError

exclude_urls = set()
for url in all_urls:
    try:
        urlopen(url)
    except HTTPError as err:
        if err.code == 404:
            exclude_urls.add(url)

valid_urls = set(all_urls) - exclude_urls
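As a concrete illustration of the set difference at the end (using made-up URLs in place of `all_urls`, which the answer assumes you've loaded from the CSV):

```python
# Hypothetical URLs standing in for the contents of all_urls
all_urls = [
    "https://example.com/report-1.pdf",
    "https://example.com/report-2.pdf",
    "https://example.com/report-3.pdf",
]
# Pretend this one came back with a 404 during the check
exclude_urls = {"https://example.com/report-2.pdf"}

# Everything in all_urls that is not in exclude_urls
valid_urls = set(all_urls) - exclude_urls
```

Note that converting to a set drops the original order; if the order of the CSV matters, filter the list instead: `[u for u in all_urls if u not in exclude_urls]`.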
Suppose list A holds all the URLs. Note that list.remove() modifies the list in place and returns None, so don't reassign the result:
A.remove("invalid_url")
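A quick sketch of the in-place behavior (with hypothetical URLs; the original `A = A.remove(...)` would have set A to None):

```python
# Hypothetical list of URLs; one is known to be invalid
A = ["https://example.com/ok.pdf", "https://example.com/missing.pdf"]

# remove() mutates A in place and returns None, so no assignment
A.remove("https://example.com/missing.pdf")
```

Also be aware that remove() deletes only the first matching entry and raises ValueError if the value isn't present, so this pattern works best when you already know each invalid URL is in the list exactly once.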
You can do it like this:
from urllib.request import urlopen
from urllib.error import HTTPError

def load_data(csv_name):
    ...

def save_data(data, csv_name):
    ...

links = load_data(csv_name)
new_links = set()
for i in links:
    try:
        urlopen(i)
        new_links.add(i)  # reachable, so keep it
    except HTTPError as err:
        if err.code == 404:
            print('invalid')
        else:
            raise

save_data(list(new_links), csv_name)
Try something like this:
import csv
from urllib.request import urlopen
from urllib.error import HTTPError

# 1. Load the CSV file into a list
with open('urls.csv', 'r') as file:
    reader = csv.reader(file)
    urls = [row[0] for row in reader]  # Assuming each row has one URL

# 2. Check each URL for validity using your code
valid_urls = []
for url in urls:
    try:
        urlopen(url)
        valid_urls.append(url)
    except HTTPError as err:
        if err.code == 404:
            print(f'Invalid URL: {url}')
        else:
            raise  # If it's another type of error, raise it so you're aware

# 3. Write the cleaned list back to the CSV file
with open('cleaned_urls.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    for url in valid_urls:
        writer.writerow([url])
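One practical note for 1000+ URLs: a plain urlopen() call downloads each report's body just to learn the status code. A variation worth considering (my suggestion, not part of the answers above, and `is_live` is a hypothetical helper name) is to send a HEAD request so the server returns only the headers:

```python
from urllib.request import Request, urlopen
from urllib.error import HTTPError

def is_live(url, timeout=10):
    """Return False if the URL answers with 404, True if any response arrives."""
    req = Request(url, method="HEAD")  # HEAD skips downloading the report body
    try:
        urlopen(req, timeout=timeout)
    except HTTPError as err:
        if err.code == 404:
            return False
        raise  # surface other HTTP errors (403, 500, ...) instead of hiding them
    return True
```

You could then filter with `valid_urls = [u for u in urls if is_live(u)]`. Note that some servers reject HEAD requests even for URLs that work with GET, so spot-check a few URLs before trusting this shortcut.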