IndexError: list index out of range - how to skip broken URLs and continue



How can I tell my program to skip broken/non-existent URLs and continue with the task? Every time I run this program, it stops as soon as it hits a URL that does not exist and gives the error: IndexError: list index out of range.

The URLs range from 1 to 450, but there are some broken pages in the mix (for example, URL 133 does not exist).

import requests
import pandas as pd
import json
from pandas.io.json import json_normalize
from bs4 import BeautifulSoup

df = pd.DataFrame()
for id in range(1, 450):
    url = f"https://liiga.fi/api/v1/shotmap/2022/{id}"
    res = requests.get(url)
    soup = BeautifulSoup(res.content, "lxml")
    s = soup.select('html')[0].text.strip('jQuery1720724027235122559_1542743885014(').strip(')')
    s = s.replace('null','"placeholder"')
    data = json.loads(s)
    data = json_normalize(data)
    matsit = pd.DataFrame(data)
    df = pd.concat([df, matsit], axis=0)

df.to_csv("matsit.csv", index=False)

I assume your IndexError comes from this line of code:

s = soup.select('html')[0].text.strip('jQuery1720724027235122559_1542743885014(').strip(')')

You can solve it like this:

try:
    s = soup.select('html')[0].text.strip('jQuery1720724027235122559_1542743885014(').strip(')')
except IndexError as IE:
    print(f"Indexerror: {IE}")
    continue
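For illustration, here is a self-contained sketch of the same skip-on-IndexError pattern inside a loop (the items list is made up for the example):

```python
# Hypothetical data: the empty inner list triggers an IndexError,
# just like soup.select('html') returning no elements would.
items = [["a"], [], ["c"]]
results = []
for item in items:
    try:
        results.append(item[0])  # raises IndexError on the empty list
    except IndexError as IE:
        print(f"Indexerror: {IE}")
        continue
print(results)  # ['a', 'c']
```

The `continue` statement skips the rest of the loop body for the broken item, so the loop carries on with the next one instead of crashing.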

If the error does not occur on that line, simply catch the exception on the line where the IndexError does occur. Alternatively, you can catch all exceptions with:


try:
    code_where_exception_occurs
except Exception as e:
    print(f"Exception: {e}")
    continue

But I would recommend being as specific as possible, so that all expected errors are handled in an appropriate way. In the example above, replace code_where_exception_occurs with your code. You could also put the try/except clause around the whole code block inside the for loop, but it is better to catch every exception individually. This should work as well:

try:
    url = f"https://liiga.fi/api/v1/shotmap/2022/{id}"
    res = requests.get(url)
    soup = BeautifulSoup(res.content, "lxml")
    s = soup.select('html')[0].text.strip('jQuery1720724027235122559_1542743885014(').strip(')')
    s = s.replace('null','"placeholder"')
    data = json.loads(s)
    data = json_normalize(data)
    matsit = pd.DataFrame(data)
    df = pd.concat([df, matsit], axis=0)
except Exception as e:
    print(f"Exception: {e}")
    continue

The main issue is that you get a 204 (No Content) response for some URLs (e.g. https://liiga.fi/api/v1/shotmap/2022/405), so simply check the status code with an if-statement and handle it:

for i in range(400, 420):
    url = f"https://liiga.fi/api/v1/shotmap/2022/{i}"
    r = requests.get(url)

    if r.status_code != 200:
        print(f'Error occured: {r.status_code} on url: {url}')
        #### log or do whatever you like to do in case of error
    else:
        data.append(pd.json_normalize(r.json()))

Note: As mentioned in https://stackoverflow.com/a/73584487/14460824 there is no need to use BeautifulSoup; use pandas directly instead to keep the code clean.

Example

import requests, time
import pandas as pd

data = []
for i in range(400, 420):
    url = f"https://liiga.fi/api/v1/shotmap/2022/{i}"
    r = requests.get(url)

    if r.status_code != 200:
        print(f'Error occured: {r.status_code} on url: {url}')
    else:
        data.append(pd.json_normalize(r.json()))

pd.concat(data, ignore_index=True)#.to_csv("matsit", index=False)

Output

Error occured: 204 on url: https://liiga.fi/api/v1/shotmap/2022/405
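One edge case worth guarding against: if every URL in the scraped range fails the status check, the data list stays empty, and pd.concat on an empty list raises ValueError: No objects to concatenate. A minimal sketch of a guard, assuming the data list from the example above:

```python
import pandas as pd

data = []  # stays empty when every request failed the status check
# pd.concat([]) raises ValueError, so fall back to an empty DataFrame
df = pd.concat(data, ignore_index=True) if data else pd.DataFrame()
print(df.empty)  # True
```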
