Is there a way to delay my web scraper before it scrapes the page?



Here is my function:

def clubList(url, yearCode):
    print(url + "/clubs" + yearCode)
    response = requests.get(url + "/clubs" + yearCode)
    time.sleep(10)
    content = response.content
    soup = BeautifulSoup(content, "html.parser")
    cluburl = []
    clubs = []
    ul = soup.find_all(
        "ul",
        attrs={
            "class": "block-list-5 block-list-3-m block-list-1-s block-list-1-xs block-list-padding dataContainer"
        },
    )
    u = str(ul)
    soup2 = BeautifulSoup(u, "html.parser")
    for i, tags in enumerate(soup2.find_all("a")):
        cluburl.append(url + str(tags.get("href")))
    for i in range(0, len(cluburl)):
        cluburl[i] = cluburl[i].replace("overview", "squad")
    return cluburl

I'm trying to scrape the Premier League website to build a statistics database for a data analysis project.

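My current link tree looks like this:

https://www.premierleague.com → https://www.premierleague.com/clubs → https://www.premierleague.com/clubs?se=418

The "?se=418" part is the season code I append to the link to specify which season's statistics I want to see; each season has its own unique code.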

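I pass "https://www.premierleague.com" as url and "?se=418" as yearCode to my function, and it should return a list of links to the individual club pages for that particular season. However, it always returns the list of club links for the current season.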

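I noticed that when I visit https://www.premierleague.com/clubs?se=418 directly, the page first loads with the current season's clubs and then dynamically refreshes to show the clubs for the requested season.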

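So I thought adding a time delay might work, but I guess the function is parsing the page content exactly as the requests.get call returned it, and I'm not sure where I should add my delay to make this work.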

Here are all the modules that need to be imported to run the function:

import requests
from bs4 import BeautifulSoup
import pandas as pd
import locale
import time
locale.setlocale(locale.LC_ALL, "en_US.UTF8")

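A delay will not help here: requests.get only downloads the initial HTML and never runs the JavaScript that applies the season filter. When the season filter is applied, the page calls the following API: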

GET https://footballapi.pulselive.com/football/teams

The following HTTP headers are required for the API to return data: "account: premierleague" and "origin: https://www.premierleague.com".
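To make the shape of that request concrete, here is a minimal sketch that builds the GET request (without sending it) using only the standard library, so the final URL and headers can be inspected. The helper name build_teams_request is made up for illustration, and the query parameters are simply copied from the requests-based example in this answer; whether all of them are needed is not something verified here.

```python
from urllib.parse import urlencode
from urllib.request import Request

API_URL = "https://footballapi.pulselive.com/football/teams"


def build_teams_request(season):
    """Build (but do not send) the GET request for one season's teams.

    Hypothetical helper: the parameter values mirror the requests-based
    example in this answer and are assumptions, not documented API facts.
    """
    query = urlencode({
        "pageSize": 100,
        "compSeasons": season,
        "compCodeForActivePlayer": "null",
        "comps": 1,
        "altIds": "true",
        "page": 0,
    })
    # The two headers the answer says are required.
    headers = {
        "account": "premierleague",
        "origin": "https://www.premierleague.com",
    }
    return Request(f"{API_URL}?{query}", headers=headers)


req = build_teams_request(418)
print(req.get_method())   # no request body is attached, so this is a GET
print(req.full_url)       # the URL including the query string
print(req.header_items()) # urllib stores header names capitalized
```

Passing req to urllib.request.urlopen would actually send it; that step is left out so the sketch runs without network access.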

The example below uses the API to fetch the list of clubs, then extracts each club's id and name to build the club URLs:

import requests

season = 418
r = requests.get(
    "https://footballapi.pulselive.com/football/teams",
    params={
        "pageSize": 100,
        "compSeasons": season,
        "compCodeForActivePlayer": "null",
        "comps": 1,
        "altIds": "true",
        "page": 0
    },
    headers={
        "account": "premierleague",
        "origin": "https://www.premierleague.com"
    }
)
data = r.json()
print([
    f'https://www.premierleague.com/clubs/{int(t["club"]["id"])}/{t["club"]["name"].replace(" ","-")}/squad'
    for t in data["content"]
])
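If you want to reuse the URL-building step or check it without hitting the network, it could be pulled out into a small function along these lines. This is only a sketch: squad_urls is a made-up name, and the sample payload uses invented clubs and ids that merely mimic the "content" / "club" / "id" / "name" layout the code above relies on; it is not real API output.

```python
BASE_URL = "https://www.premierleague.com"


def squad_urls(data):
    """Turn the parsed JSON from the teams API into squad page URLs.

    Assumes each entry in data["content"] has a "club" dict with a numeric
    "id" and a string "name", as in the code above.
    """
    urls = []
    for team in data["content"]:
        club = team["club"]
        club_id = int(club["id"])              # ids may arrive as floats, e.g. 1.0
        slug = club["name"].replace(" ", "-")  # spaces become hyphens in the URL
        urls.append(f"{BASE_URL}/clubs/{club_id}/{slug}/squad")
    return urls


# Invented sample data for illustration only (not real clubs or ids).
sample = {
    "content": [
        {"club": {"id": 1.0, "name": "Example United"}},
        {"club": {"id": 2.0, "name": "Sample City FC"}},
    ]
}

urls = squad_urls(sample)
for u in urls:
    print(u)
```

With a real response you would call squad_urls(r.json()) in place of the list comprehension.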