This is my function:
def clubList(url, yearCode):
    print(url + "/clubs" + yearCode)
    response = requests.get(url + "/clubs" + yearCode)
    time.sleep(10)
    content = response.content
    soup = BeautifulSoup(content, "html.parser")
    cluburl = []
    clubs = []
    ul = soup.find_all(
        "ul",
        attrs={
            "class": "block-list-5 block-list-3-m block-list-1-s block-list-1-xs block-list-padding dataContainer"
        },
    )
    u = str(ul)
    soup2 = BeautifulSoup(u, "html.parser")
    for i, tags in enumerate(soup2.find_all("a")):
        cluburl.append(url + str(tags.get("href")))
    for i in range(0, len(cluburl)):
        cluburl[i] = cluburl[i].replace("overview", "squad")
    return cluburl
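I am trying to scrape the Premier League website to build a statistics database for a data analysis project.
My current link tree looks like this:
https://www.premierleague.com → https://www.premierleague.com/clubs → https://www.premierleague.com/clubs?se=418
"?se=418" is the code I append to the link to specify which season's statistics I want to see; each season has its own unique code.
I pass "https://www.premierleague.com" as url and "?se=418" as yearCode to my function, which should return a list of links to the individual club pages for that particular season. However, it always returns the list of club links for the current season.
I noticed that when I visit https://www.premierleague.com/clubs?se=418 directly, the page first loads the current season's clubs and then dynamically refreshes to show the clubs for the requested season.
So I thought adding a time delay might work, but I guess the content is already fixed at the moment the requests.get statement returns, and I am not sure where I should add my delay to make this work.
Here are all the modules that need to be imported to run the function: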
import requests
from bs4 import BeautifulSoup
import pandas as pd
import locale
import time
locale.setlocale(locale.LC_ALL, "en_US.UTF8")
When the season filter is applied, the page calls the following API:
GET https://footballapi.pulselive.com/football/teams
The following HTTP headers are required for it to return data: account: premierleague
and origin: https://www.premierleague.com
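If several calls are going to be made to this API, one way to avoid repeating the two headers is to attach them to a requests.Session. This is only a sketch: the helper name make_api_session is made up for illustration, and the request is prepared but never sent, so it just shows what would go out on the wire.

```python
import requests

API_URL = "https://footballapi.pulselive.com/football/teams"


def make_api_session():
    """Return a Session that sends the two required headers on every request.

    make_api_session is a hypothetical helper name, not part of any library.
    """
    session = requests.Session()
    session.headers.update({
        "account": "premierleague",
        "origin": "https://www.premierleague.com",
    })
    return session


session = make_api_session()

# Prepare the request without sending it, to inspect the final URL and headers.
prepared = session.prepare_request(
    requests.Request("GET", API_URL, params={"compSeasons": 418})
)
print(prepared.url)
print(prepared.headers["account"], prepared.headers["origin"])
```

Calling session.get(API_URL, params=...) afterwards would send the same headers automatically.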
The following example uses the API to get the list of clubs, then extracts each club id and club name to build the club URLs:
import requests

season = 418  # season code, the same value as in "?se=418"

r = requests.get(
    "https://footballapi.pulselive.com/football/teams",
    params={
        "pageSize": 100,
        "compSeasons": season,
        "compCodeForActivePlayer": "null",
        "comps": 1,
        "altIds": "true",
        "page": 0,
    },
    # Both headers are needed for the API to return data.
    headers={
        "account": "premierleague",
        "origin": "https://www.premierleague.com",
    },
)
data = r.json()

# Build one squad URL per club from its id and name.
print([
    f'https://www.premierleague.com/clubs/{int(t["club"]["id"])}/{t["club"]["name"].replace(" ","-")}/squad'
    for t in data["content"]
])
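To keep the same call shape as the original clubList(url, yearCode), the snippet above could be wrapped roughly as follows. This is a sketch under a few assumptions: the names season_from_year_code, build_squad_urls and club_list are made up here, the timeout value is arbitrary, and the JSON layout (content → club → id/name) is taken from the example above rather than from any official documentation. The sample payload at the bottom is invented so the URL-building step can be checked without hitting the network.

```python
from urllib.parse import parse_qs

import requests

API_URL = "https://footballapi.pulselive.com/football/teams"
HEADERS = {"account": "premierleague", "origin": "https://www.premierleague.com"}


def season_from_year_code(year_code):
    """Turn a query string such as "?se=418" into the integer season id 418."""
    return int(parse_qs(year_code.lstrip("?"))["se"][0])


def build_squad_urls(data, site_url):
    """Build one squad URL per team from the decoded JSON payload."""
    urls = []
    for team in data["content"]:
        club = team["club"]
        slug = club["name"].replace(" ", "-")
        urls.append(f"{site_url}/clubs/{int(club['id'])}/{slug}/squad")
    return urls


def club_list(url, year_code):
    """Same arguments as clubList, but backed by the API instead of the HTML page."""
    response = requests.get(
        API_URL,
        params={
            "pageSize": 100,
            "compSeasons": season_from_year_code(year_code),
            "compCodeForActivePlayer": "null",
            "comps": 1,
            "altIds": "true",
            "page": 0,
        },
        headers=HEADERS,
        timeout=30,  # arbitrary value chosen for this sketch
    )
    response.raise_for_status()  # fail loudly on an HTTP error status
    return build_squad_urls(response.json(), url)


# Offline check with an invented payload (the id and name are placeholders).
sample = {"content": [{"club": {"id": 1.0, "name": "Example United"}}]}
print(season_from_year_code("?se=418"))
print(build_squad_urls(sample, "https://www.premierleague.com"))
```

With this in place, club_list("https://www.premierleague.com", "?se=418") would be the drop-in call, and no time.sleep is needed because the season is selected by the API parameter rather than by JavaScript running in the page.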