How do I scrape a page with Beautiful Soup when the page source doesn't match Inspect Element?

I'm trying to scrape some data from this fantasy basketball page. I'm using BeautifulSoup with Python 3.5+ to do it.

import requests
from bs4 import BeautifulSoup

source_code = requests.get('http://fantasy.espn.com/basketball/league/standings?leagueId=633975')
plain_text = source_code.text
soup = BeautifulSoup(plain_text, 'lxml')

First, I want to scrape the headers of the 9 categories into a Python list, so my list should look like categories = ['FG%', 'FT%', '3PM', 'REB', 'AST', 'STL', 'BLK', 'TO', 'PTS'].

What I was hoping to do is something like this:

tableSubHead = soup.find_all('tr', class_='Table2__header-row')
tableSubHead = tableSubHead[0]
listCats = tableSubHead.find_all('th')
categories = []
for cat in listCats:
    if 'title' in cat.attrs:
        categories.append(cat.string)

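However, soup.find_all('tr', class_='Table2__header-row') returns an empty list instead of the table row elements I want. I suspect this is because the page source looks completely different from what Inspect Element shows in Chrome DevTools. I understand that this happens because JavaScript changes the elements on the page dynamically, but I'm not sure what the solution is.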
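One quick way to confirm that suspicion is to check whether the class name appears anywhere in the HTML that requests actually receives. The sketch below uses only the standard library; the two HTML strings are hypothetical stand-ins (not the real ESPN markup) for an unrendered page shell and a browser-rendered page, and `html_has_class` is an illustrative helper name.

```python
from html.parser import HTMLParser


class ClassFinder(HTMLParser):
    """Records whether any tag in the document carries a given CSS class."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.found = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        for name, value in attrs:
            if name == "class" and value and self.class_name in value.split():
                self.found = True


def html_has_class(html, class_name):
    """Return True if the HTML text contains a tag with the given class."""
    finder = ClassFinder(class_name)
    finder.feed(html)
    return finder.found


# Hypothetical samples: what requests might receive vs. what the browser renders.
raw_shell = '<html><body><div id="app"></div><script src="app.js"></script></body></html>'
rendered = ('<html><body><table><tr class="Table2__header-row">'
            '<th title="FG%">FG%</th></tr></table></body></html>')

print(html_has_class(raw_shell, "Table2__header-row"))  # False
print(html_has_class(rendered, "Table2__header-row"))   # True
```

Running `html_has_class(source_code.text, 'Table2__header-row')` on the real response would show whether the table rows are present before any JavaScript runs.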

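The problem you're facing is that the site is a web application, which means JavaScript has to run to generate the content you see, and you can't run JavaScript with requests. Here is what I did to get results with Selenium, which opens a headless browser and waits a while so that the JavaScript runs first: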

from bs4 import BeautifulSoup
from selenium import webdriver
import time

# Run Chrome without a visible window.
options = webdriver.ChromeOptions()
options.add_argument('--headless')
# Don't block on the page load; we wait manually below.
# (Selenium 4 style; older releases passed this through DesiredCapabilities.)
options.page_load_strategy = 'none'
driver = webdriver.Chrome(options=options)
driver.set_window_size(1440, 900)
driver.get('http://fantasy.espn.com/basketball/league/standings?leagueId=633975')
time.sleep(15)  # give the page's JavaScript time to build the table
plain_text = driver.page_source
driver.quit()  # close the browser once the rendered HTML is captured
soup = BeautifulSoup(plain_text, 'lxml')
soup.select('.Table2__header-row') # Returns full results.
len(soup.select('.Table2__header-row')) # 8

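This approach lets you work with sites built as web applications and greatly expands what you can scrape. You can even add commands to execute, such as scrolling or clicking, to load more content on the fly.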

Install Selenium with pip install selenium. It also lets you use Firefox if you prefer that browser.
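The fixed time.sleep(15) wastes time when the page is ready early and can still fail when it is slow. A common alternative is to poll for a condition until it holds or a timeout expires. The helper below is a minimal, library-independent sketch of that idea; `wait_until` is a hypothetical name, and Selenium ships its own `WebDriverWait` class for the same purpose.

```python
import time


def wait_until(condition, timeout=15.0, poll=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f seconds" % timeout)
        time.sleep(poll)


# Hypothetical usage with the driver from the code above
# (needs: from selenium.webdriver.common.by import By):
# rows = wait_until(
#     lambda: driver.find_elements(By.CSS_SELECTOR, ".Table2__header-row")
# )
```

Because an empty list is falsy, the lambda keeps polling until at least one matching row exists, and the helper then returns the rows it found.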

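This may not be exactly what you're looking for, but since the page source contains none of the content, it isn't much use for scraping. However, when the scoreboard loads, the site apparently makes several API calls, and these most likely contain all the data you're after.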

Here is one API call that seems to contain all the information you're looking for.

import requests
# Each entry in "view" is sent as its own view=... query parameter.
payload = {"view":["mMatchupScore","mScoreboard","mSettings","mTeam","modular","mNav"]}
r = requests.get("http://fantasy.espn.com/apis/v3/games/fba/seasons/2019/segments/0/leagues/633975", params=payload).json()
# r is the decoded JSON (a Python dict) with all the data in it
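To inspect the response in a browser before writing any parsing code, it can help to see the full URL that this request produces. The sketch below rebuilds it with the standard library, assuming the same endpoint and view names as above; `build_url` is a hypothetical helper, and `doseq=True` repeats the `view` key once per list entry, which is also how requests encodes a list value.

```python
from urllib.parse import urlencode

BASE_URL = "http://fantasy.espn.com/apis/v3/games/fba/seasons/2019/segments/0/leagues/633975"
VIEWS = ["mMatchupScore", "mScoreboard", "mSettings", "mTeam", "modular", "mNav"]


def build_url(base_url, views):
    """Append one view=... pair per entry in `views` to the base URL."""
    query = urlencode({"view": views}, doseq=True)
    return base_url + "?" + query


url = build_url(BASE_URL, VIEWS)
print(url)
```

Opening the printed URL in a browser shows the raw JSON, which makes it easier to find where the category values live before indexing into `r`.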
