Requests and Beautiful Soup cannot find an element on a website



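I am currently scraping pro-football-reference.com with requests and Beautiful Soup, and I have run into an element that my code cannot find. The exact URL is https://www.pro-football-reference.com/boxscores/201809060phi.htm, and the code is as follows: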

import requests
from bs4 import BeautifulSoup
game_page = requests.get('https://www.pro-football-reference.com/boxscores/201809060phi.htm')
game_page_soup = BeautifulSoup(game_page.content, 'html.parser')
game_info = game_page_soup.find(id='game_info')
print(game_info)

The output is None, but this lookup should return:

<table class="suppress_all sortable stats_table now_sortable" id="game_info" data-cols-to-freeze="0"><thead><tr class="thead onecell"><td class="right center" data-stat="onecell" colspan="2">Game Info</td></tr></thead>
<caption>Game Info Table</caption>
<tbody>
<tr data-row="0"><th scope="row" class="center " data-stat="info">Won Toss</th><td class="center " data-stat="stat">Eagles (deferred)</td></tr>
<tr data-row="1"><th scope="row" class="center " data-stat="info">Roof</th><td class="center " data-stat="stat">outdoors</td></tr>
<tr data-row="2"><th scope="row" class="center " data-stat="info">Surface</th><td class="center " data-stat="stat">grass </td></tr>
<tr data-row="3"><th scope="row" class="center " data-stat="info">Duration</th><td class="center " data-stat="stat">3:19</td></tr>
<tr data-row="4"><th scope="row" class="center " data-stat="info">Attendance</th><td class="center " data-stat="stat"><a href="/years/2018/attendance.htm">69,696</a></td></tr>
<tr data-row="5"><th scope="row" class="center " data-stat="info">Weather</th><td class="center " data-stat="stat">81 degrees, wind 8 mph</td></tr>
<tr data-row="6"><th scope="row" class="center " data-stat="info">Vegas Line</th><td class="center " data-stat="stat">Philadelphia Eagles -1.0</td></tr>
<tr data-row="7"><th scope="row" class="center " data-stat="info">Over/Under</th><td class="center " data-stat="stat">44.5 <b>(under)</b></td></tr>
</tbody></table>

Why isn't it returned?

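The table is inside an HTML comment (<!-- ... -->), so the parser sees it as comment text rather than as elements, and find(id='game_info') has nothing to match. To load it, you can use the following example: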

import requests
from bs4 import BeautifulSoup, Comment

url = "https://www.pro-football-reference.com/boxscores/201809060phi.htm"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

# find the first HTML comment <!-- --> after the "Game Info" heading
table = soup.find("h2", string="Game Info").find_next(
    string=lambda t: isinstance(t, Comment)
)
# parse the comment's text as HTML and take the table from it
table = BeautifulSoup(table, "html.parser").table

# print some data from the table:
for tr in table.select("tr"):
    print(tr.get_text(strip=True, separator=" "))

This prints:

Game Info
Won Toss Eagles (deferred)
Roof outdoors
Surface grass
Duration 3:19
Attendance 69,696
Weather 81 degrees, wind 8 mph
Vegas Line Philadelphia Eagles -1.0
Over/Under 44.5 (under)
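To see why the original lookup returns None without hitting the network, here is a minimal sketch on an inline snippet. The `html` string is a hypothetical, trimmed-down stand-in for the real page markup, not a copy of it; the behaviour it illustrates is the same: markup inside a comment is a single text node until it is parsed again.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical, trimmed-down stand-in for the real page markup.
html = """
<div id="all_game_info">
<h2>Game Info</h2>
<!--
<table id="game_info">
<tbody>
<tr><th>Roof</th><td>outdoors</td></tr>
<tr><th>Duration</th><td>3:19</td></tr>
</tbody>
</table>
-->
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# The table is comment text, not an element, so a normal lookup finds nothing.
direct = soup.find(id="game_info")

# Pull out the comment node and parse its text as a separate document.
comment = soup.find(string=lambda s: isinstance(s, Comment))
inner = BeautifulSoup(comment, "html.parser")
recovered = inner.find(id="game_info")

# Pair each row's header cell with its value cell.
rows = {
    tr.th.get_text(strip=True): tr.td.get_text(strip=True)
    for tr in recovered.find_all("tr")
}
print(direct)
print(rows)
```

The first print shows the failed direct lookup; the second shows the rows recovered from the re-parsed comment.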

You can try:

import requests
from bs4 import BeautifulSoup, Comment

url = "https://www.pro-football-reference.com/boxscores/201809060phi.htm"
soup = BeautifulSoup(requests.get(url).content, "lxml")
wrapper = soup.find("div", attrs={"id": "all_game_info"})

# Approach 1: select the commented-out HTML inside <div id="all_game_info"> using bs4.Comment
table = wrapper.find(string=lambda text: isinstance(text, Comment))

# Approach 2: split the div's markup on "--" from the right (at most 2 splits)
# and take the middle item, which is the comment body. This overrides Approach 1.
table = str(wrapper).rsplit("--", 2)[1]

table = BeautifulSoup(table, "lxml")
# Iterate over the body rows so each th is paired with the td from the same row
for tr in table.select("tbody tr"):
    print(tr.th.text, " : ", tr.td.text)

Both approaches produce the same output:

Won Toss  :  Eagles (deferred)
Roof  :  outdoors
Surface  :  grass
Duration  :  3:19
Attendance  :  69,696
Weather  :  81 degrees, wind 8 mph
Vegas Line  :  Philadelphia Eagles -1.0
Over/Under  :  44.5 (under)
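A third option, if several tables on a page are hidden this way, is to strip the comment delimiters from the raw HTML before parsing. The sketch below is only an illustration: `uncomment` and `sample` are hypothetical names and data, and the approach assumes you are happy to expose every comment on the page, including ones that were commented out on purpose.

```python
import re

# Matches only the comment delimiters, leaving the commented-out markup in place.
COMMENT_MARKERS = re.compile(r"<!--|-->")


def uncomment(html: str) -> str:
    """Return html with every comment opener and closer removed."""
    return COMMENT_MARKERS.sub("", html)


# Hypothetical one-line stand-in for the real page markup.
sample = (
    '<div id="all_game_info"><!-- <table id="game_info">'
    "<tr><td>outdoors</td></tr></table> --></div>"
)
exposed = uncomment(sample)
print(exposed)
```

On the real page, the text returned by requests could be passed through `uncomment` before handing it to BeautifulSoup, after which find(id='game_info') should see the table as an ordinary element.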