How do I scrape a specific table with BeautifulSoup and turn it into a pandas DataFrame?



How can I use bs4 to get the "Per Game Stats" table on this page and turn it into a pandas DataFrame?

Here is what I have tried:

import requests
from bs4 import BeautifulSoup

url = 'https://www.basketball-reference.com/leagues/NBA_2021.html'
page = requests.get(url)
page
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())

but I am stuck at that point.

Thanks.

Use pd.read_html:

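import requests
from io import StringIO
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://www.basketball-reference.com/leagues/NBA_2021.html'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
# Locate the per-game table by its id attribute
table = soup.find('table', id='per_game-team')
# read_html returns a list of DataFrames; the markup is wrapped in StringIO
# because recent pandas versions deprecate passing a literal HTML string
df = pd.read_html(StringIO(str(table)))[0]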

The table you want has the id 'per_game-team'. You can find it with the inspector in your browser's developer tools.
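If you would rather discover the id from Python than from the browser, something like the following may help. It is only a sketch: the helper name `list_table_ids` and the sample markup are made up for illustration, and on the real page you would pass `page.content` instead of the sample string.

```python
from bs4 import BeautifulSoup

def list_table_ids(html):
    """Return the id attribute of every <table> that has one."""
    soup = BeautifulSoup(html, 'html.parser')
    # id=True keeps only the tables that actually carry an id attribute
    return [t['id'] for t in soup.find_all('table', id=True)]

# Hypothetical sample markup, for illustration only
sample_html = """
<table id="per_game-team"><tr><td>1</td></tr></table>
<table id="totals-team"><tr><td>2</td></tr></table>
<table><tr><td>no id</td></tr></table>
"""
print(list_table_ids(sample_html))
```

Any id printed this way can then be passed to soup.find('table', id=...).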

Output:

>>> df.head(10)
     Rk                     Team   G     MP  ...  BLK   TOV    PF    PTS
0   1.0         Milwaukee Bucks*  72  240.7  ...  4.6  13.8  17.3  120.1
1   2.0           Brooklyn Nets*  72  241.7  ...  5.3  13.5  19.0  118.6
2   3.0      Washington Wizards*  72  241.7  ...  4.1  14.4  21.6  116.6
3   4.0               Utah Jazz*  72  241.0  ...  5.2  14.2  18.5  116.4
4   5.0  Portland Trail Blazers*  72  240.3  ...  5.0  11.1  18.9  116.1
5   6.0            Phoenix Suns*  72  242.8  ...  4.3  12.5  19.1  115.3
6   7.0           Indiana Pacers  72  242.4  ...  6.4  13.5  20.2  115.3
7   8.0          Denver Nuggets*  72  242.8  ...  4.5  13.5  19.1  115.1
8   9.0     New Orleans Pelicans  72  242.1  ...  4.4  14.6  18.0  114.6
9  10.0    Los Angeles Clippers*  72  240.0  ...  4.1  13.2  19.2  114.0
[10 rows x 25 columns]

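pandas.read_html() is the way to go here (it parses the HTML for you, using lxml or BeautifulSoup under the hood). And since it also fetches the URL itself, you can simplify the solution Corral provided to: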

import pandas as pd
url = 'https://www.basketball-reference.com/leagues/NBA_2021.html'
df = pd.read_html(url, attrs = {'id': 'per_game-team'})[0]

But since you specifically asked how to build the DataFrame with bs4, I'll provide that solution as well.

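The basic logic/steps are as follows:

  1. Get the table tag
  2. From the table object, get the header names from the <th> tags under the <thead> tag
  3. Iterate over the rows (<tr> tags) and get the <td> content from each row

Code: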

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://www.basketball-reference.com/leagues/NBA_2021.html'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# 1. Get the table tag
table = soup.find('table', {'id':'per_game-team'})

# 2. Column names come from the <th> cells under <thead>
headers = [x.text for x in table.find('thead').find_all('th')]

# 3. Each body row holds the rank in a <th> and the other values in <td> cells
data = []
table_body_rows = table.find('tbody').find_all('tr')
for row in table_body_rows:
    rank = [row.find('th').text]
    row_data = rank + [x.text for x in row.find_all('td')]
    data.append(row_data)

df = pd.DataFrame(data, columns=headers)

Output:

print(df)
    Rk                     Team   G     MP    FG  ...  STL  BLK   TOV    PF    PTS
0    1         Milwaukee Bucks*  72  240.7  44.7  ...  8.1  4.6  13.8  17.3  120.1
1    2           Brooklyn Nets*  72  241.7  43.1  ...  6.7  5.3  13.5  19.0  118.6
2    3      Washington Wizards*  72  241.7  43.2  ...  7.3  4.1  14.4  21.6  116.6
3    4               Utah Jazz*  72  241.0  41.3  ...  6.6  5.2  14.2  18.5  116.4
4    5  Portland Trail Blazers*  72  240.3  41.3  ...  6.9  5.0  11.1  18.9  116.1
5    6            Phoenix Suns*  72  242.8  43.3  ...  7.2  4.3  12.5  19.1  115.3
6    7           Indiana Pacers  72  242.4  43.3  ...  8.5  6.4  13.5  20.2  115.3
7    8          Denver Nuggets*  72  242.8  43.3  ...  8.1  4.5  13.5  19.1  115.1
8    9     New Orleans Pelicans  72  242.1  42.5  ...  7.6  4.4  14.6  18.0  114.6
9   10    Los Angeles Clippers*  72  240.0  41.8  ...  7.1  4.1  13.2  19.2  114.0
10  11           Atlanta Hawks*  72  241.7  40.8  ...  7.0  4.8  13.2  19.3  113.7
11  12         Sacramento Kings  72  240.3  42.6  ...  7.5  5.0  13.4  19.4  113.7
12  13    Golden State Warriors  72  240.3  41.3  ...  8.2  4.8  15.0  21.2  113.7
13  14      Philadelphia 76ers*  72  242.1  41.4  ...  9.1  6.2  14.4  20.2  113.6
14  15       Memphis Grizzlies*  72  241.7  42.8  ...  9.1  5.1  13.3  18.7  113.3
15  16          Boston Celtics*  72  241.4  41.5  ...  7.7  5.3  14.1  20.4  112.6
16  17        Dallas Mavericks*  72  240.3  41.1  ...  6.3  4.3  12.1  19.4  112.4
17  18   Minnesota Timberwolves  72  241.7  40.7  ...  8.8  5.5  14.3  20.9  112.1
18  19          Toronto Raptors  72  240.3  39.7  ...  8.6  5.4  13.2  21.2  111.3
19  20        San Antonio Spurs  72  242.8  41.9  ...  7.0  5.1  11.4  18.0  111.1
20  21            Chicago Bulls  72  241.4  42.2  ...  6.7  4.2  15.1  18.9  110.7
21  22      Los Angeles Lakers*  72  242.4  40.6  ...  7.8  5.4  15.2  19.1  109.5
22  23        Charlotte Hornets  72  241.0  39.9  ...  7.8  4.8  14.8  18.0  109.5
23  24          Houston Rockets  72  240.3  39.3  ...  7.6  5.0  14.7  19.5  108.8
24  25              Miami Heat*  72  241.4  39.2  ...  7.9  4.0  14.1  18.9  108.1
25  26         New York Knicks*  72  242.1  39.4  ...  7.0  5.1  12.9  20.5  107.0
26  27          Detroit Pistons  72  242.1  38.7  ...  7.4  5.2  14.9  20.5  106.6
27  28    Oklahoma City Thunder  72  241.0  38.8  ...  7.0  4.4  16.1  18.1  105.0
28  29            Orlando Magic  72  240.7  38.3  ...  6.9  4.4  12.8  17.2  104.0
29  30      Cleveland Cavaliers  72  242.1  38.6  ...  7.8  4.5  15.5  18.2  103.8
[30 rows x 25 columns]
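One thing to keep in mind with the bs4 route: every cell arrives as text (Rk prints as 1 here, not 1.0 as in the read_html output), so arithmetic on the columns will not work until they are converted. A minimal sketch of one way to do that, assuming 'Team' is the only column that should stay text; the helper name `to_numeric_columns` is made up for illustration, and the small frame only mimics the scraped result:

```python
import pandas as pd

def to_numeric_columns(df, skip=('Team',)):
    """Return a copy with every column except those in `skip` converted to numbers."""
    out = df.copy()
    for col in out.columns:
        if col not in skip:
            # errors='coerce' turns unparseable cells into NaN instead of raising
            out[col] = pd.to_numeric(out[col], errors='coerce')
    return out

# Small illustrative frame shaped like the scraped result (all strings)
raw = pd.DataFrame(
    [['1', 'Milwaukee Bucks*', '72', '120.1'],
     ['2', 'Brooklyn Nets*', '72', '118.6']],
    columns=['Rk', 'Team', 'G', 'PTS'],
)
clean = to_numeric_columns(raw)
print(clean.dtypes)
```

On the real data you would call to_numeric_columns(df) on the DataFrame built above.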
