How do I iterate through all <th> tags in a script for web scraping?



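So far, I only get ['1'] as the output of what my current code below prints. I want to grab 1-54 from the Rk column of the Team Batting table on https://www.baseball-reference.com/teams/NYY/2019.shtml.

How would I modify colNum so that it prints 1-54 from the Rk column? I am pointing at the colNum line because I feel the problem lies there, but I could be wrong.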

import pandas as pd
import requests
from bs4 import BeautifulSoup
page = requests.get('https://www.baseball-reference.com/teams/NYY/2019.shtml')
soup = BeautifulSoup(page.content, 'html.parser')  # parse as HTML page, this is the source code of the page
week = soup.find(class_='table_outer_container')
items = week.find("thead").get_text() # grabs table headers
th = week.find("th").get_text() # grabs Rk only.
tbody = week.find("tbody")
tr = tbody.find("tr")
thtwo = tr.find("th").get_text()
colNum = [thtwo for thtwo in thtwo]
print(colNum)

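As you mentioned, your error is in the last few lines. If I understand correctly, you want a list of all the values in the "Rk" column. tbody.find("tr") returns only the first row, so thtwo is the string '1', and the list comprehension merely iterates over the characters of that string. To get all the rows, you have to use the find_all() function. I tweaked your code slightly so that the following lines get the text of the first field in each row: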

import pandas as pd
import requests
from bs4 import BeautifulSoup
page = requests.get('https://www.baseball-reference.com/teams/NYY/2019.shtml')
soup = BeautifulSoup(page.content, 'html.parser')  # parse as HTML page, this is the source code of the page
week = soup.find(class_='table_outer_container')
items = week.find("thead").get_text()
th = week.find("th").get_text()
tbody = week.find("tbody")
tr = tbody.find_all("tr")
colnum = [row.find("th").get_text() for row in tr]
print(colnum)
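
To see the difference between find() and find_all() without hitting the network, here is a minimal sketch that runs against a small inline HTML snippet. The snippet, the player names, and the helper rank_column are hypothetical stand-ins, not the real page markup. The isdigit() filter is an extra precaution of mine, assuming the real table might repeat its header row inside the body; drop it if that does not apply.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the real table markup.
SAMPLE_HTML = """
<div class="table_outer_container">
  <table>
    <thead><tr><th>Rk</th><th>Name</th></tr></thead>
    <tbody>
      <tr><th>1</th><td>Player A</td></tr>
      <tr><th>2</th><td>Player B</td></tr>
      <tr><th>Rk</th><td>Name</td></tr>
      <tr><th>3</th><td>Player C</td></tr>
    </tbody>
  </table>
</div>
"""


def rank_column(html):
    """Return the numeric text of the first <th> in every body row."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find(class_="table_outer_container")
    rows = container.find("tbody").find_all("tr")  # every row, not just the first
    values = []
    for row in rows:
        cell = row.find("th")
        if cell is None:
            continue  # row without a <th> cell
        text = cell.get_text(strip=True)
        if text.isdigit():  # assumption: skip repeated header rows such as "Rk"
            values.append(text)
    return values


# find() returns only the first match, so this is the single string "1".
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
first_only = soup.find("tbody").find("tr").find("th").get_text()

print(list(first_only))          # iterating over a string yields its characters
print(rank_column(SAMPLE_HTML))  # one value per row
```

The first print reproduces the ['1'] from the question, and the second lists one rank per row.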
