Scraping an HTML website with BeautifulSoup and finding the value of "total_pages" in it



I'm writing Python code to scrape the following website and find the value of "total_pages".

URL: https://www.usnews.com/best-colleges/fl

When I open the website in a browser and inspect the source code, the value of "total_pages" is 8. I want my Python code to get the same value.

I wrote the following code:
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
main_site = requests.get("https://www.usnews.com/best-colleges/fl", headers=headers)
main_site_content = main_site.content
main_site_content_soup = BeautifulSoup(main_site_content, "html.parser")

But I'm stuck on how to find "total_pages" in the parsed data. I've tried the find_all() method with no luck; I think I'm not using it correctly.

Note: the solution doesn't have to use BeautifulSoup. I only used BeautifulSoup because I'm somewhat familiar with it.
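For reference, when a value like "total_pages" lives inside an inline &lt;script&gt; tag rather than in a visible HTML element, find_all() on ordinary tags won't surface it; one approach is to scan the script text with a regular expression. This is only a sketch against a made-up HTML snippet (the live page's markup may differ):

```python
import re
from bs4 import BeautifulSoup

# Hypothetical snippet mimicking a page that embeds pagination data in inline JS.
html = """
<html><body>
<script>window.__DATA__ = {"search": {"total_pages": 8, "page": 1}};</script>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

total_pages = None
for script in soup.find_all("script"):
    text = script.string or ""  # .string is the script tag's text content
    match = re.search(r'"total_pages"\s*:\s*(\d+)', text)
    if match:
        total_pages = int(match.group(1))
        break

print(total_pages)  # 8 for this sample snippet
```

The same regex could be applied to main_site_content directly, skipping BeautifulSoup entirely.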

BeautifulSoup isn't needed here. Instead, I make requests to their API to get the list of colleges.

from rich import print is used to pretty-print the JSON; it should make it easier to read.

If you need more help or advice, leave a comment below.

import requests
import pandas as pd  # needed for the dataframe at the end
from rich import print

LINK = "https://www.usnews.com/best-colleges/api/search?format=json&location=Florida&_sort=rank&_sortDirection=asc&_page=1"

def get_data(url):
    print("Making request to:", url)
    response = requests.get(url, timeout=5, headers={"User-Agent": "Mozilla/5.0"})
    if response.status_code == 200:
        print("Request Successful!")
        data = response.json()["data"]
        return data["items"], data["next_link"]
    print("Request failed!")
    return None, None

def main():
    print("Starting Scraping...")
    items, next_link = get_data(LINK)
    # if there's a `next_link`, scrape it.
    while next_link is not None:
        print("Getting data from:", next_link)
        new_items, next_link = get_data(next_link)
        items += new_items
    # cleaning the data, for the pandas dataframe.
    items = [
        {
            "name": item["institution"]["displayName"],
            "state": item["institution"]["state"],
            "city": item["institution"]["city"],
        }
        for item in items
    ]
    df = pd.DataFrame(items)
    print(df.to_markdown())

if __name__ == "__main__":
    main()
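If what you actually need is the total_pages number itself, the same next_link loop can simply count pages instead of collecting items. The sketch below stubs out the HTTP call with invented sample data so the counting logic can be shown (and tested) offline:

```python
# Invented sample data standing in for the API: url -> (next_link, items).
fake_pages = {
    "page1": ("page2", ["a", "b"]),
    "page2": ("page3", ["c"]),
    "page3": (None, ["d"]),  # last page: no next_link
}

def get_data_stub(url):
    # Stands in for the real get_data() that calls requests.get(...).json().
    next_link, items = fake_pages[url]
    return items, next_link

def count_pages(start):
    total_pages = 0
    url = start
    while url is not None:
        _, url = get_data_stub(url)  # follow next_link until it is None
        total_pages += 1
    return total_pages

print(count_pages("page1"))  # 3 pages in this stubbed example
```

Swap get_data_stub for the real get_data and the loop yields the live page count.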

The output is a markdown table of the scraped colleges.