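I am writing Python code to scrape the following website and find the value of "total_pages".
URL: https://www.usnews.com/best-colleges/fl
When I open the site in a browser and inspect the page source, the value of "total_pages" is 8. I want my Python code to get the same value.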
I wrote the following code:
import requests
from bs4 import BeautifulSoup

# Browser-like headers, so the request is less likely to be rejected.
headers = {'User-Agent': 'Mozilla/5.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
main_site = requests.get("https://www.usnews.com/best-colleges/fl", headers=headers)
main_site_content = main_site.content
main_site_content_soup = BeautifulSoup(main_site_content, "html.parser")
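But I am stuck on how to find "total_pages" in the parsed data. I tried the find_all() method, but had no luck. I think I am not using the method correctly.
Note: The solution does not have to use BeautifulSoup. I only used it because I am somewhat familiar with it.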
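One way to look for the key without relying on tag structure is to search the raw page text with a regular expression. This is only a minimal sketch: it assumes "total_pages" appears in the source as a JSON-style key followed by a number, and the helper name find_total_pages and the sample string are made up for illustration.

```python
import re


def find_total_pages(html):
    """Return the integer that follows a "total_pages" key in raw text, or None."""
    # Matches "total_pages": 8 as well as "total_pages": "8".
    match = re.search(r'"total_pages"\s*:\s*"?(\d+)', html)
    return int(match.group(1)) if match else None


# Hypothetical sample; the real page would embed far more data around the key.
sample = '<script>window.data = {"total_pages": 8, "page": 1};</script>'
print(find_total_pages(sample))  # 8
```

With the code above, the function would be called as find_total_pages(main_site.text). If it returns None, the key is probably not present in the HTML that requests receives.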
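You don't need BeautifulSoup. Here I send requests to their API to get the list of universities.
The line "from rich import print" is used to pretty-print the JSON, which should make it easier to read.
If you need more help or advice, leave a comment below.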
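If all you want is the "total_pages" value, a small recursive lookup can search the parsed JSON without knowing its exact nesting. This is a sketch under an assumption I have not verified: that the API response contains a key with that name somewhere. The helper find_key and the sample payload are hypothetical.

```python
def find_key(obj, key):
    """Depth-first search of nested dicts/lists; return the first value found for key."""
    if isinstance(obj, dict):
        if key in obj:
            return obj[key]
        children = obj.values()
    elif isinstance(obj, list):
        children = obj
    else:
        return None  # scalars cannot contain the key
    for child in children:
        found = find_key(child, key)
        if found is not None:
            return found
    return None


# Hypothetical payload shape, for illustration only.
payload = {"data": {"items": [], "meta": {"total_pages": 8}}}
print(find_key(payload, "total_pages"))  # 8
```

On a real response this would be called as find_key(response.json(), "total_pages"). A result of None means the key is absent, or that its value is null.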
import pandas as pd
import requests
from rich import print

LINK = "https://www.usnews.com/best-colleges/api/search?format=json&location=Florida&_sort=rank&_sortDirection=asc&_page=1"


def get_data(url):
    """Fetch one page of results; return (items, next_link), or (None, None) on failure."""
    print("Making request to:", url)
    response = requests.get(url, timeout=5, headers={"User-Agent": "Mozilla/5.0"})
    if response.status_code == 200:
        print("Request Successful!")
        data = response.json()["data"]
        return data["items"], data["next_link"]
    print("Request failed!")
    return None, None


def main():
    print("Starting Scraping...")
    items, next_link = get_data(LINK)
    if items is None:
        return  # nothing to process if the first request failed

    # if there's a `next_link`, scrape it.
    while next_link is not None:
        print("Getting data from:", next_link)
        new_items, next_link = get_data(next_link)
        if new_items is None:
            break  # stop paging if a request fails
        items += new_items

    # cleaning the data, for the pandas dataframe.
    items = [
        {
            "name": item["institution"]["displayName"],
            "state": item["institution"]["state"],
            "city": item["institution"]["city"],
        }
        for item in items
    ]
    df = pd.DataFrame(items)
    # to_markdown() requires the optional `tabulate` package.
    print(df.to_markdown())


if __name__ == "__main__":
    main()
The output looks like this: