Scraping a website with multiple pages where paging uses __doPostBack and the page URL does not change



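I am using BeautifulSoup to scrape data from https://excise.wb.gov.in/chms/Public/Page/CHMS_Public_Hospital_Bed_Availability.aspx?Public_District_Code=019
There are two pages of information in total. To move between them there are pager links at the top and bottom, such as 1, 2. These links use __doPostBack: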

href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$GridView2','Page$2')">

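The problem is that the URL does not change when navigating from one page to another. Only the second argument of __doPostBack changes: it is Page$1 for page 1 and Page$2 for page 2. How can I iterate over the pages with BeautifulSoup and extract the information? The form data is as follows.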

ctl00$ScriptManager1: ctl00$ContentPlaceHolder1$UpdatePanel1|ctl00$ContentPlaceHolder1$GridView2
ctl00$ContentPlaceHolder1$ddl_District: 019
ctl00$ContentPlaceHolder1$rdo_Govt_Flag: G
__EVENTTARGET: ctl00$ContentPlaceHolder1$GridView2
__EVENTARGUMENT: Page$2

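The form data also contains a variable named __VIEWSTATE, but its content is very large. I have looked at several solutions and posts that suggest inspecting the parameters of the POST call and reusing them, but I cannot make sense of the parameters in this POST request.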

You can use this example of how to load the next page of this site with requests:

import requests
from bs4 import BeautifulSoup

url = "https://excise.wb.gov.in/chms/Public/Page/CHMS_Public_Hospital_Bed_Availability.aspx?Public_District_Code=019"
soup = BeautifulSoup(requests.get(url).content, "html.parser")


def load_page(soup, page_num):
    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:88.0) Gecko/20100101 Firefox/88.0",
    }

    # start from the form fields already on the page
    # (this picks up __VIEWSTATE and the other hidden inputs)
    payload = {}
    for inp in soup.select("input[name]"):
        payload[inp["name"]] = inp.get("value")

    # then set the fields that describe the pager postback, so the
    # page's own empty hidden inputs cannot overwrite them
    payload.update(
        {
            "ctl00$ScriptManager1": "ctl00$ContentPlaceHolder1$UpdatePanel1|ctl00$ContentPlaceHolder1$GridView2",
            "__EVENTTARGET": "ctl00$ContentPlaceHolder1$GridView2",
            "__EVENTARGUMENT": "Page${}".format(page_num),
            "__LASTFOCUS": "",
            "__ASYNCPOST": "true",
        }
    )
    payload["ctl00$ContentPlaceHolder1$ddl_District"] = "019"
    payload["ctl00$ContentPlaceHolder1$rdo_Govt_Flag"] = "G"
    # chk_Available does not appear in the browser's form data, so drop it
    payload.pop("ctl00$ContentPlaceHolder1$chk_Available", None)

    api_url = "https://excise.wb.gov.in/chms/Public/Page/CHMS_Public_Hospital_Bed_Availability.aspx?Public_District_Code=019"
    soup = BeautifulSoup(
        requests.post(api_url, data=payload, headers=headers).content,
        "html.parser",
    )
    return soup


# print hospitals from first page:
for h5 in soup.select("h5"):
    print(h5.text)

# load second page
soup = load_page(soup, 2)

# print hospitals from second page
for h5 in soup.select("h5"):
    print(h5.text)

Prints:

 AMRI, Salt Lake - Vivekananda Yuba Bharati Krirangan Salt Lake Stadium (Satellite Govt. Building)
 Calcutta National Medical College and Hospital (Government Hospital)
 CHITTARANJAN NATIONAL CANCER INSTITUTE-CNCI (Government Hospital)
 College of Medicine  Sagore Dutta Hospital (Government Hospital)
 ESI Hospital Maniktala (Government Hospital)
 ESI Hospital Sealdah (Government Hospital)
 I.D. And B.G. Hospital (Government Hospital)
 M R Bangur Hospital (Government Hospital)
 Medical College and Hospital, Kolkata, (Government Hospital)
 Nil Ratan Sarkar Medical College and Hospital (Government Hospital)
 R. G. Kar Medical College and Hospital  (Government Hospital)
 Sambhunath Pandit Hospital (Government Hospital)
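
If you do not want to hard-code the page number, one option is to read the available pages from the pager links themselves. The following is only a sketch: the helper names parse_postback and pager_page_numbers are my own (they are not part of requests or BeautifulSoup), and it assumes every pager href has the same shape as the __doPostBack link shown in the question.

```python
import re

# Matches hrefs such as:
#   javascript:__doPostBack('ctl00$ContentPlaceHolder1$GridView2','Page$2')
POSTBACK_RE = re.compile(r"__doPostBack\('([^']*)','([^']*)'\)")


def parse_postback(href):
    """Return (event_target, event_argument) from a __doPostBack href, or None."""
    match = POSTBACK_RE.search(href)
    return match.groups() if match else None


def pager_page_numbers(hrefs):
    """Return the sorted, de-duplicated page numbers found in 'Page$N' arguments."""
    pages = set()
    for href in hrefs:
        parsed = parse_postback(href)
        if parsed is None:
            continue  # not a postback link
        argument = parsed[1]
        if argument.startswith("Page$"):
            suffix = argument[len("Page$"):]
            if suffix.isdigit():  # skips values such as Page$Next
                pages.add(int(suffix))
    return sorted(pages)
```

With the first-page soup from above, you could collect the links with `hrefs = [a["href"] for a in soup.select("a[href*='__doPostBack']")]` and then call `load_page(soup, n)` for each `n` in `pager_page_numbers(hrefs)`. The current page is typically not rendered as a link, so page 1 will probably be missing from that list while you are on page 1. I have only tried the two-page case shown above, so treat reusing the first page's form fields for every request as an assumption.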
