Difficulty scraping data from an ASP.NET form on a web page (.aspx) using Python requests



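I am trying to scrape data from a multi-page table that is returned after filling in a form. The URL of the original form in question is https://ndber.seai.ie/Pass/assessors/search.aspx

From https://kaijento.github.io/2017/05/04/web-scraping-requests-eventtarget-viewstate/ I got the code that extracts the hidden variables from the blank form. These variables are then sent along with the POST request to get the data: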

import requests
from bs4 import BeautifulSoup
url='https://ndber.seai.ie/PASS/Assessors/Search.aspx'
with requests.session() as s:
    s.headers['user-agent'] = 'Mozilla/5.0'
    r    = s.get(url)
    soup = BeautifulSoup(r.content, 'html5lib')
    target = 'ctl00$DefaultContent$AssessorSearch$gridAssessors$grid_pager'
    # unsupported CSS Selector 'input[name^=ctl00][value]'
    data = { tag['name']: tag['value']
        for tag in soup.select('input[name^=ctl00]') if tag.get('value')
    }
    state = { tag['name']: tag['value']
        for tag in soup.select('input[name^=__]')
    }
    data.update(state)
    data['__EVENTTARGET'] = ''
    data['__EVENTARGUMENT'] = ''
    print(data)
    r = s.post(url, data=data)
    new_soup = BeautifulSoup(r.content, 'html5lib')
    print(new_soup)

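The initial .get works fine: I get the HTML of the blank form and can extract the parameters into data.

However, the .post returns an HTML page indicating that an error occurred, with no useful data.

Note that the results are split across multiple pages. When you move from one page to another, the following parameters are given these values: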

data['__EVENTTARGET'] = 'ctl00$DefaultContent$AssessorSearch$gridAssessors$grid_pager'
data['__EVENTARGUMENT'] = '1$n' # where n is the number of the page to retrieve
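As a rough sketch of how those two paging fields might be set for an arbitrary page, the helper below copies a dictionary of form fields and fills in the pager target and argument. The function name build_pager_payload and the dummy field values are illustrative assumptions; only the target string and the '1$n' argument format come from the values quoted above.

```python
# Pager control name quoted in the question.
PAGER_TARGET = 'ctl00$DefaultContent$AssessorSearch$gridAssessors$grid_pager'


def build_pager_payload(form_fields, page):
    """Return a copy of form_fields with the paging event fields set for `page`.

    Hypothetical helper: it only builds the dictionary, it sends nothing.
    """
    if page < 1:
        raise ValueError("page numbers start at 1")
    payload = dict(form_fields)  # copy so the caller's dict is not mutated
    payload['__EVENTTARGET'] = PAGER_TARGET
    payload['__EVENTARGUMENT'] = f'1${page}'  # '1$n', n = page to retrieve
    return payload


# Dummy values standing in for the hidden fields scraped from the form.
fields = {'__VIEWSTATE': 'dummy-viewstate', '__EVENTTARGET': ''}
page_three = build_pager_payload(fields, 3)
print(page_three['__EVENTARGUMENT'])
```

The resulting dictionary would be passed as data= to the session's post call, together with whatever other fields the server expects.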

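In my script above, I am initially just trying to get the first page of results. Once that works, I will work out the loop that goes through all the result pages and joins them together.

Does anyone know how to handle a case like this?

Thanks /Colm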

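You can use the requests module to fetch the tabular content from that site across multiple pages. To do so, you have to send multiple POST requests with the appropriate parameters.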

Unlike the other parameters, there is one key, ctl00$DefaultContent$AssessorSearch$captcha, whose value is generated dynamically and is not present in the page source.

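However, you can still get the value of that key using the requests_html library. You only need to call the get_captcha_value() function once to obtain the captcha value; you can then reuse the same value until the end.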
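To check for yourself which inputs arrive without a value in the static page source, a small sketch like the following may help. It uses only the standard library, and the sample HTML is a made-up stand-in for the real page (the real markup may differ); InputCollector and fields_without_value are hypothetical names.

```python
from html.parser import HTMLParser


class InputCollector(HTMLParser):
    """Collect the name and value of every <input> element in a document."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != 'input':
            return
        attributes = dict(attrs)
        name = attributes.get('name')
        if name:
            # value is None when the attribute is absent from the markup
            self.fields[name] = attributes.get('value')


# Assumed sample markup, not copied from the real site.
SAMPLE_HTML = '''
<form>
  <input type="hidden" name="__VIEWSTATE" value="dummy-state" />
  <input type="hidden" name="ctl00$DefaultContent$AssessorSearch$captcha" />
  <input type="submit" name="ctl00$DefaultContent$AssessorSearch$dfSearch$Bottomsearch" value="Search" />
</form>
'''


def fields_without_value(html):
    """Return the sorted names of inputs whose value is missing or empty."""
    collector = InputCollector()
    collector.feed(html)
    return sorted(name for name, value in collector.fields.items() if not value)


missing = fields_without_value(SAMPLE_HTML)
print(missing)
```

Any name reported here is a candidate for a value that the browser fills in with JavaScript, which plain requests cannot reproduce.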

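The script below currently fetches all the names from all the pages. You can modify the selectors to get the other fields you are interested in.

This is how you can go about it: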

import requests
from bs4 import BeautifulSoup
from requests_html import HTMLSession

link = 'https://ndber.seai.ie/Pass/assessors/search.aspx'

# Search form fields; the empty values leave the name, company and county unfiltered
payload = {
    'ctl00$DefaultContent$AssessorSearch$dfSearch$Name': '',
    'ctl00$DefaultContent$AssessorSearch$dfSearch$CompanyName': '',
    'ctl00$DefaultContent$AssessorSearch$dfSearch$County': '',
    'ctl00$DefaultContent$AssessorSearch$dfSearch$searchType': 'rbnDomestic',
    'ctl00$DefaultContent$AssessorSearch$dfSearch$Bottomsearch': 'Search'
}
page = 1

def get_captcha_value():
    # The captcha value is generated dynamically, so render the page before reading it
    with HTMLSession() as session:
        r = session.get(link)
        r.html.render(sleep=5)
        captcha_value = r.html.find("input[name$='$AssessorSearch$captcha']",first=True).attrs['value']
    return captcha_value

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36'
    r = s.get(link)
    soup = BeautifulSoup(r.text,"lxml")
    # Copy the hidden state fields from the blank form into the payload
    payload['__VIEWSTATE'] = soup.select_one("#__VIEWSTATE")['value']
    payload['__VIEWSTATEGENERATOR'] = soup.select_one("#__VIEWSTATEGENERATOR")['value']
    payload['__EVENTVALIDATION'] = soup.select_one("#__EVENTVALIDATION")['value']
    payload['ctl00$forgeryToken'] = soup.select_one("#ctl00_forgeryToken")['value']
    payload['ctl00$DefaultContent$AssessorSearch$captcha'] = get_captcha_value()

    while True:
        res = s.post(link,data=payload)
        soup = BeautifulSoup(res.text,"lxml")
        rows = soup.select("table[id$='gridAssessors_gridview'] tr[class$='RowStyle']")
        if not rows: break  # no result rows means we are past the last page
        for items in rows:
            _name = items.select_one("td > span").get_text(strip=True)
            print(_name)
        page+=1
        # Rebuild the payload from the inputs on the current results page
        payload = {i['name']:i.get('value','') for i in soup.select('input[name]')}
        # Drop the submit buttons so the request is treated as a pager event
        payload.pop('ctl00$DefaultContent$AssessorSearch$dfSearchAgain$Feedback', None)
        payload.pop('ctl00$DefaultContent$AssessorSearch$dfSearchAgain$Search', None)
        payload['__EVENTTARGET'] = 'ctl00$DefaultContent$AssessorSearch$gridAssessors$grid_pager'
        payload['__EVENTARGUMENT'] = f'1${page}'
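To illustrate modifying the selectors to pick up other fields, here is a minimal sketch that reads every cell of each result row instead of only the name. It runs against a made-up HTML fragment: the column layout (name, company, county), the class names and the sample values are assumptions, so check them against the real markup before relying on this. extract_rows is a hypothetical helper.

```python
from bs4 import BeautifulSoup

# Assumed sample markup; the real grid may have different columns and classes.
SAMPLE_HTML = """
<table id="ctl00_DefaultContent_AssessorSearch_gridAssessors_gridview">
  <tr class="GridHeaderStyle"><th>Name</th><th>Company</th><th>County</th></tr>
  <tr class="GridRowStyle"><td><span>Jane Example</span></td><td>Example Energy Ltd</td><td>Co. Dublin</td></tr>
  <tr class="GridAlternateRowStyle"><td><span>John Sample</span></td><td>Sample Surveys</td><td>Co. Cork</td></tr>
</table>
"""


def extract_rows(html):
    """Return one list of cell texts per result row in the grid."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    # Same row selector as the script above; only the per-row extraction changes
    for row in soup.select("table[id$='gridAssessors_gridview'] tr[class$='RowStyle']"):
        rows.append([cell.get_text(strip=True) for cell in row.select("td")])
    return rows


rows = extract_rows(SAMPLE_HTML)
for row in rows:
    print(row)
```

Collecting lists of cells rather than printing each name makes it straightforward to append the rows from every page to one list and write them out at the end.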
