Scraping an .aspx page with Python returns a 404



I'm a web-scraping beginner, and I'm trying to scrape this page: https://profiles.doe.mass.edu/statereport/ap.aspx

I'd like to be able to select some settings at the top (such as Districts, 2020-2021, Computer Science A, Female) and then download the resulting data for those settings.

Here is the code I'm currently using:

import requests
from bs4 import BeautifulSoup

url = 'https://profiles.doe.mass.edu/statereport/ap.aspx'
with requests.Session() as s:
    s.headers['User-Agent'] = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:100.0) Gecko/20100101 Firefox/100.0"
    r = s.get('https://profiles.doe.mass.edu/statereport/ap.aspx')
    soup = BeautifulSoup(r.text, "lxml")
    # collect every named <input> on the page as the POST payload
    data = {i['name']: i.get('value', '') for i in soup.select('input[name]')}

    data["ctl00$ContentPlaceHolder1$ddReportType"] = "DISTRICT",
    data["ctl00$ContentPlaceHolder1$ddYear"] = "2021",
    data["ctl00$ContentPlaceHolder1$ddSubject"] = "COMSCA",
    data["ctl00$ContentPlaceHolder1$ddStudentGroup"] = "F",

    p = s.post(url, data=data)

When I print out p.text, I get a page whose title is '\t404 - Page Not Found\r\n' and the message

<h2>We are unable to locate information at: <br /><br />http://profiles.doe.mass.edu:80/statereport/ap.aspxp?ASP.NET_SessionId=bxfgao54wru50zl5tkmfml00</h2>

Here is what data looks like before I modify it:

{'__EVENTVALIDATION': '/wEdAFXz4796FFICjJ1Xc5ZOd9SwSHUlrrW+2y3gXxnnQf/b23Vhtt4oQyaVxTPpLLu5SKjKYgCipfSrKpW6jkHllWSEpW6/zTHqyc3IGH3Y0p/oA6xdsl0Dt4O8D2I0RxEvXEWFWVOnvCipZArmSoAj/6Nog6zUh+Jhjqd1LNep6GtJczTu236xw2xaJFSzyG+xo1ygDunu7BCYVmh+LuKcW56TG5L0jGOqySgRaEMolHMgR0Wo68k/uWImXPWE+YrUtgDXkgqzsktuw0QHVZv7mSDJ31NaBb64Fs9ARJ5Argo+FxJW/LIaGGeAYoDphL88oao07IP77wrmH6t1R4d88C8ImDHG9DY3sCDemvzhV+wJcnU4a5qVvRziPyzqDWnj3tqRclGoSw0VvVK9w+C3/577Gx5gqF21UsZuYzfP4emcqvJ7ckTiBk7CpZkjUjM6Z9XchlxNjWi1LkzyZ8QMP0MaNCP4CVYJfndopwFzJC7kI3W106YIA/xglzXrSdmq6/MDUCczeqIsmRQGyTOkQFH724RllsbZyHoPHYvoSAJilrMQf6BUERVN4ojysx3fz5qZhZE7DWaJAC882mXz4mEtcevFrLwuVPD7iB2v2mlWoK0S5Chw4WavlmHC+9BRhT36jtBzSPRROlXuc6P9YehFJOmpQXqlVil7C9OylT4Kz5tYzrX9JVWEpeWULgo9Evm+ipJZOKY2YnC41xTK/MbZFxsIxqwHA3IuS10Q5laFojoB+e+FDCqazV9MvcHllsPv2TK3N1oNHA8ODKnEABoLdRgumrTLDF8Lh+k+Y4EROoHhBaO3aMppAI52v3ajRcCFET22jbEm/5+P2TG2dhPhYgtZ8M/e/AoXht29ixVQ1ReO/6bhLIM+i48RTmcl76n1mNjfimB8r3irXQGYIEqCkXlUHZ/SNlRYyx3obJ6E/eljlPveWNidFHOaj+FznOh264qDkMm7fF78WBO2v0x+or1WGijWDdQtRy9WRKXchYxUchmBlYm15YbBfMrIB7+77NJV+M6uIVVnCyiDRGj+oPXcTYxqSUCLrOMQyzYKJeu8/hWD0gOdKeoYUdUUJq4idIk+bLYy76sI/N2aK+aXZo/JPQ+23gTHzIlyi4Io7O6kXaULPs8rfo8hpkH1qXyKb/rP2VJBNWgyp8jOMx9px+m4/e2Iecd86E4eN4Rk6OIiwqGp+dMdgntXu5ruRHb1awPlVmDw92dL1P0b0XxJW7EGfMzyssMDhs1VT6K6iMUTHbuXkNGaEG1dP1h4ktnCwGqDLVutU6UuzT6i4nfqnvFjGK9+7Ze8qWIl8SYyhmvzmgpLjdMuF9CYMQ2Aa79HXLKFACsSSm0dyiU1/ZGyII2Fvga9o+nVV1jZam3LkcAPaXEKwEyJXfN/DA7P4nFAaQ+QP+2bSgrcw+/dw+86OhPyG88qyJwqZODEXE1WB5zSOUywGb1/Xed7wq9WoRs6v8rAK5c/2iH7YLiJ4mUVDo+7WCKrzO5+Hsyah3frMKbheY1acRmSVUzRgCnTx7jvcLGR9Jbt6TredqZaWZBrDFcntdg7EHd7imK5PqjUld3iCVjdyO+yLKUkMKiFD85G3vEferg/Q/TtfVBqeTU0ohP9d+CsKOmV/dxVYWEtBcfa9KiN6j4N8pP7+3iUOhajojZ8jV98kxT0zPZlzkpqI4SwR6Ys8d2RjIi5K+oQul4pL5u+zZvX0lsLP9Jl7FeVTfBvST67T6ohz8dl9gBfmmbwnT23SyuFSUGd6ZGaKE+9kKYmuImW7w3ePs7C70yDWHpIpxP/IJ4GHb36LWto2g3Ld3goCQ4fXPu7C4iTiN6b5WUSlJJsWGF4eQkJue8=',
'__VIEWSTATE': '/wEPDwUKLTM0NzY4OTQ4NmRkDwwPzTpuna+yxVhQxpRF4n2+zYKQtotwRPqzuCkRvyU=',
'__VIEWSTATEGENERATOR': '2B6F8D71',
'ctl00$ContentPlaceHolder1$btnViewReport': 'View Report',
'ctl00$ContentPlaceHolder1$hfExport': 'ViewReport',
'leftNavId': '11241',
'quickSearchValue': '',
'runQuickSearch': 'Y',
'searchType': 'QUICK',
'searchtext': ''}

Following suggestions from similar questions, I have tried using params, editing data in various ways (to mimic the POST request I see in my browser when I browse the site myself), and specifying the ASP.NET_SessionId, but to no avail.

How can I get the data from this website?

This should be what you are looking for. What I did was parse the HTML with bs4 and find the table, then get the rows; to make the data easier to work with, I put them into a dictionary.

import requests
from bs4 import BeautifulSoup

url = 'https://profiles.doe.mass.edu/statereport/ap.aspx'
with requests.Session() as s:
    s.headers['User-Agent'] = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:100.0) Gecko/20100101 Firefox/100.0"
    r = s.get(url)

soup = BeautifulSoup(r.text, 'html.parser')
table = soup.find_all('table')
rows = table[0].find_all('tr')
data = {}
for row in rows:
    if row.find_all('th'):
        # header row: create one dict key (with an empty list) per column
        keys = row.find_all('th')
        for key in keys:
            data[key.text] = []
    else:
        # data row: append each cell to the list for its column
        # (enumerate pairs each cell with its column position; list.index
        # would pick the wrong column when two cells hold the same text)
        for i, value in enumerate(row.find_all('td')):
            data[keys[i].text].append(value.text)

for key in data:
    print(key, data[key][:10])
    print('\n')

Output:

District Name ['Abington', 'Academy Of the Pacific Rim Charter Public (District)', 'Acton-Boxborough', 'Advanced Math and Science Academy Charter (District)', 'Agawam', 'Amesbury', 'Amherst-Pelham', 'Andover', 'Arlington', 'Ashburnham-Westminster']

District Code ['00010000', '04120000', '06000000', '04300000', '00050000', '00070000', '06050000', '00090000', '00100000', '06100000']

Tests Taken ['     100', '     109', '   1,070', '     504', '     209', '     126', '     178', '     986', '     893', '      97']

Score=1 ['      16', '      81', '      12', '      29', '      27', '      18', '       5', '      70', '      72', '       4']

Score=2 ['      31', '      20', '      55', '      74', '      65', '      34', '      22', '     182', '     149', '      23']

Score=3 ['      37', '       4', '     158', '     142', '      55', '      46', '      37', '     272', '     242', '      32']

Score=4 ['      15', '       3', '     344', '     127', '      39', '      19', '      65', '     289', '     270', '      22']

Score=5 ['       1', '       1', '     501', '     132', '      23', '       9', '      49', '     173', '     160', '      16']

% Score 1-2 ['  47.0', '  92.7', '   6.3', '  20.4', '  44.0', '  41.3', '  15.2', '  25.6', '  24.7', '  27.8']

% Score 3-5 ['  53.0', '   7.3', '  93.7', '  79.6', '  56.0', '  58.7', '  84.8', '  74.4', '  75.3', '  72.2']

Process finished with exit code 0

I was able to get this working by adapting the code from here. I'm not sure why editing the payload in this way makes a difference, so I would greatly appreciate any insight!
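My best guess at why (an assumption on my part, not verified against the server): the first attempt posted every named input on the page, including the site-wide quick-search fields, whereas the working payload keeps only the ctl00$... inputs that actually carry a value plus the __VIEWSTATE/__EVENTVALIDATION state fields. A self-contained sketch that prints which fields the two approaches disagree on:

import requests
from bs4 import BeautifulSoup

url = 'https://profiles.doe.mass.edu/statereport/ap.aspx'
with requests.Session() as s:
    s.headers['User-Agent'] = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:100.0) Gecko/20100101 Firefox/100.0"
    soup = BeautifulSoup(s.get(url).text, 'html.parser')

# naive payload: every <input name=...> on the page
naive = {i['name']: i.get('value', '') for i in soup.select('input[name]')}
# filtered payload: ctl00$... inputs with a value, plus the __... state fields
filtered = {t['name']: t['value'] for t in soup.select('input[name^=ctl00]') if t.get('value')}
filtered.update({t['name']: t['value'] for t in soup.select('input[name^=__]')})
print(sorted(set(naive) - set(filtered)))
# judging by the field dump in the question, the extras are the quick-search
# inputs (leftNavId, quickSearchValue, runQuickSearch, searchType, searchtext);
# posting those presumably routes the request to the site's search handler
# instead of the report postback, which would explain the 404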

Below is my working code, which uses Pandas to parse the table:

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://profiles.doe.mass.edu/statereport/ap.aspx'
with requests.Session() as s:
    s.headers['User-Agent'] = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:100.0) Gecko/20100101 Firefox/100.0"

    response = s.get(url)
    soup = BeautifulSoup(response.content, 'html5lib')
    # only the ctl00$... inputs that actually carry a value
    data = {tag['name']: tag['value']
            for tag in soup.select('input[name^=ctl00]') if tag.get('value')}
    # the ASP.NET state fields (__VIEWSTATE, __EVENTVALIDATION, ...)
    state = {tag['name']: tag['value']
             for tag in soup.select('input[name^=__]')}

    payload = data.copy()
    payload.update(state)

    # note: the trailing commas make these one-element tuples, but requests
    # encodes a 1-tuple the same as the bare string, so the POST still works
    payload["ctl00$ContentPlaceHolder1$ddReportType"] = "DISTRICT",
    payload["ctl00$ContentPlaceHolder1$ddYear"] = "2021",
    payload["ctl00$ContentPlaceHolder1$ddSubject"] = "COMSCA",
    payload["ctl00$ContentPlaceHolder1$ddStudentGroup"] = "F",

    p = s.post(url, data=payload)

df = pd.read_html(p.text)[0]
# restore the leading zeros that read_html strips from the district codes
df["District Code"] = df["District Code"].astype(str).str.zfill(8)
display(df)  # display() assumes an IPython/Jupyter environment
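
Since the original goal was to download the results for the chosen settings, the DataFrame can then be written straight to disk; a minimal follow-up (the filename is just an example):

df.to_csv('ap_district_2021_comsca_female.csv', index=False)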
