Response page not scraped after POST_DATA - Beautiful Soup & Python



I am trying to scrape a web page after posting data to a form, using the following code.

import bs4 as bs
import urllib.request
import requests
import webbrowser
import urllib.parse
url_for_parse = "http://demo.testfire.net/feedback.aspx"
#PARSE THE WEBPAGE
sauce = urllib.request.urlopen(url_for_parse).read()
soup = bs.BeautifulSoup(sauce,"html.parser")
#GET FORM ATTRIBUTES
form = soup.find('form')
action_value = form.get('action')
method_value = form.get('method')
id_value = form.get('id')
#POST DATA
payload = {'txtSearch':'HELLOWORLD'}
r = requests.post(url_for_parse, payload)
#PARSING ACTION VALUE WITH URL
url2 = urllib.parse.urljoin(url_for_parse,action_value)
#READ RESPONSE
response = urllib.request.urlopen(url2)
page_source = response.read()
with open("results.html", "w") as f:
    f.write(str(page_source))
searchfile = open("results.html", "r")
for line in searchfile:
    if "HELLOWORLD" in line: 
        print ("STRING FOUND")
    else:
        print ("STRING NOT FOUND")  
searchfile.close()  

The code runs correctly. A response web page is successfully scraped and stored in results.html.

However, I want to scrape the web page after post_data is executed, because every time I run the code I get the result: STRING NOT FOUND. This means the resulting web page is being scraped before post_data is executed.

How can I modify the code so that the form is submitted successfully first, and the resulting page source is then stored in a local file?

For the above process, would an alternative framework be recommended instead of Beautiful Soup?

What you are doing is obvious:

1) Posting some data to a URL
2) Scraping the same URL
3) Checking for some "string"

But what you should be doing is:

1) Post data to a URL
2) Scrape the resultant page (not the same URL) and store it in the file
3) Check for some "string"

To do this, you need to write r.content to the local file and search that for the string.

Modify the code like this:

payload = {'txtSearch': 'HELLOWORLD'}
url2 = urllib.parse.urljoin(url_for_parse, action_value)
# auth takes a (username, password) tuple; include it only if the site needs it
r = requests.post(url2, data=payload, auth=("USERNAME", "PASSWORD"))
with open("results.html", "wb") as f:
    f.write(r.content)
# Then continue searching for the string.

Note: you need to send the payload to url2, not to the initial URL (url_for_parse).
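Putting that together with the variables from the question, the whole corrected flow might look like this (a minimal sketch without authentication; whether HELLOWORLD is actually echoed back depends on how the site renders the POST response):

import bs4 as bs
import urllib.parse
import urllib.request
import requests

url_for_parse = "http://demo.testfire.net/feedback.aspx"

#PARSE THE WEBPAGE AND GET THE FORM ACTION, AS IN THE QUESTION
sauce = urllib.request.urlopen(url_for_parse).read()
soup = bs.BeautifulSoup(sauce, "html.parser")
action_value = soup.find('form').get('action')

#POST TO THE RESOLVED ACTION URL AND SAVE THE POST RESPONSE ITSELF
url2 = urllib.parse.urljoin(url_for_parse, action_value)
r = requests.post(url2, data={'txtSearch': 'HELLOWORLD'})
with open("results.html", "wb") as f:
    f.write(r.content)

print("STRING FOUND" if b"HELLOWORLD" in r.content else "STRING NOT FOUND")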

The response returned from the requests.post call will be the HTML you are after. You can access it by doing

r.content

However, in my testing of this it said I was not authenticated, so I assume you are already authenticated?

I would also suggest using requests throughout, rather than urllib for the GET and requests for the POST.
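For example, the initial fetch in the question could use requests as well (a sketch):

import bs4 as bs
import requests

url_for_parse = "http://demo.testfire.net/feedback.aspx"

# GET with requests instead of urllib.request.urlopen
sauce = requests.get(url_for_parse).text
soup = bs.BeautifulSoup(sauce, "html.parser")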

Best bet: keep your session parameters in a requests Session.

http://docs.python-requests.org/en/master/user/advanced/#session-objects

import requests
proxies = {
    "http": "",
    "https": "",
}
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'
}
data = {'item': 'content'}
# not that you need basic auth, but it's simple to toss into requests
auth = requests.auth.HTTPBasicAuth('fake@example.com', 'not_a_real_password')
url = "http://demo.testfire.net/comment.aspx"  # the form's action target, see below
s = requests.session()
s.headers.update(headers)
s.proxies.update(proxies)
response = s.post(url=url, data=data, auth=auth)

The key bit here is really what you are calling and then waiting on:

<form name="cmt" method="post" action="comment.aspx">

which is just a POST to http://demo.testfire.net/comment.aspx.
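Continuing in that vein, a self-contained sketch that resolves the form's action and posts through the session might look like this (the URL and the txtSearch field are carried over from the question; the actual comment form will have its own field names):

import urllib.parse
import bs4 as bs
import requests

url_for_parse = "http://demo.testfire.net/feedback.aspx"

s = requests.session()

# fetch the form page and read its action attribute
page = s.get(url_for_parse)
form = bs.BeautifulSoup(page.text, "html.parser").find('form')

# action="comment.aspx" resolves against the page URL to /comment.aspx
target = urllib.parse.urljoin(url_for_parse, form.get('action'))
response = s.post(url=target, data={'txtSearch': 'HELLOWORLD'})
print(response.status_code)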
