Which login form elements are needed for this web scraping task?



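I'm trying to log in to a grading website and scrape it. I've set up the code below to access the site and submit the following payload: username/email, password, and csrf_token. Do I need to include anything else in the payload to log in?

I'm using Python 2.7. I added code to print the last page the script opens, and it prints the login page, which makes me think the login never succeeds.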

import requests
from lxml import html
payload = {
    "username": "...",
    "password": "...",
    "csrf_token": "ImE2N2E1YzkzZGU2ZjY3NjQ0YTc4YmZiYWJjNWRiN2Y3MjlhYWZmYjQi.XBvDVg.ALSRF6Ui7Y2L7ST0kQG-CC4HTzQ"
}
session_requests = requests.session()
login_url = "https://www.zipgrade.com/login"
user_url = 'https://www.zipgrade.com/user'
result = session_requests.get(login_url)
# make HTML parse tree from page
tree = html.fromstring(result.text)
authenticity_token = list(set(tree.xpath("//input[@name='csrf_token']")))[0]
# send payload through
result = session_requests.post(
    login_url,
    data = payload,
    headers = dict(referer=login_url)
)
result = session_requests.get(
    user_url,
    headers = dict(referer = user_url)
)
tree = html.fromstring(result.content)
bucket_names = tree.xpath("//div[@class='row']")
print(result.ok)
print(bucket_names[0].text_content().strip())

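I expected it to take me to the https://www.zipgrade.com/user page, but it seems to stay on the https://www.zipgrade.com/login page.

Hmm, it looks like a session token is passed in the Cookie header. I just tried mimicking the login, and my request looks like this: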

import http.client
# The site is served over HTTPS, so use HTTPSConnection with the dotted host name
conn = http.client.HTTPSConnection("www.zipgrade.com")
payload = "username=some%40email.com&password=some%40password&csrf_token=IjhmNWU1Y2EzYWExMjcwM2FiZmY5MjEzOGUwNDQ2N2UxZWQ4ODY0OTMi.XBwSeg.RU2oZBM15U7-ECl1Ldfv7LYlcnQ%5E&origURL="
headers = {
    'Connection': "keep-alive",
    'Cache-Control': "max-age=0",
    'Origin': "https://www.zipgrade.com",
    'Upgrade-Insecure-Requests': "1",
    'Content-Type': "application/x-www-form-urlencoded",
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36",
    'Accept': "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    'Referer': "https://www.zipgrade.com/login/",
    # Accept-Encoding is left out: http.client does not decompress gzip/br bodies
    'Accept-Language': "en-US,en;q=0.9",
    # The session token travels in the Cookie header
    'Cookie': "session=eyJfcGVybWFuZW50Ijp0cnVlLCJjc3JmX3Rva2VuIjp7IiBiIjoiT0dZMVpUVmpZVE5oWVRFeU56QXpZV0ptWmpreU1UTTRaVEEwTkRZM1pURmxaRGc0TmpRNU13PT0ifX0.XBwSeg.EPMMH0CcBMif4qUoxGPKFvcnzRw",
    'cache-control': "no-cache",
    'Postman-Token': "865a89b0-c5cc-49b1-9e24-df413be64fc0"
}
# POST to the login path (same path as the Referer above)
conn.request("POST", "/login/", payload, headers)
res = conn.getresponse()
data = res.read()
print(data.decode("utf-8"))

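Note that your payload is correct: you are passing the right parameters. However, a session is also passed in the headers, so you need to obtain the session token and send it along with your headers.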

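I would make two requests. The first is a plain request to the login page, https://www.zipgrade.com/login/, which returns a cookie containing the session parameter you need. Parse the cookie and extract the session. Once that is done, resume your scraping function and make sure to update the headers variable with that session.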
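The first of those two requests could be sketched roughly as follows. This is only an illustration using the standard library: the helper names `extract_session_cookie` and `fetch_login_page` are made up for this example, and it assumes the cookie is called `session`, as in the Cookie header of the request shown earlier.

```python
import http.client
from http.cookies import SimpleCookie


def extract_session_cookie(set_cookie_header, name="session"):
    """Return the value of the named cookie from a Set-Cookie header, or None."""
    cookie = SimpleCookie()
    cookie.load(set_cookie_header)
    morsel = cookie.get(name)
    return morsel.value if morsel is not None else None


def fetch_login_page(host="www.zipgrade.com", path="/login/"):
    """GET the login page and return (session cookie value, page HTML)."""
    conn = http.client.HTTPSConnection(host)
    conn.request("GET", path)
    res = conn.getresponse()
    body = res.read().decode("utf-8", errors="replace")
    # Assumption: the server sets the session cookie on this first response.
    session = extract_session_cookie(res.getheader("Set-Cookie", ""))
    conn.close()
    return session, body
```

Calling `fetch_login_page()` would then give you both the session value for the Cookie header and the login page HTML to parse.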

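When you hit the URL to get the session, you can grab the CSRF token at the same time from the hidden input field on that page (the input named csrf_token).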
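A minimal sketch of pulling that hidden field out of the login page HTML, using only the standard library. The class and function names here are hypothetical, and the field name `csrf_token` is taken from the XPath in the question, not checked against the live page.

```python
from html.parser import HTMLParser


class HiddenInputParser(HTMLParser):
    """Remember the value of the first <input> whose name matches field_name."""

    def __init__(self, field_name):
        super().__init__()
        self.field_name = field_name
        self.value = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if (tag == "input" and self.value is None
                and attributes.get("name") == self.field_name):
            self.value = attributes.get("value")


def extract_csrf_token(page_html, field_name="csrf_token"):
    """Return the CSRF token found in page_html, or None if it is absent."""
    parser = HiddenInputParser(field_name)
    parser.feed(page_html)
    return parser.value
```

The question's lxml approach would work as well, as long as the value attribute of the matched input is read and sent, not the element itself.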

This way, your first call prepares the scraping call by collecting the dynamic tokens from the cookie and the hidden input field.
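Putting the two dynamic values together for the login POST might look like the sketch below. `build_login_request` is a hypothetical helper; the form fields (`username`, `password`, `csrf_token`, `origURL`) and the Referer mirror the captured request above, and whether the site needs any further headers is not something this sketch verifies.

```python
from urllib.parse import urlencode


def build_login_request(username, password, csrf_token, session):
    """Return (body, headers) for the login POST using freshly fetched tokens."""
    body = urlencode({
        "username": username,
        "password": password,
        "csrf_token": csrf_token,  # taken from the hidden input field
        "origURL": "",
    })
    headers = {
        "Content-Type": "application/x-www-form-urlencoded",
        "Referer": "https://www.zipgrade.com/login/",
        "Cookie": "session=" + session,  # taken from the first response's cookie
    }
    return body, headers
```

The result can be passed straight to `conn.request("POST", "/login/", body, headers)`. If you stay with `requests`, a `Session` object already carries cookies from one call to the next, so the part that most likely needs changing there is sending the token parsed from the page instead of a hard-coded one.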

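Keep in mind that sessions on different websites have different expiry times. Some session tokens can be reused to scrape several pages, while others require a new session on every hop. This is just a hint, but I think it will steer you in the right direction.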
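One way to notice that a session has expired is to check whether a response sends you back to the login page, and fetch a fresh session and token when it does. The sketch below assumes the site answers an expired session with an HTTP redirect to `/login`; that behaviour is an assumption, not something confirmed for this site, and `bounced_to_login` is a hypothetical helper name.

```python
from urllib.parse import urlparse


def bounced_to_login(status, location, login_path="/login"):
    """Return True if a response looks like a redirect back to the login page."""
    if status not in (301, 302, 303, 307, 308) or not location:
        return False
    # Compare paths only, ignoring a trailing slash and any query string.
    return urlparse(location).path.rstrip("/") == login_path.rstrip("/")
```

With `http.client` you would call it as `bounced_to_login(res.status, res.getheader("Location"))` after each page request and repeat the login steps whenever it returns True.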
