Trying to log in to a website using Python 3



I'm new to Python, so I'm still getting used to some of the different libraries it offers. I'm currently trying to use urllib to access a website's HTML so that I can eventually scrape data from a table in an account I want to log into.

import urllib.request

link = "websiteurl.com"
login = "email@address.com"
password = "password"

# Access the website at the given address; returns the HTML as bytes
def access_website(address):
    return urllib.request.urlopen(address).read()

html = access_website(link)
print(html)

This function returns:

b'<!DOCTYPE html>\n<html lang="en">\n  <head>\n    <meta charset="utf-8">\n    <meta http-equiv="X-UA-Compatible" content="IE=edge">\n
<meta name="viewport" content="width=device-width, initial-scale=1">\n    <title>Festival Manager</title>\n
<link href="bundle.css" rel="stylesheet">\n    <!-- HTML5 shim and Respond.js for IE8 support of HTML5 elements and media queries -->\n
<!-- WARNING: Respond.js doesn't work if you view the page via file:// -->\n    <!--[if lt IE 9]>\n      <script src="https://oss.maxcdn.com/html5shiv/3.7.2/html5shiv.min.js"></script>\n
<script src="https://oss.maxcdn.com/respond/1.4.2/respond.min.js"></script>\n    <![endif]-->\n  </head>\n  <body>\n
<script src="vendor.js"></script>\n    <script src="login.js"></script>\n  </body>\n</html>\n'
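As an aside, the `b'...'` prefix in that output means `read()` returned raw bytes, not a string, which is why the `\n` sequences show up literally instead of as line breaks. A minimal sketch (using a shortened stand-in for the page) of turning it into readable text:

```python
# The b'...' prefix means urllib returned bytes. Decoding them to a str
# turns the literal "\n" escapes into real line breaks when printed.
raw = b'<!DOCTYPE html>\n<html lang="en">\n  <head>\n    <title>Festival Manager</title>\n  </head>\n</html>\n'

text = raw.decode("utf-8")  # bytes -> str
print(text)                 # now prints across multiple lines
```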

The problem is that I'm not quite sure why it's giving me the part about the HTML5 shim and Respond.js... because when I visit the actual website and inspect the JavaScript, it doesn't look like this, so it doesn't seem to be returning the HTML I see when I actually visit the site.

Also, I tried to check what kind of request it sends when I submit the login information, and it doesn't show me a POST request in the Network tab of the element inspector. So I'm really not sure how I could even log in by sending the login information via a POST request through Python?

Here's my take on it with Python 3, done without any external libraries (StackOverflow). After logging in you can use BeautifulSoup or any other kind of scraping; and if you can log in without 3rd-party libraries/modules, you can scrape without them too.

Also, the script is here on my GitHub

The entire script is copied below, as per StackOverflow guidelines:

# Login to website using just the Python 3 Standard Library
import urllib.parse
import urllib.request
import http.cookiejar

def scraper_login():
    ####### change variables here, like URL, action URL, user, pass
    # your base URL here, will be used for headers and such, with and without https://
    base_url = 'www.example.com'
    https_base_url = 'https://' + base_url

    # here goes the URL that's found inside form action='.....'
    #   adjust as needed, can be all kinds of weird stuff
    authentication_url = https_base_url + '/login'

    # username and password for the login
    username = 'yourusername'
    password = 'SoMePassw0rd!'

    # we will use this string to confirm the login at the end
    check_string = 'Logout'

    ####### rest of the script is logic
    # but you may need to tweak a couple of things regarding the "token" logic
    #   (can be _token or token or _token_ or secret ... etc.)

    # big thing! you need a referer for most pages! and correct headers are the key
    headers = {
        "Content-Type": "application/x-www-form-urlencoded",
        "User-Agent": "Mozilla/5.0 Chrome/81.0.4044.92",  # Chrome 80+ as per web search
        "Host": base_url,
        "Origin": https_base_url,
        "Referer": https_base_url,
    }

    # initiate the cookie jar (using http.cookiejar and urllib.request)
    cookie_jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie_jar))
    urllib.request.install_opener(opener)

    # first a simple request, just to get the login page and parse out the token
    #   (using urllib.request)
    request = urllib.request.Request(https_base_url)
    response = urllib.request.urlopen(request)
    contents = response.read()

    # parse the page; we look for the token, e.g. on my page it was something like this:
    #   <input type="hidden" name="_token" value="random1234567890qwertzstring">
    # this can probably be done better with regex and similar,
    # but I'm a newb, so bear with me
    html = contents.decode("utf-8")
    # text just before the start and just after the end of your token string
    mark_start = '<input type="hidden" name="_token" value="'
    mark_end = '">'
    # index of those two points
    start_index = html.find(mark_start) + len(mark_start)
    end_index = html.find(mark_end, start_index)
    # the text between them is our token; store it for the second step, the actual login
    token = html[start_index:end_index]

    # here we craft our payload: all the form fields, including HIDDEN fields!
    #   that includes the token we scraped earlier, as that's usually in hidden fields
    #   make sure the left side comes from the "name" attributes of the form,
    #       and the right side is what you want to post as "value"
    #   and for hidden fields make sure you replicate the expected answer,
    #       e.g. "token" or "yes I agree" checkboxes and such
    payload = {
        '_token': token,
        # 'name': 'value',    # make sure this is the format of all additional fields!
        'login': username,
        'password': password,
    }

    # now we prepare all we need for the login:
    #   data - our payload (user/pass/token) urlencoded and encoded as bytes
    data = urllib.parse.urlencode(payload)
    binary_data = data.encode('UTF-8')

    # and put the URL + encoded data + correct headers into our POST request
    #   btw, despite what I thought, it is automatically treated as POST --
    #   I guess because of the byte-encoded data field -- so you don't need to say it like this:
    #       urllib.request.Request(authentication_url, binary_data, headers, method='POST')
    request = urllib.request.Request(authentication_url, binary_data, headers)
    response = urllib.request.urlopen(request)
    contents = response.read()

    # just for kicks, we confirm some element on the page that sits behind the login:
    #   we use a particular string we know only occurs after login,
    #   like "logout" or "welcome" or "member", etc. I found "Logout" is pretty safe so far
    contents = contents.decode("utf-8")
    index = contents.find(check_string)
    # if we find it
    if index != -1:
        print(f"We found '{check_string}' at index position : {index}")
    else:
        print(f"String '{check_string}' was not found! Maybe we did not login ?!")

scraper_login()
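The comments above note that the token could probably be parsed "better with regex". A hedged sketch of that alternative, assuming the same hidden-input markup as in the script:

```python
import re

# Sample of the hidden-input markup that the find()-based parsing above targets
html = '<form><input type="hidden" name="_token" value="random1234567890qwertzstring"></form>'

# Capture whatever sits inside value="..." of the _token field
match = re.search(r'name="_token"\s+value="([^"]*)"', html)
token = match.group(1) if match else None
print(token)  # random1234567890qwertzstring
```

This fails more gracefully than slicing on `find()` indices: if the field is missing, `token` is `None` instead of a garbage substring.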

A short addendum about your original code... If a site has no login page, that's usually enough. But with modern logins you usually have cookies, referer-page checks, user-agent checks, tokens, if not more (like captchas). Websites don't like being scraped and they fight it. It's also known as good security.

So on top of doing the request like you originally did, you have to:

- get the page's cookies, and supply them back when logging in
- know the page's referer; usually you can pass the login page as the referer for the login-action page
- fake the user agent; if you announce yourself as the default "Python 3" agent, you may get blocked right away
- scrape the token (as in my case) and supply it back when logging in
- pack your payload (user, pass, token and whatever else), encode it properly, and submit it as DATA to trigger the POST method
- etc.

So yes, with the built-in libraries the code gets a bit messy once a login page is involved. With 3rd-party libraries it's a bit shorter, but from what I researched you again have to think about referers, user agents, token scraping, etc. No library does that automatically, because every page works slightly differently (some need a fake agent, some don't; some have tokens, some don't; some name them differently; and so on). If you strip the comments and extras out of my code and shorten it a bit, you can turn it into a function that takes 5 parameters and is 15 lines or fewer.
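As a sketch of what that condensed 5-parameter function might look like (same stdlib-only flow as the full script; the `_token` form layout is assumed and the function is only defined here, not called against a real site):

```python
import urllib.parse, urllib.request, http.cookiejar

def login(base_url, auth_path, username, password, check_string):
    """Condensed version of the script above: 5 parameters, about a dozen lines."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    urllib.request.install_opener(opener)
    # step 1: fetch the login page and slice out the hidden token
    html = urllib.request.urlopen(base_url).read().decode("utf-8")
    mark = '<input type="hidden" name="_token" value="'
    start = html.find(mark) + len(mark)
    token = html[start:html.find('"', start)]
    # step 2: POST the urlencoded payload with headers, then check for the marker string
    data = urllib.parse.urlencode({'_token': token, 'login': username, 'password': password}).encode()
    headers = {"User-Agent": "Mozilla/5.0", "Referer": base_url}
    page = urllib.request.urlopen(urllib.request.Request(base_url + auth_path, data, headers)).read()
    return check_string in page.decode("utf-8")
```

Usage would be something like `login('https://www.example.com', '/login', 'yourusername', 'SoMePassw0rd!', 'Logout')`, returning `True` on a successful login.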

Cheers!
