Python urllib returns an empty page for a specific URL

I'm running into a problem with urllib on a specific link. Here is an example of the code I'm using:

from urllib.request import Request, urlopen
import re
url = ""
req = Request(url)
html_page = urlopen(req).read()
print(len(html_page))

Here are the results I get for two links:

url = "https://www.dafont.com"
Length: 0
url = "https://www.stackoverflow.com"
Length: 196673

Does anyone know why this happens?

Try sending a User-Agent header. You will get a response. Some websites are protected and only respond to certain user agents.

from urllib.request import Request, urlopen
url = "https://www.dafont.com"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36"}
req = Request(url, headers=headers)
html_page = urlopen(req).read()
print(len(html_page))

This is a restriction imposed by the authors of the dafont website.

By default, urllib sends a User-Agent header of urllib/VVV, where VVV is the urllib version number (in current Python 3 releases it takes the form Python-urllib/x.y). For more information, see https://docs.python.org/3/library/urllib.request.html. Many webmasters protect their sites against crawlers by parsing the User-Agent header, so when they see a value like urllib/VVV they assume the request comes from a crawler and block it.
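To see exactly which User-Agent your interpreter would send, and to confirm that overriding it per request works, a quick check like the following can help (a small sketch added here, not part of the original answer; the dafont URL is only used as an example):

from urllib.request import Request, build_opener

# Inspect the default User-Agent that urllib attaches to requests.
opener = build_opener()
print(opener.addheaders)               # e.g. [('User-agent', 'Python-urllib/3.9')]

# Override it for a single request, as in the answer above.
req = Request("https://www.dafont.com",
              headers={"User-Agent": "Mozilla/5.0"})
print(req.get_header("User-agent"))    # Mozilla/5.0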

Testing with the HEAD method:

$ curl -A "Python-urllib/2.6" -I https://www.dafont.com
HTTP/1.1 200 OK
Date: Sun, 13 Jun 2021 15:11:53 GMT
Server: Apache
Strict-Transport-Security: max-age=63072000; includeSubDomains; preload
Content-Type: text/html
$ curl -I https://www.dafont.com
HTTP/1.1 200 OK
Date: Sun, 13 Jun 2021 15:12:02 GMT
Server: Apache
Strict-Transport-Security: max-age=63072000; includeSubDomains; preload
Set-Cookie: PHPSESSID=dcauh0dp1antb7eps1smfg2a76; path=/
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
Content-Type: text/html

Testing with the GET method:

$ curl -sSL -A "Python-urllib/2.6" https://www.dafont.com | wc -c
0
$ curl -sSL https://www.dafont.com | wc -c
18543
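The same comparison can be reproduced from Python. The snippet below is a rough sketch (assuming dafont still behaves as in the curl tests above), not part of the original answer: HEAD returns 200 either way, while the GET body is empty unless a browser-like User-Agent is sent.

from urllib.request import Request, urlopen

url = "https://www.dafont.com"
browser_ua = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

for headers in ({}, browser_ua):
    # HEAD request: check only the status code.
    head_req = Request(url, headers=headers, method="HEAD")
    with urlopen(head_req) as resp:
        status = resp.status
    # GET request: measure the length of the returned body.
    get_req = Request(url, headers=headers)
    with urlopen(get_req) as resp:
        body_len = len(resp.read())
    print(headers.get("User-Agent", "default Python-urllib"), status, body_len)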
