I'm using Python to fetch information from a website. The script is very simple:
from urllib2 import *
website = 'http://www.haodf.com'
web = urlopen(website)
content = web.read()  # fetch the content of the page
print content
and it returns:
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html>
<head>
<title>250 Forbidden</title>
</head>
<body>
<h1>250 Forbidden</h1>
</body>
</html>
Why does the content say "250 Forbidden"? It seems I can't actually access the site, even though the same script works fine on other websites such as google.com.
This particular site requires a User-Agent header to be sent along with the request:
>>> import urllib2
>>> request = urllib2.Request("http://www.haodf.com", headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36'})
>>> print urllib2.urlopen(request).read()
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=gb2312" />
...
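On Python 3 the same approach works, since urllib2 was merged into urllib.request; here's a minimal sketch reusing the URL and browser User-Agent string from the example above (the header can be inspected on the Request object before any network call is made):

```python
from urllib.request import Request, urlopen

# Same URL and User-Agent string as in the urllib2 example above.
request = Request(
    "http://www.haodf.com",
    headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) '
                           'AppleWebKit/537.36 (KHTML, like Gecko) '
                           'Chrome/35.0.1916.153 Safari/537.36'},
)

# urllib stores header keys capitalized, so look it up as 'User-agent'.
print(request.get_header('User-agent'))

# content = urlopen(request).read()  # performs the actual network request
```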
Or switch to requests, which sends a User-Agent header by default:
>>> import requests
>>> response = requests.get('http://www.haodf.com')
>>> response.request.headers
CaseInsensitiveDict({'Accept-Encoding': 'gzip, deflate, compress', 'Accept': '*/*', 'User-Agent': 'python-requests/2.2.1 CPython/2.7.5 Darwin/13.3.0'})
>>> print response.text
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=gb2312" />
...
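If a site also blocks the default python-requests User-Agent, requests lets you override the header per call. A sketch, reusing the browser string from the urllib2 example above; preparing the request lets you inspect the headers without sending anything over the network:

```python
import requests

# Browser User-Agent string carried over from the urllib2 example.
ua = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) '
      'AppleWebKit/537.36 (KHTML, like Gecko) '
      'Chrome/35.0.1916.153 Safari/537.36')

# Build and prepare the request without sending it, so headers are inspectable offline.
prepared = requests.Request('GET', 'http://www.haodf.com',
                            headers={'User-Agent': ua}).prepare()
print(prepared.headers['User-Agent'])

# To actually perform the request:
# response = requests.get('http://www.haodf.com', headers={'User-Agent': ua})
```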