Mechanize raises a 406 error when trying to open a URL:
for url in urls:
    if "http://" not in url:
        url = "http://" + url
    print url
    try:
        page = mech.open("%s" % url)
    except urllib2.HTTPError, e:
        print "there was an error opening the URL, logging it"
        print e.code
        logfile = open("log/urlopenlog.txt", "a")
        logfile.write(url + "," + "couldn't open this page" + "\n")
        continue
    else:
        print "opening this URL..."
        page = mech.open(url)
Any idea what could cause a 406 error to occur? If I go to the problem URL, I can open it in my browser just fine.
Try adding headers to your request based on what your browser sends; start with an Accept header (a 406 usually means the server doesn't like the content types you say you're willing to accept). See "Adding headers" in the documentation:
req = mechanize.Request(url)
req.add_header('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8')
page = mechanize.urlopen(req)
The Accept header value above is based on the one Chrome sends. If you want to know which headers your browser sends, this page will show you: https://www.whatismybrowser.com/detect/what-http-headers-is-my-browser-sending
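The same header-carrying request can be built with the standard library alone. This is a minimal sketch using Python 3's `urllib.request` (in Python 2, as in the question, the module is `urllib2` with the same API); the URL is just a placeholder. A `Request` object records its headers without opening any connection, so you can inspect what will be sent before calling `urlopen`:

```python
import urllib.request

# Headers mirroring what a desktop browser sends; the Accept value is
# the Chrome-style string used above
headers = {
    'User-Agent': 'Mozilla/5.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
}

# Build the request without opening a connection (placeholder URL)
req = urllib.request.Request('http://example.com/', headers=headers)

# Inspect the header the server will see
print(req.get_header('Accept'))
```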
The 'Accept' and 'User-Agent' headers should be enough. Here is how I eliminated the error:
# establish counter
j = 0
# create headers for the webpage
headers = {'User-Agent': 'Mozilla/5.0', 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'}
# loop through the list of URLs
for url in URLs:
    # identify the scraper agent so that web security systems don't block
    # webpage scraping upon URL opening, with j as a counter
    req = mechanize.Request(URLs[j], headers=headers)
    # open the url
    page = mechanize.urlopen(req)
    # increase counter
    j += 1
You can also try importing the "urllib2" or "urllib" library to open these URLs; the syntax is the same.
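For instance, a rough urllib-based version of the question's loop might look like the sketch below (Python 3 shown, where `urllib2` was split into `urllib.request` and `urllib.error`; the URL list is a placeholder). Note that `HTTPError` must be caught before `URLError`, since it is a subclass:

```python
import urllib.request
import urllib.error

# Browser-like headers, as in the answers above
headers = {
    'User-Agent': 'Mozilla/5.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
}

urls = ['example.com']  # placeholder list of URLs

for url in urls:
    # Normalize bare hostnames, as in the question's code
    if "http://" not in url:
        url = "http://" + url
    req = urllib.request.Request(url, headers=headers)
    try:
        page = urllib.request.urlopen(req, timeout=10)
        print("opening this URL...", url)
    except urllib.error.HTTPError as e:
        # HTTP-level failure (e.g. the 406): log it and move on
        print("there was an error opening the URL:", url, e.code)
    except urllib.error.URLError as e:
        # Network-level failure (offline, DNS): also move on
        print("could not reach", url, e.reason)
```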