Python requests fetch of a URL returns 404, but the URL works in a browser



I have a crawling Python script that chokes on one URL: pulsepoint.com/sellers.json

The bot fetches the content with a standard requests call but gets a 404 back. In a browser the URL works (there is a 301 redirect, but requests can follow that). My first hunch was a problem with the request headers, so I copied my browser's configuration. The code looks like this:

import logging
import requests

crawled_url = "pulsepoint.com"
seller_json_url = 'http://{thehost}/sellers.json'.format(thehost=crawled_url)
print(seller_json_url)
myheaders = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'fr,fr-FR;q=0.8,en-US;q=0.5,en;q=0.3',
    'Accept-Encoding': 'gzip, deflate, br',
    'Connection': 'keep-alive',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache'
}
r = requests.get(seller_json_url, headers=myheaders)
logging.info("  %d" % r.status_code)

But I still get the 404 error.

My next guesses:

  • A login? Not used here
  • Cookies? I don't see any

So how is their server blocking my bot? By the way, this is a sellers.json URL that is meant to be crawled, so nothing illegal is going on.

Thanks in advance!
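One way to narrow this down is to look at every hop the request actually takes: `requests` keeps the intermediate redirect responses in `r.history`. A small sketch (the helper `describe_redirects` is my own, not from the question):

```python
import requests

def describe_redirects(response):
    """Return (status, url) pairs for each redirect hop, ending with the final response."""
    hops = [(r.status_code, r.url) for r in response.history]
    hops.append((response.status_code, response.url))
    return hops

# Usage (network required), e.g. with the headers from the question:
# r = requests.get("http://pulsepoint.com/sellers.json", headers=myheaders)
# for status, url in describe_redirects(r):
#     print(status, url)
```

Seeing which hop returns the 404 (the original host or the redirect target) tells you which server is rejecting the bot.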

You can also apply the following workaround for the SSL certificate error:

from urllib.request import urlopen
import ssl
import json

# Workaround for the SSL certificate error: disable verification globally
ssl._create_default_https_context = ssl._create_unverified_context

crawled_url = "pulsepoint.com"
seller_json_url = 'http://{thehost}/sellers.json'.format(thehost=crawled_url)
print(seller_json_url)
response = urlopen(seller_json_url).read()
# print in dictionary format
print(json.loads(response))

Sample response:

{'contact_email': 'PublisherSupport@pulsepoint.com', 'contact_address': '360 Madison Ave, 14th Floor, NY, NY 10017', 'version': '1.0', 'identifiers': [{'name': 'TAG-ID', 'value': '89ff185a4c4e857c'}], 'sellers': [{'seller_id': '508738', …

… 'seller_type': 'PUBLISHER'}, {'seller_id': '562225', 'name': 'EL DIARIO', 'domain': 'impromedia.com', 'seller_type': 'PUBLISHER'}]}
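Once the JSON is loaded, the `sellers` array is just a list of dicts and can be filtered like one. A sketch with made-up entries shaped like the sample response above (keys follow the sellers.json format shown there; the data itself is illustrative):

```python
import json

sample = json.loads("""
{
  "version": "1.0",
  "sellers": [
    {"seller_id": "508738", "name": "EXAMPLE A", "domain": "example-a.com", "seller_type": "PUBLISHER"},
    {"seller_id": "562225", "name": "EL DIARIO", "domain": "impromedia.com", "seller_type": "PUBLISHER"}
  ]
}
""")

# Collect the domains of all PUBLISHER entries
publishers = [s["domain"] for s in sample["sellers"] if s["seller_type"] == "PUBLISHER"]
print(publishers)  # ['example-a.com', 'impromedia.com']
```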

You can simply go straight to the final link and extract the data, without having to go through the 301 redirect to reach it:

import requests

headers = {"Upgrade-Insecure-Requests": "1"}
response = requests.get(
    url="https://projects.contextweb.com/sellersjson/sellers.json",
    headers=headers,
    verify=False,
)

OK, just for anyone else who lands here, this is a hardened version of the answer, because:

  • Some sites require headers before they will answer
  • Some sites use unusual encodings
  • Some sites send a gzipped answer even when it was not requested
import urllib.request
import ssl
import json
from io import BytesIO
import gzip

ssl._create_default_https_context = ssl._create_unverified_context

crawled_url = "pulsepoint.com"
seller_json_url = 'http://{thehost}/sellers.json'.format(thehost=crawled_url)
req = urllib.request.Request(seller_json_url)

# ADDING THE HEADERS
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0')
req.add_header('Accept', 'application/json,text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8')

response = urllib.request.urlopen(req)
data = response.read()

# IN CASE THE ANSWER IS GZIPPED
if response.info().get('Content-Encoding') == 'gzip':
    buf = BytesIO(data)
    f = gzip.GzipFile(fileobj=buf)
    data = f.read()

# ADAPT THE DECODING TO THE CHARSET OF THE ANSWER
print(json.loads(data.decode(response.info().get_param('charset') or 'utf-8')))
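The gzip branch above can be checked without touching the network, by compressing a sample payload and running it through the same decompression logic:

```python
import gzip
import json
from io import BytesIO

payload = json.dumps({"version": "1.0"}).encode("utf-8")
data = gzip.compress(payload)  # simulate a gzipped response body

# Same decompression logic as in the answer above
buf = BytesIO(data)
data = gzip.GzipFile(fileobj=buf).read()
print(json.loads(data.decode("utf-8")))  # {'version': '1.0'}
```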

Thanks again!
