Python requests fetch of a URL returns 404, but the URL works in a browser



I have a crawling Python script that chokes on one URL: pulsepoint.com/sellers.json

The bot fetches the content with a standard requests call but gets a 404 back. In a browser the URL works (there is a 301 redirect, but requests can follow that). My first hunch was a problem with the request headers, so I copied my browser's configuration. The code looks like this:

import logging
import requests

crawled_url = "pulsepoint.com"
seller_json_url = 'http://{thehost}/sellers.json'.format(thehost=crawled_url)
print(seller_json_url)
myheaders = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'fr,fr-FR;q=0.8,en-US;q=0.5,en;q=0.3',
    'Accept-Encoding': 'gzip, deflate, br',
    'Connection': 'keep-alive',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache'
}
r = requests.get(seller_json_url, headers=myheaders)
logging.info("  %d" % r.status_code)

But I still get the 404 error.

My next guesses:

  • A login? Not used here
  • Cookies? I don't see any

So how is their server blocking my bot? By the way, this is a sellers.json URL that is meant to be crawled, so nothing illegal is going on.

Thanks in advance!
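One way to narrow this down is to look at every hop the request actually takes: `requests` keeps the intermediate redirect responses in `r.history`. A small sketch (the helper `describe_redirects` is my own, not from the question):

```python
import requests

def describe_redirects(response):
    """Return (status, url) pairs for each redirect hop, ending with the final response."""
    hops = [(r.status_code, r.url) for r in response.history]
    hops.append((response.status_code, response.url))
    return hops

# Usage (network required), e.g. with the headers from the question:
# r = requests.get("http://pulsepoint.com/sellers.json", headers=myheaders)
# for status, url in describe_redirects(r):
#     print(status, url)
```

Seeing which hop returns the 404 (the original host or the redirect target) tells you which server is rejecting the bot.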

You can also apply the following workaround for the SSL certificate error:

from urllib.request import urlopen
import ssl
import json

# Workaround for the SSL certificate error: disable verification globally
ssl._create_default_https_context = ssl._create_unverified_context

crawled_url = "pulsepoint.com"
seller_json_url = 'http://{thehost}/sellers.json'.format(thehost=crawled_url)
print(seller_json_url)
response = urlopen(seller_json_url).read()
# print in dictionary format
print(json.loads(response))

Sample response:

{'contact_email': 'PublisherSupport@pulsepoint.com', 'contact_address': '360 Madison Ave, 14th Floor, NY, NY 10017', 'version': '1.0', 'identifiers': [{'name': 'TAG-ID', 'value': '89ff185a4c4e857c'}], 'sellers': [{'seller_id': '508738', …

… 'seller_type': 'PUBLISHER'}, {'seller_id': '562225', 'name': 'EL DIARIO', 'domain': 'impromedia.com', 'seller_type': 'PUBLISHER'}]}
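Once the JSON is loaded, the `sellers` array is just a list of dicts and can be filtered like one. A sketch with made-up entries shaped like the sample response above (keys follow the sellers.json format shown there; the data itself is illustrative):

```python
import json

sample = json.loads("""
{
  "version": "1.0",
  "sellers": [
    {"seller_id": "508738", "name": "EXAMPLE A", "domain": "example-a.com", "seller_type": "PUBLISHER"},
    {"seller_id": "562225", "name": "EL DIARIO", "domain": "impromedia.com", "seller_type": "PUBLISHER"}
  ]
}
""")

# Collect the domains of all PUBLISHER entries
publishers = [s["domain"] for s in sample["sellers"] if s["seller_type"] == "PUBLISHER"]
print(publishers)  # ['example-a.com', 'impromedia.com']
```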

You can simply go straight to the final link and extract the data, without having to go through the 301 redirect to reach it:

import requests

headers = {"Upgrade-Insecure-Requests": "1"}
response = requests.get(
    url="https://projects.contextweb.com/sellersjson/sellers.json",
    headers=headers,
    verify=False,
)

OK, just for anyone else who lands here, this is a hardened version of the answer, because:

  • Some sites require headers before they will answer
  • Some sites use unusual encodings
  • Some sites send a gzipped answer even when it was not requested
import urllib.request
import ssl
import json
from io import BytesIO
import gzip

ssl._create_default_https_context = ssl._create_unverified_context

crawled_url = "pulsepoint.com"
seller_json_url = 'http://{thehost}/sellers.json'.format(thehost=crawled_url)
req = urllib.request.Request(seller_json_url)

# ADDING THE HEADERS
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0')
req.add_header('Accept', 'application/json,text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8')

response = urllib.request.urlopen(req)
data = response.read()

# IN CASE THE ANSWER IS GZIPPED
if response.info().get('Content-Encoding') == 'gzip':
    buf = BytesIO(data)
    f = gzip.GzipFile(fileobj=buf)
    data = f.read()

# ADAPT THE DECODING TO THE CHARSET OF THE ANSWER
print(json.loads(data.decode(response.info().get_param('charset') or 'utf-8')))
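The gzip branch above can be checked without touching the network, by compressing a sample payload and running it through the same decompression logic:

```python
import gzip
import json
from io import BytesIO

payload = json.dumps({"version": "1.0"}).encode("utf-8")
data = gzip.compress(payload)  # simulate a gzipped response body

# Same decompression logic as in the answer above
buf = BytesIO(data)
data = gzip.GzipFile(fileobj=buf).read()
print(json.loads(data.decode("utf-8")))  # {'version': '1.0'}
```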

Thanks again!
