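I'm new to Python and Scrapy. I have been trying to scrape a website with fragmented URLs. I'm sending a POST request to get the response, but unfortunately it doesn't return any results.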
def start_requests(self):
    try:
        form = {
            'menu': '6',
            'browseby': '8',
            'sortby': '2',
            'media': '3',
            'ce_id': '1428',
            'ot_id': '19999',
            'marker': '354',
            'getpage': '1',
        }
        head = {
            'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
            # 'Content-Length': '78',
            # 'Host': 'onlinelibrary.ectrims-congress.eu',
            # 'Accept-Encoding': 'gzip, deflate, br',
            # 'Connection': 'keep-alive',
            'XMLHttpRequest': 'XMLHttpRequest',
        }
        urls = [
            'https://onlinelibrary.ectrims-congress.eu/ectrims/listing/conferences'
        ]
        request_body = urllib.parse.urlencode(form)
        print(request_body)
        print(type(request_body))
        for url in urls:
            req = Request(url=url, body=request_body, method='POST', headers=head, callback=self.parse)
            req.headers['Cookie'] = 'js_enabled=true; is_cookie_active=true;'
            yield req
    except Exception as e:
        print('the error is {}'.format(e))
I keep getting this error:
[scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <POST https://onlinelibrary.ectrims-congress.eu/ectrims/listing/conferences> (failed 4 times): 400 Bad Request
When I check the same request in Postman, I get the expected output. Can anyone help me?
Try using FormRequest instead of Request.
https://docs.scrapy.org/en/latest/topics/request-response.html#scrapy.http.FormRequest
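If you want to send the POST request with a plain Request, you have to convert the dictionary to a string yourself, because body accepts a str or bytes and not a dict. Here that is done with json.dumps().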
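To see what actually ends up in the body, this standard-library sketch serializes the form dictionary from the question in two ways: URL-encoded, as a browser form would send it, and as JSON. It only builds strings and sends nothing.

```python
import json
from urllib.parse import urlencode

FORM = {
    "menu": "6", "browseby": "8", "sortby": "2", "media": "3",
    "ce_id": "1428", "ot_id": "19999", "marker": "354", "getpage": "1",
}

# Form encoding: key=value pairs joined with "&".
form_body = urlencode(FORM)
# JSON encoding: the same dictionary as a JSON object string.
json_body = json.dumps(FORM)

print(form_body)
print(len(form_body))
print(json_body)
```

The URL-encoded string is 78 characters long, which matches the commented-out Content-Length of 78 in the question's headers. That suggests the original browser request carried a form-encoded body, but I would still compare both variants against the request that works in Postman.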
Here is a working solution:
import json

import scrapy

class EventsSpider(scrapy.Spider):
    name = 'events'

    def start_requests(self):
        form = {'menu': '6', 'browseby': '8', 'sortby': '2', 'media': '3', 'ce_id': '1428', 'ot_id': '19999', 'marker': '354', 'getpage': '1'}
        head = {
            'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
            'XMLHttpRequest': 'XMLHttpRequest',
        }
        url = 'https://onlinelibrary.ectrims-congress.eu/ectrims/listing/conferences'
        # Request does not encode a dict for you, so serialize it to a string first.
        request_body = json.dumps(form)
        req = scrapy.Request(url=url, body=request_body, method='POST', headers=head, callback=self.parse)
        yield req

    def parse(self, response):
        # The endpoint answers with JSON; print its top-level keys.
        print(response.json().keys())
Output:
dict_keys(['html', 'type', 'debug', 'total_pages', 'current_page', 'total_items', 'login'])
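The total_pages and current_page keys suggest the listing is paginated. Assuming (I have not verified this) that both hold numbers or numeric strings and that the getpage form field selects the page, a small helper like the hypothetical next_page_form below could build the form for the following request. The sample payload is invented for illustration.

```python
def next_page_form(form, payload):
    """Return a copy of form pointing at the next page, or None on the last page."""
    current = int(payload["current_page"])
    total = int(payload["total_pages"])
    if current >= total:
        return None
    next_form = dict(form)  # copy so the caller's dict is not modified
    next_form["getpage"] = str(current + 1)
    return next_form


# Invented sample values, not real server output.
sample_form = {"menu": "6", "getpage": "1"}
sample_payload = {"current_page": 1, "total_pages": 3}
print(next_page_form(sample_form, sample_payload))
```

In parse you would call it with response.json() and yield a new request whenever it returns a dictionary.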
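Bonus tip: if you have the request working in Postman, you can click the "Code" button in the right-hand panel, which looks like </>. If you choose Python, it generates code that uses the requests library.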
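I won't guess at the exact code Postman generates, but a minimal requests sketch of the same POST might look like the following. It only prepares the request without sending it, so you can inspect the body and headers and compare them with what your spider sends. The form values are the ones from the question.

```python
import requests

URL = "https://onlinelibrary.ectrims-congress.eu/ectrims/listing/conferences"
FORM = {
    "menu": "6", "browseby": "8", "sortby": "2", "media": "3",
    "ce_id": "1428", "ot_id": "19999", "marker": "354", "getpage": "1",
}

# Build the request without sending it; a dict passed as data is form-encoded.
prepared = requests.Request("POST", URL, data=FORM).prepare()
print(prepared.body)
print(prepared.headers["Content-Type"])

# To actually send it (needs network access):
# response = requests.Session().send(prepared)
# print(response.status_code)
```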