Why does this request work with requests but not with Scrapy?



I'm trying to scrape a web page that loads the page 2 results as I scroll. So I grabbed the URL of the API it calls (img), which should work fine.

But it only works if I use the requests library. When I run requests.get() with the same URL, I get a 200 response, but with Scrapy it returns a 500 status. I don't know why this doesn't work with Scrapy. Is there any explanation?

This is what I'm doing:

Thanks.

import scrapy
import json
import re


class ScrapeVagas(scrapy.Spider):
    name = "vagas"
    base_url = "https://www.trabalhabrasil.com.br/api/v1.0/Job/List?idFuncao=0&idCidade=5345&pagina=%d&pesquisa=&ordenacao=1&idUsuario="
    start_urls = [base_url % 100]
    download_delay = 1

    def parse(self, response):
        vagas = json.loads(response.text)

        for vaga in range(0, len(vagas)):
            yield {
                "vaga": vagas[vaga]["df"],
                "salario": re.sub("[R$.]", "", vagas[vaga]["sl"]).strip()
            }

You are getting a 500 Internal Server Error, a response code indicating that the server encountered an unexpected condition that prevented it from fulfilling the request. Request headers are needed here to get the correct response. See the output in the scrapy shell:

import scrapy

base_url = "https://www.trabalhabrasil.com.br/api/v1.0/Job/List?idFuncao=0&idCidade=5345&pagina=%d&pesquisa=&ordenacao=1&idUsuario="
start_urls = [base_url % 100]
url = start_urls[0]
headers = {
    "USER-AGENT": "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.3",
    "referer": "https://www.trabalhabrasil.com.br/vagas-empregos-em-sao-paulo-sp",
    "authority": "www.trabalhabrasil.com.br",
    "path": "/api/v1.0/Job/List?idFuncao=100&idCidade=5345&pagina=65&pesquisa=&ordenacao=1&idUsuario=",
    "scheme": "https",
    "accept": "*/*",
    "accept-language": "en-US,en;q=0.9,bn;q=0.8",
    "dnt": "1",
    "sec-fetch-dest": "empty",
    "sec-fetch-mode": "cors",
    "sec-fetch-site": "same-origin",
}

r = scrapy.Request(url, headers=headers)
fetch(r)
2021-01-22 00:30:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.trabalhabrasil.com.br/api/v1.0/Job/List?idFuncao=0&idCidade=5345&pagina=100&pesquisa=&ordenacao=1&idUsuario=> (referer: https://www.trabalhabrasil.com.br/vagas-empregos-em-sao-paulo-sp)


In [19]: response.status
Out[19]: 200
