404 response in Scrapy shell, different result in the browser

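I am scraping the OddsPortal website. A simple query for the title text returns ['OddsPortal: Page not found'], but that title does not appear when I run the same query in the browser console. I noticed that when the shell loads, the response is: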

[s]   response   <404 https://www.oddsportal.com/darts/europe/european-championship/results/>

In my terminal:

scrapy shell 'https://www.oddsportal.com/darts/europe/european-championship/results/' --set="ROBOTSTXT_OBEY=False"
response.css('title::text').extract()
['OddsPortal: Page not found']

I expected the selector above to return:

European Championship Results & Historical Odds, Darts Europe Archive

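I also get this error when I run my own requests. As shown here, this site does not allow scraping. My guess is that they have some protection in place to block scraping attempts. I did succeed with a non-headless browser driven by Selenium, so I suggest scraping it that way. Most of the site also appears to be rendered dynamically with JavaScript, which is another +1 for Selenium. In this example I parse the page with Beautiful Soup, which I highly recommend.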

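from selenium import webdriver
from bs4 import BeautifulSoup

# Launch a regular (non-headless) Chrome window
driver = webdriver.Chrome()
try:
    driver.get('https://www.oddsportal.com/darts/europe/european-championship/results/')
    # Parse the rendered page source with Beautiful Soup
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    print(soup.title.text)
finally:
    driver.quit()  # always close the browser when done
# output
# European Championship Results & Historical Odds, Darts Europe Archive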
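
Before parsing a fetched page, it can help to confirm that you received the real page and not the error page seen in the Scrapy shell. The following is a minimal sketch using only the standard library. The helper names `extract_title` and `looks_like_error_page` are hypothetical, and the check assumes the error page keeps the "Page not found" title shown in the question.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        # Title text may arrive in several chunks, so accumulate it
        if self._in_title:
            self.title += data


def extract_title(html):
    """Return the stripped text of the first <title>, or '' if there is none."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip()


def looks_like_error_page(html, status=None):
    """Hypothetical check: True for a 404 status or a 'Page not found' title.

    Assumes the error page title contains 'Page not found', as in the question.
    """
    return status == 404 or "page not found" in extract_title(html).lower()
```

In the Scrapy shell this could be called as `looks_like_error_page(response.text, response.status)`. With Selenium, which does not expose the HTTP status, pass only `driver.page_source`.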
