Scraping websites with Python 3 (Scrapy, BS4) yields incomplete data. Can't figure out why

A while ago I set up a web scraper with BS4 that logs the price of a whisky every day:

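from typing import Optional

import requests
from bs4 import BeautifulSoup


def getPrice() -> Optional[float]:
    URL = "https://www.thewhiskyexchange.com/p/2940/suntory-yamazaki-12-year-old"
    try:
        website = requests.get(URL)
    except requests.RequestException:
        print("ERROR requesting Price")
        return None  # nothing to parse if the request failed

    try:
        soup = BeautifulSoup(website.content, 'html.parser')
        # The price text is the first node inside the <p> element
        price = str(soup.find("p", class_="product-action__price").next)
        # Drop the leading currency symbol before converting
        return float(price[1:])
    except (AttributeError, ValueError):
        print("ERROR parsing Price")
        return None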

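This worked as intended. The response contained the complete page, and the correct value was extracted.

I am now trying to scrape data for other whiskies from other sites, this time with Scrapy.

I tried the following URLs: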

https://www.thegrandwhiskyauction.com/past-auctions/q-macallan/180-per-page/relevance

https://www.ebay.de/sch/i.html?_sacat=0&LH_Complete=1&_udlo=&_udhi=&_samilow=&_samihi=&_sadis=10&_fpos=&LH_SALE_CURRENCY=0&_sop=12&_dmd=1&_fosrp=1&_nkw=macallan&rt=nc

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "whisky"

    def start_requests(self):
        user_agent = 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36'
        urls = [
            'https://www.thegrandwhiskyauction.com/past-auctions/q-macallan/180-per-page/relevance',
        ]
        for url in urls:
            # Pass the browser-like User-Agent with each request
            yield scrapy.Request(url=url, headers={'User-Agent': user_agent}, callback=self.parse)

    def parse(self, response):
        # Second-to-last path segment, e.g. "180-per-page"
        page = response.url.split("/")[-2]
        filename = f'whisky-{page}.html'
        #data = response.css('.itemDetails').getall()
        # Dump the raw response body so it can be inspected
        with open(filename, 'wb') as f:
            f.write(response.body)

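I just adapted the basic example from the tutorial to create the quick prototype above. However, it does not return the complete page. The response body is missing several tags, in particular the content I am looking for.

I then tried to solve this with BS4 again, like this: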

import requests
from bs4 import BeautifulSoup

URL = "https://www.thegrandwhiskyauction.com/past-auctions/q-macallan/180-per-page/relevance"
website = requests.get(URL)
soup = BeautifulSoup(website.content, 'html.parser')
with open("whiskeySoup.html", 'w') as f:
    f.write(str(soup.body))

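To my surprise, this produced the same result. The response and its body did not contain the complete page, and all the data I want was missing.

I also included a User-Agent header, because I read that some sites recognise requests from bots and spiders and do not serve them all the data. That did not solve the problem either.

I cannot figure out, or debug, why the data requested from these URLs is incomplete. Is there a way to solve this with Scrapy?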

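Many websites rely heavily on JavaScript to build the final HTML of a page. When you send a request to the server, it returns HTML together with some scripts. A web browser (Chrome, Firefox and others) executes that JavaScript and then displays the final HTML that you see. Scrapy, requests and similar libraries cannot execute JavaScript, so the HTML the crawler receives is different from what you see in the browser.

If you want to see the site the way the crawler sees it, run "scrapy view {url}", which opens the fetched page in your browser. If you want the raw HTML that the crawler receives, run "scrapy fetch {url}".

When you work with Scrapy, it is best to open the URL in the shell first (the command is "scrapy shell {url}") and try out your xpath or css selectors there, for example response.css('some_css').css('again_some_css'). To see what response you got in the shell, just type view(response) and it will open the received response in your browser. I hope that is clear.

However, if the JavaScript has to be processed before you handle the response, you can use Selenium with a headless browser, or the lightweight web browser Splash. Selenium is easy to use.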
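As a rough illustration of the Selenium route, here is a minimal sketch, not a tested solution for these particular sites. It assumes Selenium 4 and a local Chrome installation; the function name fetch_rendered_html and its parameters wait_css and timeout are made up for this example. It loads the page in headless Chrome, optionally waits until an element matching a CSS selector actually contains text, and returns the HTML after the scripts have run.

```python
from typing import Optional


def fetch_rendered_html(url: str, wait_css: Optional[str] = None,
                        timeout: float = 15.0) -> str:
    """Load `url` in headless Chrome and return the HTML after JavaScript ran.

    `wait_css` is a hypothetical convenience parameter: if given, wait until
    at least one element matching that CSS selector has non-empty text.
    """
    # Imported lazily so this module can be imported without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")  # run Chrome without a visible window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        if wait_css:
            # Poll until the scripts have filled in the element's text;
            # raises a TimeoutException if that does not happen in time.
            WebDriverWait(driver, timeout).until(
                lambda d: any(
                    el.text.strip()
                    for el in d.find_elements(By.CSS_SELECTOR, wait_css)
                )
            )
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even on errors
```

The returned string can then be parsed like any other HTML, for example with BeautifulSoup(html, 'html.parser') or scrapy.Selector(text=html). Whether a given site tolerates automated browsers is a separate question that this sketch does not address.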

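Edit 1. For the first URL: open the Scrapy shell and check the CSS path div.bidPrice::text. You will see that the element is empty in the HTML the crawler receives; its content is generated dynamically.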
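The same kind of check can be run outside the Scrapy shell. The sketch below uses only the Python standard library so that it runs without Scrapy installed; inside the shell, response.css('div.bidPrice::text').getall() serves the same purpose. The helper text_by_class and both HTML snippets are invented for illustration and are not the real markup of the auction site.

```python
from html.parser import HTMLParser

# Void elements have no closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ClassTextCollector(HTMLParser):
    """Collect the text found inside every element carrying a given class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.depth = 0    # > 0 while we are inside a matching element
        self.texts = []   # one string per matching element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1            # nested tag inside a match
        elif self.css_class in classes:
            self.depth = 1             # entering a new matching element
            self.texts.append("")

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.texts[-1] += data


def text_by_class(html, css_class):
    """Return the stripped text of each element that has `css_class`."""
    parser = ClassTextCollector(css_class)
    parser.feed(html)
    parser.close()
    return [text.strip() for text in parser.texts]


# Made-up samples: what a crawler might receive vs. what a browser shows.
static_html = '<div class="lot"><div class="bidPrice"></div></div>'
rendered_html = '<div class="lot"><div class="bidPrice">£120</div></div>'

print(text_by_class(static_html, "bidPrice"))    # [''] -> filled in later by JS
print(text_by_class(rendered_html, "bidPrice"))  # ['£120']
```

An element that is present but empty in the fetched HTML, as in the first sample, is a strong hint that the value is inserted by JavaScript after the page loads.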
