Scrapy "load more" problem - CSS selector

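I am trying to scrape a website that has a "Show More" link at the bottom of the page, which fetches more data. Here is the link to the page: https://untappd.com/v/total-wine-more/47792

Here is my full code: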

import scrapy


class Untap(scrapy.Spider):
    name = "Untappd"
    allowed_domains = ["untappd.com"]
    start_urls = [
        'https://untappd.com/v/total-wine-more/47792'  # URL: Major liquor store chain with Towson location.
    ]

    def parse(self, response):
        for beer_details in response.css('div.beer-details'):
            yield {
                'name': beer_details.css('h5 a::text').getall(),  # Name of Beer
                'type': beer_details.css('h5 em::text').getall(),  # Style of Beer
                'ABVIBUs': beer_details.css('h6 span::text').getall(),  # ABV and IBU of Beer
                'Brewery': beer_details.css('h6 span a::text').getall()  # Brewery that produced Beer
            }
        # The "load more" attempt that is not working:
        load_more = response.css('a.yellow button more show-more-section track-click::attr(href)').get()
        if load_more is not None:
            load_more = response.urljoin(load_more)
            yield scrapy.Request(load_more, callback=self.parse)

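I tried using the "load_more" block at the bottom to keep loading more data to scrape, but nothing I have fed it from the site's HTML has worked.

Here is the HTML from the website: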

<a href="javascript:void(0);" class="yellow button more show-more-section track-click" data-track="venue" data-href=":moremenu" data-section-id="140216931" data-venue-id="47792" data-menu-id="38988361">Show More Beers</a>

I would like the spider to scrape what is shown on the page, then click the link and keep scraping. Any help would be appreciated.

Short answer:

curl 'https://untappd.com/venue/more_menu/47792/15?section_id=140248357' -H 'x-requested-with: XMLHttpRequest'

Clicking the button runs JavaScript, so you would normally need Selenium to automate it, but luckily you don't. :)

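Using the browser's developer tools, you can see that clicking the button requests data with the pattern below, where the number after /47792/ increases by 15 each time.

First click: https://untappd.com/venue/more_menu/47792/15?section_id=140248357
Second click: https://untappd.com/venue/more_menu/47792/30?section_id=140248357
Then: https://untappd.com/venue/more_menu/47792/45?section_id=140248357
and so on.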
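The URL pattern described here can be generated rather than typed out. The following is a minimal sketch using only the standard library; the function names and the `pages` and `step` parameters are made up for illustration, and it assumes the offset keeps growing by 15 as observed in the developer tools.

```python
from urllib.parse import urlencode

# Endpoint observed in the browser's developer tools.
BASE = "https://untappd.com/venue/more_menu"


def more_menu_url(venue_id, offset, section_id):
    """Build one "show more" URL for the given offset."""
    query = urlencode({"section_id": section_id})
    return f"{BASE}/{venue_id}/{offset}?{query}"


def more_menu_urls(venue_id, section_id, pages, step=15):
    """Build the first `pages` URLs, with offsets step, 2*step, 3*step, ..."""
    return [
        more_menu_url(venue_id, step * i, section_id)
        for i in range(1, pages + 1)
    ]


if __name__ == "__main__":
    for url in more_menu_urls(47792, 140248357, pages=3):
        print(url)
```

How many pages exist is not known in advance, so a real scraper would probably keep requesting until a response comes back with no more beers.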

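However, if you open one of these URLs directly in the browser, you get no content, because the server expects the "x-requested-with: XMLHttpRequest" header, which marks the request as an AJAX request.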
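As a rough illustration of sending that header from Python, here is a sketch with the standard library's `urllib`. The helper names are hypothetical, and the site may require other headers (a User-Agent, cookies) that were not checked here. In a Scrapy spider the equivalent would be passing `headers={"X-Requested-With": "XMLHttpRequest"}` to `scrapy.Request`.

```python
import urllib.request

# Header that marks the request as AJAX, as in the curl command.
AJAX_HEADERS = {"X-Requested-With": "XMLHttpRequest"}


def build_ajax_request(url):
    """Return a Request object carrying the AJAX header (no network access)."""
    return urllib.request.Request(url, headers=AJAX_HEADERS)


def fetch(url, timeout=10):
    """Send the request and return the response body as text."""
    with urllib.request.urlopen(build_ajax_request(url), timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

Calling `fetch("https://untappd.com/venue/more_menu/47792/15?section_id=140248357")` would then perform the same request as the curl command in the short answer.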

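So you now have the URL pattern and the header you need to code your scraper.

All that is left is parsing each response. :)

PS: the section_id parameter may change (mine differs from yours), but you already have it in the button's HTML, in the data-section-id attribute (data-section-id="140248357" when I loaded the page).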
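To avoid hard-coding the section ID, it can be read from the button itself. The sketch below uses the standard library's `html.parser`, with hypothetical class and function names, and picks the first `<a>` whose class list contains `show-more-section`. Inside a Scrapy callback, something like `response.css('a.show-more-section::attr(data-section-id)').get()` should return the same value.

```python
from html.parser import HTMLParser


class ShowMoreParser(HTMLParser):
    """Collect the attributes of the first "show more" link on the page."""

    def __init__(self):
        super().__init__()
        self.button_attrs = None

    def handle_starttag(self, tag, attrs):
        if tag != "a" or self.button_attrs is not None:
            return
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if "show-more-section" in classes:
            self.button_attrs = attr_map


def find_show_more(html):
    """Return the button's attributes as a dict, or None if it is absent."""
    parser = ShowMoreParser()
    parser.feed(html)
    return parser.button_attrs


if __name__ == "__main__":
    sample = (
        '<a href="javascript:void(0);" '
        'class="yellow button more show-more-section track-click" '
        'data-track="venue" data-href=":moremenu" '
        'data-section-id="140216931" data-venue-id="47792" '
        'data-menu-id="38988361">Show More Beers</a>'
    )
    attrs = find_show_more(sample)
    print(attrs["data-venue-id"], attrs["data-section-id"])
```

The `data-venue-id` and `data-section-id` values are the pieces that go into the more_menu URL.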
