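I am trying to extract information that is wrapped in a bunch of nested tags like the ones shown below. The data I want is "Comedy Nights Live - Full Episodes". I used: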
response.xpath("//h3/span/text()").extract()
response.xpath('//*[@id="meta"]/h3/span/text()').extract()
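Both queries run, but each time I get an empty list. There may be a mistake in how I access the data with these commands, but as a beginner I need guidance on how to reach the desired target.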
<a id="meta" class="yt-simple-endpoint style-scope ytd-grid-playlist-renderer" href="/watch?v=q1XwumKHSg8&list=PLX18mvVSh-bz3qlgf-uomp8zktOG5Rdj3">
<h3 class="style-scope ytd-grid-playlist-renderer">
<span id="video-title" class="style-scope ytd-grid-playlist-renderer">
Comedy Nights Live - Full Episodes
</span>
</h3>
</a>
This is the spider file.
# -*- coding: utf-8 -*-
import scrapy


class YtubeSpider(scrapy.Spider):
    name = 'ytube'
    # allowed_domains takes bare domain names, not URLs with a path
    allowed_domains = ['www.youtube.com']
    start_urls = ['http://www.youtube.com/user/KapilComedyNights/playlists/']

    def parse(self, response):
        pass
Scrapy, Python 2.7!
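Look at how the page is put together in your browser's developer tools. You will see that YouTube is using AJAX, so the markup you see in the browser is built by JavaScript and is not in the HTML that Scrapy downloads. Download the AJAX data directly and parse it. Also keep in mind that you are accessing the site anonymously.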
Try turning AJAX off with ajax=0:
https://www.youtube.com/user/KapilComedyNights/playlists/?ajax=0&app=desktop
You will get a different response:
response.xpath('//div[@class="yt-lockup-content"]/h3/a/@title').extract()
[u'Comedy - Full Episodes',
u'Comedy - Audio',
u'Comedy Nights Live',
u'Comedy Nights with Kapil - Shorts',
u'Comedy Nights Live - Full Episodes',
u'COMEDY NIGHTS LIVE - FULL EPISODES',
u'Comedy Nights Bachao',
u'COMEDY NIGHTS BACHAO - FULL EPISODES',
u'Comedy Nights Bachao - Full Episodes']