Item not included in a for loop in Scrapy

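I think this problem probably has a simple solution... All I am trying to do is use my variable item['genre'] to extract the text that lists the genre, which is simple enough. However, because the element I am extracting appears only once on the page I am scraping, item['genre'] is not filled in when I loop over the other items (such as item['artist']). Any help would be much appreciated. Here is what I believe is the relevant code.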

def parse_item(self, response):  # http://stackoverflow.com/questions/15836062/scrapy-crawlspider-doesnt-crawl-the-first-landing-page
    for info in response.xpath('//div[@class="entry vevent"] | //div[@id="page"]'):
        item = TutorialItem()  # Extract items from the items folder.
        item['artist'] = info.xpath('.//span[@class="summary"]//text()').extract()  # Extract artist information.
        item['date'] = info.xpath('.//span[@class="dates"]//text()').extract()  # Extract date information.
        preview = ''.join(str(s) for s in item['artist'])
        item['genre'] = info.xpath('.//div[@class="header"]//text()').extract()

I really hope this makes sense; apologies if it does not!


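The reason you get the genre only once is that the list returned by response.xpath('//div[@class="entry vevent"] | //div[@id="page"]') contains one div (id="page") and a number of divs (class="entry vevent").

When you iterate over that list, only div[@id="page"] satisfies the genre XPath,

that is, this div contains another div with class="header":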

In [1]: a = response.xpath('//div[@class="entry vevent"] | //div[@id="page"]')
In [2]: a[0].xpath('.//div[@class="header"]//text()').extract()
Out[2]: [u'Clubbing Overview']
In [3]: a[1].xpath('.//div[@class="header"]//text()').extract()
Out[3]: []
In [4]: a[2].xpath('.//div[@class="header"]//text()').extract()
Out[4]: []
...

The div[@class="entry vevent"] elements, on the other hand, do not contain any div with class="header", so for them the genre query returns an empty list.

Does that make sense?

One solution is to move the genre XPath outside the loop. Alternatively, you can change the genre XPath to:

info.xpath('.//div[@class="header"]//text() | ./parent::div[@class="rows"]/preceding-sibling::div[@class="header"]//text()').extract()
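The first option (looking the genre up once, outside the loop) can be sketched as follows. This is only an illustration under stated assumptions: it uses the standard library's xml.etree.ElementTree instead of Scrapy selectors so that it runs without Scrapy installed, the page markup is a hypothetical layout guessed from the XPath expressions above, and text_of and parse_items are made-up helper names. In a real spider the same idea would be a single response.xpath(...) call placed before the for loop.

```python
import xml.etree.ElementTree as ET

# Hypothetical page layout, assumed from the XPath expressions in this thread.
PAGE = """
<div id="page">
  <div class="header">Clubbing Overview</div>
  <div class="rows">
    <div class="entry vevent">
      <span class="summary">Artist A</span>
      <span class="dates">1 Jan</span>
    </div>
    <div class="entry vevent">
      <span class="summary">Artist B</span>
      <span class="dates">2 Jan</span>
    </div>
  </div>
</div>
"""


def text_of(node, path):
    """Return the stripped, non-empty text fragments under every element matching path."""
    return [t.strip() for el in node.findall(path) for t in el.itertext() if t.strip()]


def parse_items(root):
    # Look the genre up once, at page level, before the loop starts.
    genre = text_of(root, ".//div[@class='header']")
    # Loop only over the entry divs; each one reuses the page-level genre.
    for entry in root.findall(".//div[@class='entry vevent']"):
        yield {
            "artist": text_of(entry, ".//span[@class='summary']"),
            "date": text_of(entry, ".//span[@class='dates']"),
            "genre": genre,
        }


root = ET.fromstring(PAGE)
items = list(parse_items(root))
```

With this layout every entry carries the same genre value, because the header is read from the page rather than from each entry.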

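I also think you are missing a return item at the end of the loop. (Since the loop builds one item per element, yield item is the safer choice: a plain return inside the loop would stop after the first element.)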
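On the point about handing the item back at the end of the loop: when a callback loops over several elements, return ends the function on the first pass, while yield emits one item per iteration. A minimal plain-Python sketch of the difference, with hypothetical function names and data (no Scrapy involved):

```python
def parse_with_return(rows):
    for row in rows:
        item = {"artist": row}
        return item  # exits the function during the first iteration


def parse_with_yield(rows):
    for row in rows:
        item = {"artist": row}
        yield item  # hands back one item per iteration


rows = ["Artist A", "Artist B", "Artist C"]
first_only = parse_with_return(rows)      # a single dict
all_items = list(parse_with_yield(rows))  # one dict per row
```

Scrapy callbacks may return any iterable of items, so writing the callback as a generator fits a loop like the one in the question.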
