XPath expression not evaluating as expected



Below is the HTML from which I am trying to select 2016.

<span id="titleYear">
"("
<a href="/year/2016/?ref_=tt_ov_inf">2016</a>
")"
</span>

This is the XPath expression: //span[@id='titleYear']/a/text()

Unfortunately, for some reason, the expression selects <a href="/year/2016/?ref_=tt_ov_inf">2016</a>

//span[@id='titleYear']/a/text() returns the same result as //span[@id='titleYear']/a and //span[@id='titleYear']/a[text()].

Why does text() have no effect in this case?

Is it because 2016 is not a text node?

It is worth noting that I am using Python 3.6.5 and Scrapy 1.5.0 with Anaconda.

Python script

import scrapy


class IMDBcrawler(scrapy.Spider):
    name = 'imdb'

    def start_requests(self):
        pages = []
        count = 1
        limit = 10
        while (count <= limit):
            str_number = '%07d' % count
            pages.append('https://www.imdb.com/title/tt' + str_number)
            count += 1
        for url in pages:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        yield {
            'nom': response.xpath('//div[@class="title_wrapper"]/h1/text()').extract_first(),
            'ano': response.xpath('//span[@id="titleYear"]/a/text()').extract_first(),
        }

Output

[
  {
    "nom": "Chinese Opium Den\u00a0",
    "ano": "<a href=\"/year/1894/?ref_=tt_ov_inf\">1894</a>"
  },
  {
    "nom": "Pauvre Pierrot\u00a0",
    "ano": "<a href=\"/year/1892/?ref_=tt_ov_inf\">1892</a>"
  },
  {
    "nom": "Carmencita\u00a0",
    "ano": "<a href=\"/year/1894/?ref_=tt_ov_inf\">1894</a>"
  },
  {
    "nom": "Un bon bock\u00a0",
    "ano": "<a href=\"/year/1892/?ref_=tt_ov_inf\">1892</a>"
  },
  {
    "nom": "Blacksmith Scene\u00a0",
    "ano": "<a href=\"/year/1893/?ref_=tt_ov_inf\">1893</a>"
  },
  {
    "nom": "Corbett and Courtney Before the Kinetograph\u00a0",
    "ano": "<a href=\"/year/1894/?ref_=tt_ov_inf\">1894</a>"
  },
  {
    "nom": "Employees Leaving the Lumi\u00e8re Factory\u00a0",
    "ano": "<a href=\"/year/1895/?ref_=tt_ov_inf\">1895</a>"
  },
  {
    "nom": "Miss Jerry\u00a0",
    "ano": "<a href=\"/year/1894/?ref_=tt_ov_inf\">1894</a>"
  },
  {
    "nom": "Le clown et ses chiens\u00a0",
    "ano": "<a href=\"/year/1892/?ref_=tt_ov_inf\">1892</a>"
  },
  {
    "nom": "Edison Kinetoscopic Record of a Sneeze\u00a0",
    "ano": "<a href=\"/year/1894/?ref_=tt_ov_inf\">1894</a>"
  }
]

Thanks.

I'm not sure what the problem with Scrapy is, but using lxml directly together with requests, a simpler XPath with findtext works fine:

import requests
from lxml import html

pages = []
for count in range(1, 10):
    str_num = '%07d' % count
    res = html.fromstring(requests.get('https://www.imdb.com/title/tt' + str_num).text)
    pages.append({'nom': res.findtext('.//div[@class="title_wrapper"]/h1'),
                  'ano': res.findtext('.//span[@id="titleYear"]/a')})

Result:

In [40]: pages
Out[40]:
[{'ano': '1894', 'nom': 'Carmencita\xa0'},
 {'ano': '1892', 'nom': 'Le clown et ses chiens\xa0'},
 {'ano': '1892', 'nom': 'Pauvre Pierrot\xa0'},
 {'ano': '1892', 'nom': 'Un bon bock\xa0'},
 {'ano': '1893', 'nom': 'Blacksmith Scene\xa0'},
 {'ano': '1894', 'nom': 'Chinese Opium Den\xa0'},
 {'ano': '1894', 'nom': 'Corbett and Courtney Before the Kinetograph\xa0'},
 {'ano': '1894', 'nom': 'Edison Kinetoscopic Record of a Sneeze\xa0'},
 {'ano': '1894', 'nom': 'Miss Jerry\xa0'}]
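
To check the selector without hitting the live site, the HTML fragment from the question can be parsed offline. The following is only a rough sketch: it swaps in the standard library's xml.etree.ElementTree for lxml (this works here only because the fragment happens to be well-formed XML, which real pages usually are not), and the names FRAGMENT, year_from_fragment and clean_title are made up for illustration. It suggests that 2016 is an ordinary text node of the a element, and it also shows one way to drop the trailing non-breaking space from the titles.

```python
import xml.etree.ElementTree as ET

# HTML fragment copied from the question. It is well-formed XML,
# so the standard-library parser can read it as-is.
FRAGMENT = '''<span id="titleYear">
"("
<a href="/year/2016/?ref_=tt_ov_inf">2016</a>
")"
</span>'''


def year_from_fragment(markup):
    """Return the text of the <a> under the titleYear span, or None."""
    # Wrap the markup in a dummy root so './/span' can match the span itself.
    root = ET.fromstring('<root>' + markup + '</root>')
    return root.findtext(".//span[@id='titleYear']/a")


def clean_title(raw):
    """Strip surrounding whitespace, including the non-breaking space U+00A0."""
    return raw.strip()


if __name__ == '__main__':
    print(year_from_fragment(FRAGMENT))
    print(repr(clean_title('Carmencita\xa0')))
```

If the offline check returns the year but the spider still yields the whole element, the difference is more likely in how the spider is being run or which version of the script is executed than in the XPath itself; that is a guess, not something confirmed above.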
