Scrapy selector string does not accept international characters

I am trying to get a Scrapy spider to crawl a website, but one of the elements I need for my item is labelled in Spanish and contains an accented vowel (í).

    titulo=title.select(u'.//["Título Original:"]/text()').extract()

I found a similar question here, but the accepted answer did not work for me.

Adding u at the start of the string solves part of the problem, but then I get this error:

    UnicodeEncodeError: 'ascii' codec can't encode character u'\xed' in position 21: ordinal not in range(128)

I found other questions here that suggest using '.../text()'.decode('utf-8'), but doing that, or using .encode('utf-8') instead, gives me this error:

    exceptions.ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters

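Am I missing something, or is there another way to do this? Or would I be better off writing a regex that matches every part of the string except that one letter?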

This is my code so far:

    def parse(self, response):
        # change the response to an HtmlResponse to allow for utf-8 encoding of the body.
        response = HtmlResponse(url=response.url, status=response.status, headers=response.headers, body=response.body)
        print '\n\nresponse encoding', response.encoding  # the page is encoded in utf-8
        hxs = HtmlXPathSelector(response)
        titles = hxs.select('//div[@class="datosespectaculo"]')
        items = []
        for title in titles:
            item = CarteleraItem()
            titulo = title.select(u'.//["Título Original:"]/text()'.encode('utf-8')).extract()
            Ano = title.select('.//span[@itemprop="copyrightYear"]/text').extract()
            item["title"] = titulo
            item["Ano"] = Ano
            items.append(item)

Here is the source of the web page, for reference:

<div id="contgeneral">
<div class="contyrasca">
<div id="contfix">
<div class="contespectaculo">
<div class="colizq"><div itemscope itemtype="http://schema.org/Movie">
<h1 class="titulo" itemprop="name">15.361</h1>
<img class="afiche" src="http://www.cartelera.com.uy/imagenes_espectaculos/musicdetail13/14770.jpg"/>
<div class="datosespectaculo">
<strong>Título Original:</strong> <em>15.361</em><br />
<strong>Año: </strong><span itemprop="copyrightYear">2014</span><br />
<strong>Género: </strong><span itemprop="genre">Comedia/Drama</span><br />
<strong>Duración: </strong><span itemprop="duration">60&#39;</span><br />
<strong>Calificación: </strong>+18 años<br />

If # -*- coding: utf-8 -*- does not work, you can use a unicode string in which the non-ASCII characters are written as \u escape sequences.

So your XPath selector becomes:

    titulo=title.select(u'.//["T\u00edtulo Original:"]/text()').extract()
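As a quick check of why the escaped spelling is safe, the sketch below compares the literal í with the \u00ed escape and shows what .encode('utf-8') produces. It is written for Python 3, where every str is already unicode (the thread itself uses Python 2), and the helper is_ascii_only is a hypothetical name used only for illustration.

```python
# -*- coding: utf-8 -*-
# Python 3 sketch: the \u escape and the literal character build the same string.
literal = 'Título Original:'
escaped = 'T\u00edtulo Original:'

# Both spellings produce identical text, so either can be used in a selector.
same_text = literal == escaped

# Encoding turns the text into bytes; í becomes the two UTF-8 bytes C3 AD.
encoded = escaped.encode('utf-8')


def is_ascii_only(text):
    """Return True when every character fits in 7-bit ASCII (hypothetical helper)."""
    return all(ord(ch) < 128 for ch in text)


print(same_text)
print(type(encoded).__name__)
print(is_ascii_only(escaped))
```

The ValueError quoted in the question asks for Unicode or ASCII input. UTF-8 encoded bytes that contain í are neither, which is consistent with the selector accepting the unicode string but rejecting its encoded form.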

I usually use a simple Python shell session to check the escape sequences:

    paul@wheezy:~$ python
    Python 2.7.3 (default, Jan  2 2013, 13:56:14)
    [GCC 4.7.2] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> u'.//["Título Original:"]/text()'
    u'.//["T\xedtulo Original:"]/text()'
    >>> u'.//["T\u00edtulo Original:"]/text()'
    u'.//["T\xedtulo Original:"]/text()'
    >>>
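Separately from the encoding issue, an XPath step normally needs a node test before a predicate, so the expression in the question may still fail to match once the string itself is accepted. Judging from the page source, the label sits in a strong element and the value in the em element that follows it. The sketch below illustrates that lookup with the standard library's ElementTree on a well-formed copy of two lines of the fragment; it is not Scrapy code, and value_after_label is a hypothetical helper. In Scrapy, an expression along the lines of .//strong[text()="Título Original:"]/following-sibling::em[1]/text() would probably do the same job, though that is untested here.

```python
import xml.etree.ElementTree as ET

# Well-formed copy of part of the page source from the question.
FRAGMENT = u'''<div class="datosespectaculo">
<strong>T\u00edtulo Original:</strong> <em>15.361</em><br />
<strong>A\u00f1o: </strong><span itemprop="copyrightYear">2014</span><br />
</div>'''


def value_after_label(root, label):
    """Return the text of the element directly after <strong>label</strong>."""
    children = list(root)
    for index, child in enumerate(children[:-1]):
        # Compare stripped text so a trailing space in the label does not matter.
        if child.tag == 'strong' and (child.text or '').strip() == label:
            return children[index + 1].text
    return None


root = ET.fromstring(FRAGMENT)
titulo = value_after_label(root, u'T\u00edtulo Original:')
ano = value_after_label(root, u'A\u00f1o:')
print(titulo, ano)
```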

Try adding the following line at the top of your Python file:

# -*- coding: utf-8 -*-

For a full explanation, read the documentation.
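To see what that declaration changes, the sketch below asks Python which encoding it would use to read a piece of source code. It relies on tokenize.detect_encoding from the standard library and runs on Python 3, where the default without a declaration is UTF-8; on Python 2, which this thread uses, the default is ASCII, which is why the declaration matters there. The function name declared_encoding and the sample byte strings are made up for illustration.

```python
import io
import tokenize


def declared_encoding(source_bytes):
    """Return the encoding Python would use to decode this source code."""
    reader = io.BytesIO(source_bytes).readline
    encoding, _lines_read = tokenize.detect_encoding(reader)
    return encoding


# Source that starts with an explicit coding declaration.
with_declaration = b"# -*- coding: latin-1 -*-\nname = 'x'\n"
# Source with no declaration falls back to the interpreter default.
without_declaration = b"name = 'x'\n"

print(declared_encoding(with_declaration))
print(declared_encoding(without_declaration))
```

The declaration only tells the interpreter how to decode the bytes of the file, so the editor must actually save the file in the encoding that the comment names.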
