Scrapy Python spider: storing results in Latin-1 instead of unicode

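Currently my spider fetches the results I need, but they come back encoded as unicode (UTF-8, I believe). When I save these results to a CSV file, I have a lot of cleanup to do, including all the [u' and other characters that Scrapy inserts.

How exactly would I store the results as Latin characters rather than unicode? Where exactly do I need to make the change?

Thanks. -TM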

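The extracted value item_extracted is of type unicode. You can encode it to Latin-1 at the point where it is extracted (in the parse function), or you can do the encoding in an item pipeline or an output processor.

The simplest approach is to add this line to your parse function: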

item_to_be_stored = item_extracted.encode('latin-1','ignore')
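
To make the effect of that line concrete, here is a minimal sketch. The `extracted` list is a hypothetical stand-in for what the selector returned in the spider, and the sample text is made up. The snippet is written so that it runs on both Python 2 and Python 3; on Python 3 the encoded result is a `bytes` object rather than a `str`.

```python
# Hypothetical stand-in for the list of unicode strings extracted in parse().
extracted = [u'Caf\xe9 \u20ac5']

# Take the single string out of the list before storing it; writing the
# list itself is one way the [u'...'] wrapper can end up in the output.
item_extracted = extracted[0]

# 'ignore' silently drops characters Latin-1 cannot represent (here the euro sign).
item_to_be_stored = item_extracted.encode('latin-1', 'ignore')

# 'replace' keeps a visible '?' placeholder instead of dropping the character.
item_with_placeholders = item_extracted.encode('latin-1', 'replace')
```

Whether 'ignore' or 'replace' is the better choice depends on whether silently losing characters is acceptable for the data being scraped.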

Alternatively, you can define a function next to your item class and use it as an output processor.

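from scrapy.item import Item, Field
from scrapy.utils.python import unicode_to_str

def u_to_str(text):
    # Encode unicode text to a Latin-1 byte string, dropping unencodable characters
    return unicode_to_str(text, 'latin-1', 'ignore')

class YourItem(Item):
    # Pass the function itself; do not call it here
    name = Field(output_processor=u_to_str)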
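
The item pipeline option mentioned above could look roughly like the following sketch. The class name `Latin1Pipeline` is made up for illustration; the only Scrapy convention it relies on is that a pipeline is a plain class with a `process_item(self, item, spider)` method. It assumes fields may hold either a single string or a list of strings, and it leaves non-text values alone.

```python
class Latin1Pipeline(object):
    """Hypothetical pipeline: encodes every text field of an item to Latin-1."""

    encoding = 'latin-1'

    def process_item(self, item, spider):
        # Copy the keys first so values can be reassigned while looping.
        for key in list(item.keys()):
            item[key] = self._encode(item[key])
        return item

    def _encode(self, value):
        # Extracted fields are often lists of strings, so handle them element-wise.
        if isinstance(value, (list, tuple)):
            return [self._encode(v) for v in value]
        # type(u'') is unicode on Python 2 and str on Python 3.
        if isinstance(value, type(u'')):
            return value.encode(self.encoding, 'ignore')
        # Numbers, None and already-encoded byte strings pass through unchanged.
        return value
```

To take effect in a project, a class like this would have to be listed in the ITEM_PIPELINES setting.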

If your problem is what you describe, then the solution is as simple as converting to a string.

>>> a = u'spam and eggs'
>>> a
u'spam and eggs'
>>> type(a)
<type 'unicode'>
>>> b = str(a)
>>> b
'spam and eggs'
>>> type(b)
<type 'str'>

Edit: since the conversion can raise an exception, it is best to wrap it in a try/except:

try:
    str(a)
except UnicodeError:
    print "Skipping string %s" % a
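
The same skip-on-failure idea can be combined with an explicit Latin-1 target instead of the default codec that str() uses. The helper below is a sketch; the name `to_latin1_or_none` is made up, and it runs on both Python 2 and Python 3 (returning `bytes` on Python 3).

```python
def to_latin1_or_none(text):
    """Return text encoded as Latin-1, or None if it cannot be encoded."""
    try:
        # No error handler is given, so unencodable characters raise
        # UnicodeEncodeError, which is a subclass of UnicodeError.
        return text.encode('latin-1')
    except UnicodeError:
        return None


encodable = to_latin1_or_none(u'caf\xe9')    # fits in Latin-1
skipped = to_latin1_or_none(u'\u20ac 5')     # the euro sign does not
```

Returning None lets the caller decide whether to skip the value, log it, or fall back to a lossy encode with 'ignore'.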
