Removing special characters with Scrapy

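I am scraping a page written in Danish, and my output is wrong: it contains many special characters such as (Ã¥, Ã, Ã¥, Ã¦) that do not match the characters shown on the page.

How can I scrape the text exactly as it appears on the page?

Example link: https://novaindex.com/dk/leverandoerer/mode-og-tekstiler/arbejdstoej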

import scrapy


class MainSpider(scrapy.Spider):
    name = 'main'
    start_urls = ['https://novaindex.com/dk/leverandoerer/mode-og-tekstiler/arbejdstoej']

    def parse(self, response):
        # Each company entry on the page is an <a class="companyresult "> element.
        details = response.xpath('//a[@class="companyresult "]')
        for each in details:
            name = each.xpath('normalize-space(.//span[@class="name"]/text())').get()
            street = each.xpath('normalize-space(.//span[@class="street"]/text())').get()
            city = each.xpath('normalize-space(.//span[@class="city"]/text())').get()
            phone = each.xpath('normalize-space(.//span[@class="phone"]/text())').get()
            yield {
                "Name": name,
                "Street Address": street,
                "City Address": city,
                "Phone": phone,
            }

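The codec for Danish is cp865; see the list of standard encodings in the Python documentation for all available codecs.

Note: use ascii only if your site is in English.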

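def string_cleaner(rouge_text):
    # Strip surrounding whitespace, then round-trip through cp865,
    # dropping every character that cp865 cannot represent.
    return rouge_text.strip().encode('cp865', 'ignore').decode('cp865')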

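Passing 'ignore' makes encode() silently drop any character that cp865 cannot represent instead of raising an error.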
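To make the effect of 'ignore' concrete, here is a minimal sketch that runs the cleaner on a few strings. The sample strings are made up for illustration and are not taken from the page being scraped. Note what it shows: characters that cp865 covers (å, æ, ø) pass through unchanged, while anything outside cp865 is deleted rather than converted.

```python
def string_cleaner(rouge_text):
    # Same round trip as above: unencodable characters are dropped.
    return rouge_text.strip().encode("cp865", "ignore").decode("cp865")


# Hypothetical sample values, chosen only to illustrate the behaviour.
danish_text = "  Arbejdstøj på Sjælland  "
garbled_text = "Ã¥"          # neither 'Ã' nor '¥' exists in cp865
mixed_text = "tøj €"         # '€' does not exist in cp865 either

clean_danish = string_cleaner(danish_text)
clean_garbled = string_cleaner(garbled_text)
clean_mixed = string_cleaner(mixed_text)

print(repr(clean_danish))
print(repr(clean_garbled))
print(repr(clean_mixed))
```

So the function removes the stray characters; it does not turn them back into the Danish letters they originally stood for.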

Usage

yield {
    "Name": string_cleaner(name),
    ...
}

For a fuller explanation, see my breakdown of the code.

You can add .encode('utf8') after get(). getall() returns a list, so there each item has to be encoded individually.
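A small sketch of that suggestion, using plain Python values as stand-ins for what the selectors return (the variable names and sample strings are assumptions, not output from the real page). It shows that encoding gives bytes, and that a list from getall() needs a comprehension:

```python
# Stand-in for the str returned by .get()
name = "Arbejdstøj på Sjælland"
# Stand-in for the list of str returned by .getall()
names = ["Arbejdstøj", "Sjælland"]

# A single string can be encoded directly.
name_bytes = name.encode("utf8")

# A list has no .encode(), so encode each element.
names_bytes = [n.encode("utf8") for n in names]

print(type(name_bytes).__name__)
print(len(names_bytes))
```

The result is a bytes object rather than text, so whatever writes the output has to decode it as UTF-8 again.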

Scrapy extracts data as unicode strings, so it may help to read up on Unicode and UTF-8:

What is a unicode string?
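As an illustration of the difference between a unicode string and its UTF-8 bytes, the sketch below reproduces the kind of characters shown in the question. It assumes the garbling comes from UTF-8 bytes being read as Latin-1, which the thread does not confirm, although the pattern matches.

```python
# Danish letters as a unicode string.
original = "å æ ø"

# Their UTF-8 representation: two bytes per letter.
utf8_bytes = original.encode("utf-8")

# Reading those bytes with the wrong codec yields the garbled form.
garbled = utf8_bytes.decode("latin-1")

# Under the same assumption, the mistake can be reversed.
restored = garbled.encode("latin-1").decode("utf-8")

print(garbled)
print(restored)
```

If this is what happened, the scraped strings themselves are fine and the problem lies in how the output file is being read.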
