Error when trying to write Unicode results from web scraping to a CSV file



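I am using a web scraping script (found on GitHub) and writing the results to a .csv file. Some of the results (user reviews) are written in Japanese or Russian, so I want to write Unicode to the .csv file.

When I use only the csv module, the code runs fine, but it does not write Unicode to the CSV.

This is the part of the code I use for the web scraping: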

with open(datafile, 'w', newline='', encoding='utf8') as csvfile:
    # Comma-delimited; the csv module quotes fields that contain special characters
    datawriter = csv.writer(csvfile, delimiter=',')
    print('Processing..')
    for i in range(1,pages+1):
        # Sleep if throttle enabled
        if(throttle): time.sleep(sleepTime)
        page = requests.get(reviewPage + '&page=' + str(i))
        tree = html.fromstring(page.content)
        # Each item below scrapes a page's review titles, bodies, ratings and languages.
        titles = tree.xpath('//a[@class="review-title-link"]')
        bodies = tree.xpath('//div[@class="review-body"]')
        ratings = tree.xpath('//div[@data-status]')
        langs = tree.xpath("//h3[starts-with(@class, 'review-title')]")
        dates = tree.xpath("//time[@datetime]")
        for idx,e in enumerate(bodies):
            # Title of comment
            title = titles[idx].text_content()
            # Body of comment
            body = e.text_content().strip()
            # The rating is the 5th from last element
            rating = ratings[idx].get('data-status').split(' ')[-5]
            # Language is 2nd element of h3 tag
            lang = langs[idx].get('class').split(' ')[1]
            # Date
            date = dates[idx].get("datetime").split('T')[0]
            datawriter.writerow([title,body,rating,lang,date])
    print('Processed ' + str(ratingCount) + '/' + str(ratingCount) + ' ratings.. Finished!')

I tried import unicodecsv as csv, but that raised a TypeError:

TypeError                                 Traceback (most recent call last)
<ipython-input-4-2db937260285> in <module>()
     44             date = dates[idx].get("datetime").split('T')[0]
     45
---> 46             datawriter.writerow([title,body,rating,lang,date])
     47     print('Processed ' + str(ratingCount) + '/' + str(ratingCount) + ' ratings.. Finished!')

~\lib\site-packages\unicodecsv\py3.py in writerow(self, row)
     26
     27     def writerow(self, row):
---> 28         return self.writer.writerow(row)
     29
     30     def writerows(self, rows):

C:\Users\Ebel\Anaconda3\lib\site-packages\unicodecsv\py3.py in write(self, string)
     13
     14     def write(self, string):
---> 15         return self.binary.write(string.encode(self.encoding, self.errors))
     16
     17

TypeError: write() argument must be str, not bytes

I would like to fix this. Thanks in advance!

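Because unicodecsv writes bytes rather than strings, you need to open() the file in binary mode. Note that binary mode takes neither an encoding nor a newline argument, so both must be removed (Python 3 raises a ValueError otherwise).

with open(datafile, 'w', newline='', encoding='utf8') as csvfile:

then becomes:

with open(datafile, 'wb') as csvfile:

The b in 'wb' means that you are writing bytes rather than strings.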
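To make the cause of the TypeError concrete, here is a minimal sketch that uses only the standard library, so unicodecsv itself is not involved. The scratch file path and the sample payload are hypothetical and exist only for this demonstration. It writes the same encoded bytes once to a text-mode file and once to a binary-mode file.

```python
import os
import tempfile

# Hypothetical scratch file, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "demo.csv")

# Encoded bytes, comparable to what a bytes-producing writer hands to the file.
payload = "評価,отзыв\r\n".encode("utf-8")

# Text mode: the file object accepts only str, so writing bytes fails.
try:
    with open(path, "w", newline="", encoding="utf8") as f:
        f.write(payload)
    text_mode_error = None
except TypeError as exc:
    text_mode_error = str(exc)

# Binary mode: bytes are accepted; no encoding or newline argument is given.
with open(path, "wb") as f:
    f.write(payload)

# Read the raw bytes back to confirm they were stored unchanged.
with open(path, "rb") as f:
    round_trip = f.read()

print(text_mode_error)
print(round_trip == payload)
```

The text-mode attempt fails in the same way as the traceback in the question, while the binary-mode write stores the bytes unchanged.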

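Turning my comment into an answer.

Your with statement is correct for Python 3; unicodecsv is only needed on Python 2. Just import csv (use the built-in module). On Windows, use encoding='utf-8-sig'. Without the BOM signature, Windows Notepad will not display UTF-8 files correctly, and Excel will not read them correctly either.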
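As a rough sketch of that suggestion, the snippet below writes Japanese and Russian text with the built-in csv module and encoding='utf-8-sig', then checks that the file begins with the UTF-8 BOM. The sample rows and the temporary file path are made up for illustration and are not taken from the scraper.

```python
import csv
import os
import tempfile

# Hypothetical sample rows standing in for scraped reviews.
rows = [
    ["title", "body", "rating", "lang", "date"],
    ["素晴らしい", "とても良いアプリです", "5", "ja", "2018-01-01"],
    ["Отлично", "Очень хорошее приложение", "4", "ru", "2018-01-02"],
]

# Hypothetical output location for this demonstration.
datafile = os.path.join(tempfile.mkdtemp(), "reviews.csv")

# Text mode with newline='' as the csv module expects; utf-8-sig prepends a BOM.
with open(datafile, "w", newline="", encoding="utf-8-sig") as csvfile:
    datawriter = csv.writer(csvfile, delimiter=",")
    datawriter.writerows(rows)

# The first three bytes of the file should be the UTF-8 byte order mark.
with open(datafile, "rb") as f:
    starts_with_bom = f.read(3) == b"\xef\xbb\xbf"

# Reading with utf-8-sig strips the BOM again, so the rows round-trip.
with open(datafile, newline="", encoding="utf-8-sig") as csvfile:
    read_back = list(csv.reader(csvfile))

print(starts_with_bom, read_back == rows)
```

Reading the file back with the same encoding keeps the BOM out of the first field, which matters if the CSV is later processed in Python again.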
