Reading Russian data from a CSV file



I have some Russian-language data in a CSV file:

2-комнатная квартира РДТ',  мкр Тастак-3,  Аносова — Толе би;Алматы
2-комнатная квартира БГР',  мкр Таугуль,  Дулати (Навои) — Токтабаева;Алматы
2-комнатная квартира ЦФМ',  мкр Тастак-2,  Тлендиева — Райымбека;Алматы

The delimiter is the ; character.


I want to read the data and put it into an array. I tried to read it with the following code:

def loadCsv(filename):
    lines = csv.reader(open(filename, "rb"),delimiter=";" )
    dataset = list(lines)
    for i in range(len(dataset)):
        dataset[i] = [str(x) for x in dataset[i]]
    return dataset

Then I read the file and print the result:

mydata = loadCsv('krish(csv3).csv')
print mydata

Output:

[['2-\xea\xee\xec\xed\xe0\xf2\xed\xe0\xff \xea\xe2\xe0\xf0\xf2\xe8\xf0\xe0,  \xec\xea\xf0 \xd2\xe0\xf1\xf2\xe0\xea-3,  \xc0\xed\xee\xf1\xee\xe2\xe0 \x97 \xd2\xee\xeb\xe5 \xe1\xe8', '\xc0\xeb\xec\xe0\xf2\xfb'], ['2-\xea\xee\xec\xed\xe0\xf2\xed\xe0\xff \xea\xe2\xe0\xf0\xf2\xe8\xf0\xe0,  \xec\xea\xf0 \xd2\xe0\xf3\xe3\xf3\xeb\xfc,  \xc4\xf3\xeb\xe0\xf2\xe8 (\xcd\xe0\xe2\xee\xe8) \x97 \xd2\xee\xea\xf2\xe0\xe1\xe0\xe5\xe2\xe0', '\xc0\xeb\xec\xe0\xf2\xfb'], ['2-\xea\xee\xec\xed\xe0\xf2\xed\xe0\xff \xea\xe2\xe0\xf0\xf2\xe8\xf0\xe0,  \xec\xea\xf0 \xd2\xe0\xf1\xf2\xe0\xea-2,  \xd2\xeb\xe5\xed\xe4\xe8\xe5\xe2\xe0 \x97 \xd0\xe0\xe9\xfb\xec\xe1\xe5\xea\xe0', '\xc0\xeb\xec\xe0\xf2\xfb']]

I found out that codecs are needed in this case, and tried to do the same thing with this code:

import codecs
with codecs.open('krish(csv3).csv','r',encoding='utf8') as f:
    text = f.read()
print text

I got this error:

newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xea in position 2: invalid continuation byte

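What is the problem? When using codecs, how do I specify the delimiter in my data? I just want to read the data from the file and put it into a two-dimensional array.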

\xea is the windows-1251 / cp5347 encoding of к. Therefore, you need to decode with windows-1251, not UTF-8.
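To see this concretely, here is a minimal sketch (written for Python 3, which is an assumption; the question uses Python 2) that decodes the first bytes shown in the question's output both ways. The variable names are illustrative only.

```python
# The first bytes of the file, copied from the output in the question.
raw = b"2-\xea\xee\xec\xed\xe0\xf2\xed\xe0\xff"

# Decoding as windows-1251 (Python codec name: cp1251) yields readable Russian.
decoded = raw.decode("cp1251")
print(decoded)

# Decoding as UTF-8 fails at the byte 0xea, the one named in the traceback.
utf8_error = None
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    utf8_error = exc
    print(exc.start, hex(raw[exc.start]))
```

The offset reported by the exception matches "position 2" in the error message from the question.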

In Python 2.7, the csv library does not support Unicode properly. See the "Unicode" section of https://docs.python.org/2/library/csv.html

The documentation proposes a simple workaround:

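import codecs
import csv

class UTF8Recoder:
    """Iterator that reads a stream in the given encoding and re-encodes it to UTF-8."""
    def __init__(self, f, encoding):
        self.reader = codecs.getreader(encoding)(f)
    def __iter__(self):
        return self
    def next(self):
        return self.reader.next().encode("utf-8")

class UnicodeReader:
    """
    A CSV reader which will iterate over lines in the CSV file "f",
    which is encoded in the given encoding.
    """
    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        f = UTF8Recoder(f, encoding)
        self.reader = csv.reader(f, dialect=dialect, **kwds)
    def next(self):
        row = self.reader.next()
        return [unicode(s, "utf-8") for s in row]
    def __iter__(self):
        return self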

This lets you write:

def loadCsv(filename):
    lines = UnicodeReader(open(filename, "rb"), delimiter=";", encoding="windows-1251")
    # If you really need a list, uncomment the next line.
    # A list lets you access a specific row by index, e.g. `line_12 = lines[12]`.
    # return list(lines)
    # Otherwise this returns an iterator, so rows are read lazily as you loop.
    # Use this if you only need `for row in lines`.
    return lines

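If you print the dataset itself, you get the representation of a list of lists, where the outer list holds the rows and each inner list holds the columns. Any encoded bytes or non-ASCII characters are shown as \x or \u escapes. To print the actual values, do: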

for csv_line in loadCsv("myfile.csv"):
    print u", ".join(csv_line)

If you need to write the results to another file (a fairly typical case), you can do this:

import io
with io.open("my_output.txt", "w", encoding="utf-8") as my_output:
    for csv_line in loadCsv("myfile.csv"):
        # add a newline so each CSV row ends up on its own line
        my_output.write(u", ".join(csv_line) + u"\n")

This automatically converts/encodes your output to UTF-8.
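For comparison, in Python 3 the csv module works with text natively, so the UnicodeReader workaround is not needed. The following is a minimal sketch under that assumption; `load_csv` and the temporary sample file are hypothetical names used only for illustration.

```python
import csv
import os
import tempfile


def load_csv(filename, encoding="cp1251", delimiter=";"):
    """Return the rows of the file as a list of lists of str."""
    # newline="" lets the csv module handle line endings itself.
    with open(filename, encoding=encoding, newline="") as source_file:
        return list(csv.reader(source_file, delimiter=delimiter))


# Demo: write a small windows-1251 file, then read it back.
sample = "2-комнатная квартира, мкр Тастак-3;Алматы\n"
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.csv")
    with open(path, "wb") as f:
        f.write(sample.encode("cp1251"))
    rows = load_csv(path)

print(rows)
```

The result is the two-dimensional list the question asks for: one inner list per row, one string per column.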

Could you try this:

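import pandas as pd
# pass the encoding by keyword; the file in the question uses ";" as its delimiter
pd.read_csv(path_file, sep=";", encoding="cp1251")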

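import csv
# Python 3: open() accepts the encoding directly
with open(path_file, encoding="cp1251", errors="ignore", newline="") as source_file:
    reader = csv.reader(source_file, delimiter=";")
    rows = list(reader)  # read the rows while the file is still open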

Could your .csv be in an encoding other than UTF-8? (Judging by the error message, it must be.) Try other Cyrillic encodings such as Windows-1251, CP866, or KOI8.
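Since these single-byte encodings will decode almost any byte sequence without raising an error, "try each one" needs some way to judge the result. The sketch below (Python 3) uses one rough heuristic, which is an assumption of this example and not a standard algorithm: pick the candidate whose decoded text has the most lowercase Cyrillic letters. The function names are hypothetical.

```python
CANDIDATES = ("cp1251", "cp866", "koi8_r")


def lowercase_cyrillic_count(text):
    """Count lowercase Russian letters in the text."""
    return sum(1 for ch in text if "а" <= ch <= "я" or ch == "ё")


def guess_cyrillic_encoding(raw, candidates=CANDIDATES):
    """Return (best_candidate, scores) for the given bytes."""
    scores = {}
    for name in candidates:
        # errors="replace" guards against the few byte values a codec leaves undefined.
        text = raw.decode(name, errors="replace")
        scores[name] = lowercase_cyrillic_count(text)
    # On a tie, max() keeps the earliest candidate in the tuple.
    best = max(candidates, key=lambda name: scores[name])
    return best, scores


raw = "2-комнатная квартира;Алматы".encode("cp1251")
best, scores = guess_cyrillic_encoding(raw)
print(best, scores)
```

This is only a guess for ordinary Russian prose; very short or all-uppercase input can fool it, so check the decoded text by eye before relying on the result.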

In Python 3:

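import csv
path = 'C:/Users/me/Downloads/sv.csv'
# set encoding to match the file, e.g. "cp1251" for the data in the question
with open(path, encoding="UTF8", newline="") as f:
    # pass delimiter=";" here if the file is semicolon-separated
    reader = csv.reader(f)
    for row in reader:
        print(row)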
