I want to get a list of the values in the "category" column. This is my tsv file:
Tagname text category
j245qzx_8 hamburger toppings f
h833uio_7 side of fries f
d423jin_2 milkshake combo d
This is my code:
import pandas as pd

with open(filename, 'r') as f:
    df = pd.read_csv(f, sep='\t')
categoryColumn = df["category"]
categoryList = []
for line in categoryColumn:
    categoryList.append(line)
However, on the line df = pd.read_csv(f, sep='\t') I get a UnicodeDecodeError, and my code stops there:
File "/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 678, in parser_f
return _read(filepath_or_buffer, kwds)
File "/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 440, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 787, in __init__
self._make_engine(self.engine)
File "/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 1014, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File "/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 1708, in __init__
self._reader = parsers.TextReader(src, **kwds)
File "pandas/_libs/parsers.pyx", line 539, in pandas._libs.parsers.TextReader.__cinit__
File "pandas/_libs/parsers.pyx", line 737, in pandas._libs.parsers.TextReader._get_header
File "pandas/_libs/parsers.pyx", line 932, in pandas._libs.parsers.TextReader._tokenize_rows
File "pandas/_libs/parsers.pyx", line 2101, in pandas._libs.parsers.raise_parser_error
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 898: invalid start byte
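Any ideas why this happens or how to fix it? There don't appear to be any special characters in my tsv, so I'm not sure what is causing this or what to do about it.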
The fix
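So, just from reading this post, I think I see what went wrong.
You are using Python's open() to get a file handle and passing that handle to Pandas' read_csv(). In that setup it is open() that determines the file's encoding: with no encoding argument it falls back to the platform default, which is UTF-8 here judging by the error message.
So, try setting the encoding in open(), like this: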
with open(filename, 'r', encoding='windows-1252') as f:
    df = pd.read_csv(f, sep='\t')
categoryColumn = df["category"]
categoryList = []
for line in categoryColumn:
    categoryList.append(line)
Or, don't use open() at all:
df = pd.read_csv(filename, sep='\t', encoding='windows-1252')
categoryColumn = df["category"]
categoryList = []
for line in categoryColumn:
    categoryList.append(line)
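If you are not sure that Windows-1252 is the right guess, one option is to try a short list of candidate encodings and keep the first one that decodes the whole file. The sketch below is only an illustration: the helper name first_working_encoding and the candidate list are my own assumptions, not part of pandas. It uses only the standard library and checks whether the bytes decode, not whether the resulting text is what the file's author intended.

```python
from pathlib import Path


def first_working_encoding(path, candidates=("utf-8", "windows-1252")):
    """Return the first candidate encoding that decodes the whole file.

    Hypothetical helper for illustration. Put stricter encodings such as
    UTF-8 first, because single-byte encodings accept almost any input.
    """
    raw = Path(path).read_bytes()
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue  # this candidate cannot decode the file, try the next
        return name
    raise ValueError(f"none of {candidates} could decode {path}")
```

The result can then be passed on, for example pd.read_csv(filename, sep='\t', encoding=first_working_encoding(filename)). As a side note, df["category"].tolist() gives the same list as the explicit loop above.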
Some backstory
I echoed the byte \x89 onto the end of the sample and then ran Python's chardetect utility, which suggested the file is Windows-1252:
% echo -e '\x89' >> sample.csv
% cat sample.csv
Tagname text category
j245qzx_8 hamburger toppings f
h833uio_7 side of fries f
d423jin_2 milkshake combo d
�
% which chardetect
/Library/Frameworks/Python.framework/Versions/3.9/bin/chardetect
% chardetect sample.csv
sample.csv: Windows-1252 with confidence 0.73
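To see why that one byte is enough to trigger the error, here is a small standard-library check (a sketch, separate from the chardetect run above). The byte 0x89 cannot start a UTF-8 sequence, while Windows-1252 maps it to the per mille sign.

```python
raw = b"\x89"

# Decoding as UTF-8 fails with the same reason shown in the traceback.
try:
    raw.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = exc.reason

# Windows-1252 maps 0x89 to U+2030 (PER MILLE SIGN).
as_cp1252 = raw.decode("windows-1252")

print(utf8_error)
print(repr(as_cp1252))
```

A character like this is easy to miss when skimming a file, which may be why the tsv looked free of special characters.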