Load one column of a TSV file into a Python list



I want to build a list from the "category" column. This is my tsv file:

Tagname   text  category
j245qzx_8   hamburger toppings   f
h833uio_7   side of fries   f
d423jin_2   milkshake combo   d

This is my code:

import pandas as pd

with open(filename, 'r') as f:
    df = pd.read_csv(f, sep='\t')
categoryColumn = df["category"]
categoryList = []
for line in categoryColumn:
    categoryList.append(line)

However, on the line df = pd.read_csv(f, sep='\t') I get a UnicodeDecodeError, and my code gets no further:

File "/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 678, in parser_f
return _read(filepath_or_buffer, kwds)
File "/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 440, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 787, in __init__
self._make_engine(self.engine)
File "/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 1014, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File "/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 1708, in __init__
self._reader = parsers.TextReader(src, **kwds)
File "pandas/_libs/parsers.pyx", line 539, in pandas._libs.parsers.TextReader.__cinit__
File "pandas/_libs/parsers.pyx", line 737, in pandas._libs.parsers.TextReader._get_header
File "pandas/_libs/parsers.pyx", line 932, in pandas._libs.parsers.TextReader._tokenize_rows
File "pandas/_libs/parsers.pyx", line 2101, in pandas._libs.parsers.raise_parser_error
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 898: invalid start byte

Any ideas why this happens, or how to fix it? There don't seem to be any special characters in my tsv, so I'm not sure what is causing this or what to do about it.

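The fix

Just from reading this, I think I can see what went wrong.

You're getting a file handle from Python's open() and passing it to pandas' read_csv(). Because the handle is already open in text mode, it is open(), not read_csv(), that decides how the file's bytes are decoded. Going by the error message, that is UTF-8 in your case, and the file contains a byte (0x89) that is not valid UTF-8.

So try setting the encoding in open(), like this: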

import pandas as pd

# Tell open() how to decode the bytes; read_csv() then receives decoded text.
with open(filename, 'r', encoding='windows-1252') as f:
    df = pd.read_csv(f, sep='\t')
categoryColumn = df["category"]
categoryList = []
for line in categoryColumn:
    categoryList.append(line)

Or don't use open() at all:

import pandas as pd

# Pass the path directly and let read_csv() handle the decoding.
df = pd.read_csv(filename, sep='\t', encoding='windows-1252')
categoryColumn = df["category"]
categoryList = []
for line in categoryColumn:
    categoryList.append(line)
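
For a self-contained way to try the second approach, here is a minimal sketch. It writes the sample rows to a temporary file as Windows-1252 bytes, reads them back, and uses Series.tolist() in place of the explicit loop. The helper name load_category and the "‰" character placed in the data (byte 0x89 in Windows-1252) are assumptions made for illustration; the real file may contain something else.

```python
import os
import tempfile

import pandas as pd

# Sample rows from the question. The "‰" (byte 0x89 in Windows-1252) is an
# assumed stand-in for whatever character is really in the failing file.
SAMPLE = (
    "Tagname\ttext\tcategory\n"
    "j245qzx_8\thamburger toppings ‰\tf\n"
    "h833uio_7\tside of fries\tf\n"
    "d423jin_2\tmilkshake combo\td\n"
)


def load_category(path, encoding="windows-1252"):
    """Read a TSV file and return its 'category' column as a plain list."""
    df = pd.read_csv(path, sep="\t", encoding=encoding)
    return df["category"].tolist()


# Write the sample as Windows-1252 bytes to a temporary file.
tmp = tempfile.NamedTemporaryFile(suffix=".tsv", delete=False)
tmp.write(SAMPLE.encode("windows-1252"))
tmp.close()

category_list = load_category(tmp.name)
print(category_list)  # ['f', 'f', 'd']

os.remove(tmp.name)
```

Series.tolist() returns the column values as an ordinary Python list, so the explicit for loop is not needed unless each value has to be transformed along the way.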

Some backstory

I echoed the byte \x89 onto the end of the sample, then ran Python's chardetect utility, which suggested the file is Windows-1252:

% echo -e '\x89' >> sample.csv
% cat sample.csv 
Tagname text    category
j245qzx_8       hamburger toppings      f
h833uio_7       side of fries   f
d423jin_2       milkshake combo d
�
% which chardetect
/Library/Frameworks/Python.framework/Versions/3.9/bin/chardetect
% chardetect sample.csv 
sample.csv: Windows-1252 with confidence 0.73
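
The same behaviour can be reproduced with nothing but the standard library. The sketch below uses a made-up byte string (not taken from the real file) that ends in 0x89: decoding it as UTF-8 raises the same kind of error as in the traceback, while decoding it as Windows-1252 succeeds. chardetect's answer is only a guess (confidence 0.73 above), and other single-byte encodings would also decode this byte, just to a different character.

```python
# Hypothetical bytes: ASCII text followed by the offending byte 0x89.
raw = b"milkshake combo \x89"

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError as exc:
    # 0x89 is a UTF-8 continuation byte, so it cannot start a character.
    utf8_ok = False
    print(exc)

# In Windows-1252, 0x89 maps to the per mille sign (U+2030).
decoded = raw.decode("windows-1252")
print(decoded)
```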
