Python pandas raises KeyError: 'DATE' when accessing a column of a large dataset



I have a file with 3,502,379 rows and 3 columns. The following script should run, but it raises an error on the date-handling line:

import matplotlib.pyplot as plt
import numpy as np
import csv
import pandas
path = 'data_prices.csv'
data = pandas.read_csv(path, sep=';')
data['DATE'] = pandas.to_datetime(data['DATE'], format='%Y%m%d')

This is the error that occurs:

Traceback (most recent call last):
  File "C:\Program Files\Python35\lib\site-packages\pandas\indexes\base.py", line 1945, in get_loc
    return self._engine.get_loc(key)
  File "pandas\index.pyx", line 137, in pandas.index.IndexEngine.get_loc (pandas\index.c:4066)
  File "pandas\index.pyx", line 159, in pandas.index.IndexEngine.get_loc (pandas\index.c:3930)
  File "pandas\hashtable.pyx", line 675, in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:12408)
  File "pandas\hashtable.pyx", line 683, in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:12359)
KeyError: 'DATE'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "C:datascript.py", line 15, in <module>
    data['DATE'] = pandas.to_datetime(data['DATE'], format='%Y%m%d')
  File "C:\Program Files\Python35\lib\site-packages\pandas\core\frame.py", line 1997, in __getitem__
    return self._getitem_column(key)
  File "C:\Program Files\Python35\lib\site-packages\pandas\core\frame.py", line 2004, in _getitem_column
    return self._get_item_cache(key)
  File "C:\Program Files\Python35\lib\site-packages\pandas\core\generic.py", line 1350, in _get_item_cache
    values = self._data.get(item)
  File "C:\Program Files\Python35\lib\site-packages\pandas\core\internals.py", line 3290, in get
    loc = self.items.get_loc(item)
  File "C:\Program Files\Python35\lib\site-packages\pandas\indexes\base.py", line 1947, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas\index.pyx", line 137, in pandas.index.IndexEngine.get_loc (pandas\index.c:4066)
  File "pandas\index.pyx", line 159, in pandas.index.IndexEngine.get_loc (pandas\index.c:3930)
  File "pandas\hashtable.pyx", line 675, in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:12408)
  File "pandas\hashtable.pyx", line 683, in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:12359)
KeyError: 'DATE'

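The '\ufeffDATE' in the first column name indicates that your CSV file starts with a Unicode byte order mark (BOM) signature, the character U+FEFF, which was read as part of the header. The column is therefore named '\ufeffDATE' rather than 'DATE', which is why the lookup raises a KeyError. The file must be read with an encoding that consumes the BOM.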
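To illustrate how a BOM ends up inside the first column name, here is a minimal standard-library sketch. The sample bytes and the PRICE column are made up for illustration; only the DATE column and the ';' separator come from the question. It decodes the same bytes twice and compares the resulting header.

```python
import codecs
import csv
import io

# Hypothetical sample: a small UTF-8 CSV that starts with a BOM.
raw = codecs.BOM_UTF8 + b"DATE;PRICE\n20160104;101.5\n"

# Decoding as plain 'utf-8' keeps the BOM as the character U+FEFF,
# so it becomes part of the first header field.
header_plain = next(csv.reader(io.StringIO(raw.decode("utf-8")), delimiter=";"))

# Decoding as 'utf-8-sig' strips a leading BOM if one is present.
header_sig = next(csv.reader(io.StringIO(raw.decode("utf-8-sig")), delimiter=";"))

print(repr(header_plain[0]))  # '\ufeffDATE'
print(repr(header_sig[0]))    # 'DATE'
```

With the BOM left in place, a lookup by 'DATE' cannot match, because the stored name is one character longer.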

So try this when reading the CSV file:

df = pd.read_csv(path, sep=';', encoding='utf-8-sig')

Or, as @EdChum suggested:

df = pd.read_csv(path, sep=';', encoding='utf-16')

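Which of the two variants works depends on the file's actual encoding: 'utf-8-sig' is for a UTF-8 file with a BOM, while 'utf-16' is for a UTF-16 file.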
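If you are unsure which encoding applies, one option is to look at the first bytes of the file and choose accordingly. The sketch below is only an illustration: `encoding_from_bom` and `sniff_encoding` are hypothetical helper names, not part of pandas, and the fallback to 'utf-8' when no BOM is found is an assumption.

```python
import codecs


def encoding_from_bom(head: bytes, default: str = "utf-8") -> str:
    """Map the leading bytes of a file to a codec name that consumes the BOM."""
    # UTF-32 is checked first: its little-endian BOM starts with the
    # same two bytes as the UTF-16 little-endian BOM.
    if head.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32"
    if head.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # No BOM found: fall back to the caller's default (an assumption).
    return default


def sniff_encoding(path: str, default: str = "utf-8") -> str:
    """Read the first four bytes of the file and pick an encoding from them."""
    with open(path, "rb") as f:
        return encoding_from_bom(f.read(4), default)
```

The result can then be passed on, for example as pandas.read_csv(path, sep=';', encoding=sniff_encoding(path)).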

PS: this answer shows how to deal with a BOM.
