I am trying to remove duplicate time series data from a pandas DataFrame:
import numpy as np
import pandas as pd
# original data
df = pd.DataFrame()
np.random.seed(0)
days = pd.to_datetime(['2015-01-01', '2015-01-02', '2015-01-03', '2015-01-04', '2015-01-04', '2015-01-05', '2015-01-06', '2015-01-06', '2015-01-07', '2015-01-08'])
df = pd.DataFrame({'Date': days, 'col1': np.random.randn(len(days))})
df = df.set_index('Date')
#df = df.drop_duplicates(subset='df.index')
print(df)
# remove duplicates, keep first instance
n = np.where(df.index.duplicated())[0]
print(n)
df0 = df.drop(df.iloc[n.tolist()])
print(df0)
The drop_duplicates command did not work, so I tried using iloc, which results in the following error:
KeyError: "['col1'] not found in axis"
Try:
print(df[~df.index.duplicated()])
Prints:
                col1
Date
2015-01-01  1.764052
2015-01-02  0.400157
2015-01-03  0.978738
2015-01-04  2.240893
2015-01-05 -0.977278
2015-01-06  0.950088
2015-01-07 -0.103219
2015-01-08  0.410599
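A minimal sketch of why this works, and a likely reason the original attempt failed, assuming the same df as in the question. df.index.duplicated() returns a boolean array that is True for each repeat of an index label, so negating it keeps one row per date. The KeyError most likely arises because df.drop() received a DataFrame: iterating over a DataFrame yields its column labels, so drop() looked for 'col1' among the row labels. The variable names below (mask, first, last, labels_seen_by_drop) are illustrative only.

```python
import numpy as np
import pandas as pd

# Rebuild the frame from the question
np.random.seed(0)
days = pd.to_datetime(['2015-01-01', '2015-01-02', '2015-01-03', '2015-01-04',
                       '2015-01-04', '2015-01-05', '2015-01-06', '2015-01-06',
                       '2015-01-07', '2015-01-08'])
df = pd.DataFrame({'Date': days, 'col1': np.random.randn(len(days))}).set_index('Date')

# True for every repeat of an index label after its first occurrence
mask = df.index.duplicated(keep='first')
first = df[~mask]

# keep='last' marks the earlier occurrences instead, so the last row per date survives
last = df[~df.index.duplicated(keep='last')]

# Iterating over a DataFrame yields column labels, not rows;
# this is what drop() ends up looking for in the question's code
labels_seen_by_drop = list(df.iloc[np.where(mask)[0]])

print(first)
print(last)
print(labels_seen_by_drop)
```

Note that passing the duplicated labels to df.drop() directly would not help either, since drop() removes every row carrying a given label, including the first occurrence.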
You can use:
df.reset_index().drop_duplicates(subset='Date').set_index('Date')
Output:
                col1
Date
2015-01-01  1.764052
2015-01-02  0.400157
2015-01-03  0.978738
2015-01-04  2.240893
2015-01-05 -0.977278
2015-01-06  0.950088
2015-01-07 -0.103219
2015-01-08  0.410599
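As a quick sanity check, the column-based route and the index-mask route should produce the same frame. The sketch below assumes the same df as in the question; by_mask, by_column and same are names made up for the comparison.

```python
import numpy as np
import pandas as pd

# Rebuild the frame from the question
np.random.seed(0)
days = pd.to_datetime(['2015-01-01', '2015-01-02', '2015-01-03', '2015-01-04',
                       '2015-01-04', '2015-01-05', '2015-01-06', '2015-01-06',
                       '2015-01-07', '2015-01-08'])
df = pd.DataFrame({'Date': days, 'col1': np.random.randn(len(days))}).set_index('Date')

# Route 1: filter on the index directly
by_mask = df[~df.index.duplicated(keep='first')]

# Route 2: move the index into a column, deduplicate on it, move it back
by_column = (df.reset_index()
               .drop_duplicates(subset='Date', keep='first')
               .set_index('Date'))

# Compare values, dtypes and index of the two results
same = by_mask.equals(by_column)
print(same)
```

The mask version avoids the round trip through reset_index and set_index, which may matter on large frames; the column version can be easier to read if you already think of Date as a column.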