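I am trying to interpolate a pandas DataFrame that contains time series data. I have hourly data for temp, and I want to interpolate the temp values at the half-hour points. That gives me an estimate of temp for every trading period of every day: each day has 24 hours and therefore 48 trading periods.
My MWE is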
import numpy as np
import pandas as pd
from datetime import datetime, date, timedelta
import pyarrow as pa
import pyarrow.parquet as pq
# my dataset
df = pd.DataFrame()
d1 = '2020-10-21'
d2 = '2020-10-22'
df['date'] = pd.to_datetime([d1]*24+[d2]*24, format='%Y-%m-%d')
df['time'] = pd.date_range(d1, periods=len(df), freq='H').time
df['temp'] = pd.DataFrame((50+20*np.sin(np.linspace(0,0.91*np.pi,len(df))))).values
# combine time and date
df.loc[:,'datetime'] = pd.to_datetime(df.date.astype(str)+' '+df.time.astype(str))
df = df.drop(['date','time'], axis=1)
df = df.set_index('datetime')
# trading period
df['tp'] = pd.DataFrame(df.index.hour.values*2+1).values
# interpolate to find temp and datetime for trading periods 2,4,6,...
for n in df.tp.values:
    df.loc[-1,'tp'] = n+1
    df = df.sort_values('tp').reset_index(drop=True)
    #df = df.interpolate(method='linear')
print(df.head(10))
I am adapting the answer from this post, but I get the error TypeError: value should be a 'Timestamp' or 'NaT'. Got 'int' instead. I suspect it is caused by the line df.loc[-1,'tp'] = n+1, but I am not sure how to fix it.
Try:
df = df.resample('30T').mean().interpolate()
df['tp'] = ((df.index.hour * 60 + df.index.minute) / 30 + 1).astype(int)
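A self-contained sketch of this resample approach, rebuilt on the question's temperature series so it can be run on its own. The names hourly and half_hourly are illustrative only, and '30min' is used in place of '30T' on the assumption that a recent pandas is installed, where the T alias is deprecated.

```python
import numpy as np
import pandas as pd

# Rebuild the question's data: two days of hourly readings on a DatetimeIndex.
idx = pd.date_range("2020-10-21", periods=48, freq="h", name="datetime")
hourly = pd.DataFrame(
    {"temp": 50 + 20 * np.sin(np.linspace(0, 0.91 * np.pi, len(idx)))},
    index=idx,
)

# Upsample to 30-minute bins. mean() leaves NaN in the new, empty half-hour
# bins, and interpolate() then fills those gaps linearly.
half_hourly = hourly.resample("30min").mean().interpolate()

# Derive the trading period (1..48) from the time of day of each row.
minutes = half_hourly.index.hour * 60 + half_hourly.index.minute
half_hourly["tp"] = (minutes // 30 + 1).astype(int)

print(half_hourly.head())
```

Because tp is computed from the timestamps after resampling, it never goes through interpolation and stays an integer column.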
Try asfreq, then interpolate:
In [36]: df.asfreq('30T').interpolate()
Out[36]:
                          temp    tp
datetime
2020-10-21 00:00:00  50.000000   1.0
2020-10-21 00:30:00  50.607891   2.0
2020-10-21 01:00:00  51.215782   3.0
2020-10-21 01:30:00  51.821424   4.0
2020-10-21 02:00:00  52.427066   5.0
...                        ...   ...
2020-10-22 21:00:00  57.869280  43.0
2020-10-22 21:30:00  57.303145  44.0
2020-10-22 22:00:00  56.737010  45.0
2020-10-22 22:30:00  56.158416  46.0
2020-10-22 23:00:00  55.579822  47.0

[95 rows x 2 columns]
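One thing to watch with this approach: tp is interpolated like any other numeric column. The sketch below rebuilds the question's data (assuming tp is hour*2+1 as in the MWE), checks the half-hour row between the two days, and then recomputes tp from the index. The names out and midnight_gap are illustrative, and '30min' stands in for '30T' on the assumption of a recent pandas.

```python
import numpy as np
import pandas as pd

# Rebuild the question's frame: hourly temp plus the odd trading periods.
idx = pd.date_range("2020-10-21", periods=48, freq="h", name="datetime")
df = pd.DataFrame(
    {"temp": 50 + 20 * np.sin(np.linspace(0, 0.91 * np.pi, len(idx)))},
    index=idx,
)
df["tp"] = df.index.hour * 2 + 1

# asfreq inserts the missing half-hour rows as NaN; interpolate fills them.
out = df.asfreq("30min").interpolate()

# Between 23:00 (tp 47) and 00:00 next day (tp 1), linear interpolation
# yields the midpoint of 47 and 1 rather than trading period 48.
midnight_gap = out.loc[pd.Timestamp("2020-10-21 23:30"), "tp"]
print(midnight_gap)

# Recomputing tp from the timestamps avoids that and keeps integers.
out["tp"] = out.index.hour * 2 + out.index.minute // 30 + 1
print(out.loc[pd.Timestamp("2020-10-21 23:30"), "tp"])
```

Within a single day the interpolated tp values happen to be right (2.0, 4.0, ...), so the issue only shows up on the row that spans midnight.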