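I have a dataframe with a column named DateTime whose datetime values are populated every 5 seconds. A few rows are missing, however, and they can be identified by looking at the time difference between the previous row and the current row. I want to insert the missing rows and fill the other columns with the values from the previous row.
My sample dataframe looks like this: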
DateTime Price
2022-03-04 09:15:00 34526.00
2022-03-04 09:15:05 34487.00
2022-03-04 09:15:10 34470.00
2022-03-04 09:15:20 34466.00
2022-03-04 09:15:45 34448.00
The resulting dataframe should look like this:
DateTime Price
2022-03-04 09:15:00 34526.00
2022-03-04 09:15:05 34487.00
2022-03-04 09:15:10 34470.00
2022-03-04 09:15:15 34470.00 <----Insert Row and keep Price same as previous row
2022-03-04 09:15:20 34466.00
2022-03-04 09:15:25 34466.00 <----Insert Row and keep Price same as previous row
2022-03-04 09:15:30 34466.00 <----Insert Row and keep Price same as previous row
2022-03-04 09:15:35 34466.00 <----Insert Row and keep Price same as previous row
2022-03-04 09:15:40 34466.00 <----Insert Row and keep Price same as previous row
2022-03-04 09:15:45 34448.00
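For anyone who wants to reproduce this, here is a minimal sketch that builds the sample frame and locates the gaps from the row-to-row time difference described above. The variable names step, after_gap and missing are only illustrative, and it assumes the DateTime column is already sorted.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "DateTime": pd.to_datetime([
        "2022-03-04 09:15:00",
        "2022-03-04 09:15:05",
        "2022-03-04 09:15:10",
        "2022-03-04 09:15:20",
        "2022-03-04 09:15:45",
    ]),
    "Price": [34526.0, 34487.0, 34470.0, 34466.0, 34448.0],
})

interval = pd.Timedelta(seconds=5)

# Time elapsed since the previous row (NaT for the first row).
step = df["DateTime"].diff()

# Rows that come right after a gap, i.e. more than 5 seconds since the previous row.
after_gap = df[step > interval]

# How many 5-second rows are missing before each of those rows.
missing = step[step > interval] // interval - 1

print(after_gap)
print(missing)
```

On the sample data this flags the 09:15:20 and 09:15:45 rows, with 1 and 4 rows missing before them, which matches the five rows marked for insertion above.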
An alternative using an outer join:
t = pd.date_range(df.DateTime.min(), df.DateTime.max(), freq="5s", name="DateTime")
pd.merge(pd.DataFrame(t), df, how="outer").ffill()
Output:
Out[3]:
DateTime Price
0 2022-03-04 09:15:00 34526.0
1 2022-03-04 09:15:05 34487.0
2 2022-03-04 09:15:10 34470.0
3 2022-03-04 09:15:15 34470.0
4 2022-03-04 09:15:20 34466.0
5 2022-03-04 09:15:25 34466.0
6 2022-03-04 09:15:30 34466.0
7 2022-03-04 09:15:35 34466.0
8 2022-03-04 09:15:40 34466.0
9 2022-03-04 09:15:45 34448.0
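If it is useful to know which rows were inserted (like the arrows in the question), a variation on the merge approach can flag them. This is only a sketch: it swaps the outer join for a left join from the full 5-second grid, so any original row that does not fall on the grid would be dropped, and the Inserted column name is made up for illustration.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "DateTime": pd.to_datetime([
        "2022-03-04 09:15:00",
        "2022-03-04 09:15:05",
        "2022-03-04 09:15:10",
        "2022-03-04 09:15:20",
        "2022-03-04 09:15:45",
    ]),
    "Price": [34526.0, 34487.0, 34470.0, 34466.0, 34448.0],
})

# Full 5-second grid between the first and last timestamp.
grid = pd.DataFrame({
    "DateTime": pd.date_range(df["DateTime"].min(), df["DateTime"].max(), freq="5s")
})

# indicator=True adds a "_merge" column saying where each row came from.
merged = pd.merge(grid, df, on="DateTime", how="left", indicator=True)

# Rows present only in the grid are the ones that were missing from df.
merged["Inserted"] = merged["_merge"] == "left_only"  # hypothetical column name
merged = merged.drop(columns="_merge")

# Carry the previous price forward into the inserted rows.
merged["Price"] = merged["Price"].ffill()

print(merged)
```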
Try resample first, then ffill:
df['DateTime'] = pd.to_datetime(df['DateTime']) # change to datetime dtype
df = df.set_index('DateTime') # move DateTime into index
df_out = df.resample('5s').ffill() # resample to 5-second bins and forward fill ('S' is deprecated in recent pandas)
Output:
Price
DateTime
2022-03-04 09:15:00 34526.0
2022-03-04 09:15:05 34487.0
2022-03-04 09:15:10 34470.0
2022-03-04 09:15:15 34470.0
2022-03-04 09:15:20 34466.0
2022-03-04 09:15:25 34466.0
2022-03-04 09:15:30 34466.0
2022-03-04 09:15:35 34466.0
2022-03-04 09:15:40 34466.0
2022-03-04 09:15:45 34448.0
The pandas asfreq method is sufficient (assuming DateTime is already a datetime column):
(df
.set_index("DateTime")
.asfreq(freq="5s", method="ffill")
.reset_index()
)
DateTime Price
0 2022-03-04 09:15:00 34526.0
1 2022-03-04 09:15:05 34487.0
2 2022-03-04 09:15:10 34470.0
3 2022-03-04 09:15:15 34470.0
4 2022-03-04 09:15:20 34466.0
5 2022-03-04 09:15:25 34466.0
6 2022-03-04 09:15:30 34466.0
7 2022-03-04 09:15:35 34466.0
8 2022-03-04 09:15:40 34466.0
9 2022-03-04 09:15:45 34448.0
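As a quick sanity check, the resample and asfreq approaches can be compared on the sample data. This is a sketch, not a general proof that the two always agree; the names via_resample, via_asfreq, same and regular are illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "DateTime": pd.to_datetime([
        "2022-03-04 09:15:00",
        "2022-03-04 09:15:05",
        "2022-03-04 09:15:10",
        "2022-03-04 09:15:20",
        "2022-03-04 09:15:45",
    ]),
    "Price": [34526.0, 34487.0, 34470.0, 34466.0, 34448.0],
})

indexed = df.set_index("DateTime")

# The two index-based approaches from the answers above.
via_resample = indexed.resample("5s").ffill()
via_asfreq = indexed.asfreq(freq="5s", method="ffill")

# Do both give the same rows and values on this data?
same = via_resample.equals(via_asfreq)

# Is every consecutive pair of timestamps exactly 5 seconds apart?
spacing = via_asfreq.index.to_series().diff().dropna()
regular = bool((spacing == pd.Timedelta(seconds=5)).all())

print(same, regular, len(via_asfreq))
```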
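Another option:
- Create a new dataframe with the desired date range:
df_2 = pd.DataFrame({ "DateTime": pd.date_range(start=df.loc[0, "DateTime"], end=df.loc[len(df.index)-1, "DateTime"], freq="5s") })
- Merge the new and original dataframes using an outer join:
df = pd.merge(df, df_2, how="outer").sort_values("DateTime")
- Forward-fill the null values with .ffill() (the equivalent .fillna(method="ffill") is deprecated in recent pandas versions):
df.ffill()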
Output:
DateTime Price
0 2022-03-04 09:15:00 34526.0
1 2022-03-04 09:15:05 34487.0
2 2022-03-04 09:15:10 34470.0
5 2022-03-04 09:15:15 34470.0
3 2022-03-04 09:15:20 34466.0
6 2022-03-04 09:15:25 34466.0
7 2022-03-04 09:15:30 34466.0
8 2022-03-04 09:15:35 34466.0
9 2022-03-04 09:15:40 34466.0
4 2022-03-04 09:15:45 34448.0
Full code:
df_2 = pd.DataFrame({
"DateTime": pd.date_range(start=df.loc[0, "DateTime"], end=df.loc[len(df.index)-1, "DateTime"], freq="5s")
})
df = pd.merge(df, df_2, how="outer").sort_values("DateTime")
df = df.ffill() # forward fill; fillna(method="ffill") is deprecated in recent pandas
print(df)