Insert missing rows into a DataFrame and fill the other columns with the previous row's values



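I have a DataFrame with a column named DateTime whose values are spaced 5 seconds apart. A few rows are missing, however; they can be identified from the time difference between the previous row and the current row. I want to insert the missing rows and fill the other columns with the previous row's values.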

My sample DataFrame is as follows:

DateTime       Price
2022-03-04 09:15:00    34526.00
2022-03-04 09:15:05    34487.00
2022-03-04 09:15:10    34470.00
2022-03-04 09:15:20    34466.00
2022-03-04 09:15:45    34448.00
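
The gaps in this sample can be located from the time difference between consecutive rows. A minimal sketch, assuming DateTime is already a datetime column; the names step, gap and missing_before are illustrative and not part of the original post:

```python
import pandas as pd

# Sample data from the question, rebuilt so the snippet is self-contained.
df = pd.DataFrame({
    "DateTime": pd.to_datetime([
        "2022-03-04 09:15:00",
        "2022-03-04 09:15:05",
        "2022-03-04 09:15:10",
        "2022-03-04 09:15:20",
        "2022-03-04 09:15:45",
    ]),
    "Price": [34526.0, 34487.0, 34470.0, 34466.0, 34448.0],
})

step = pd.Timedelta(seconds=5)  # expected spacing between rows

# Time elapsed since the previous row; the first row has no predecessor (NaT).
gap = df["DateTime"].diff()

# Number of rows missing before each row: a 10-second gap means 1 missing row.
missing_before = (gap // step - 1).fillna(0).astype(int)

# Rows that are preceded by at least one missing timestamp.
gaps = df.loc[missing_before > 0].assign(missing=missing_before)
print(gaps)
```

This only reports where the gaps are; it does not insert anything.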

The resulting DataFrame should look like this:

DateTime       Price
2022-03-04 09:15:00    34526.00
2022-03-04 09:15:05    34487.00
2022-03-04 09:15:10    34470.00
2022-03-04 09:15:15    34470.00 <----Insert Row and keep Price same as previous row
2022-03-04 09:15:20    34466.00
2022-03-04 09:15:25    34466.00 <----Insert Row and keep Price same as previous row
2022-03-04 09:15:30    34466.00 <----Insert Row and keep Price same as previous row
2022-03-04 09:15:35    34466.00 <----Insert Row and keep Price same as previous row
2022-03-04 09:15:40    34466.00 <----Insert Row and keep Price same as previous row
2022-03-04 09:15:45    34448.00

An alternative using an outer join:

t = pd.date_range(df.DateTime.min(), df.DateTime.max(), freq="5s", name="DateTime")
pd.merge(pd.DataFrame(t), df, how="outer").ffill()

Output:

Out[3]:
DateTime    Price
0 2022-03-04 09:15:00  34526.0
1 2022-03-04 09:15:05  34487.0
2 2022-03-04 09:15:10  34470.0
3 2022-03-04 09:15:15  34470.0
4 2022-03-04 09:15:20  34466.0
5 2022-03-04 09:15:25  34466.0
6 2022-03-04 09:15:30  34466.0
7 2022-03-04 09:15:35  34466.0
8 2022-03-04 09:15:40  34466.0
9 2022-03-04 09:15:45  34448.0

Try resample first, then ffill:

df['DateTime'] = pd.to_datetime(df['DateTime']) # change to datetime dtype
df = df.set_index('DateTime')                   # move DateTime into index 
df_out = df.resample('5s').ffill()              # resample to 5 seconds and forward fill ('5S' is deprecated since pandas 2.2)

Output:

Price
DateTime                    
2022-03-04 09:15:00  34526.0
2022-03-04 09:15:05  34487.0
2022-03-04 09:15:10  34470.0
2022-03-04 09:15:15  34470.0
2022-03-04 09:15:20  34466.0
2022-03-04 09:15:25  34466.0
2022-03-04 09:15:30  34466.0
2022-03-04 09:15:35  34466.0
2022-03-04 09:15:40  34466.0
2022-03-04 09:15:45  34448.0

The pandas asfreq method is enough:

(df
.set_index("DateTime")
.asfreq(freq="5s", method="ffill")
.reset_index()
)
DateTime    Price
0 2022-03-04 09:15:00  34526.0
1 2022-03-04 09:15:05  34487.0
2 2022-03-04 09:15:10  34470.0
3 2022-03-04 09:15:15  34470.0
4 2022-03-04 09:15:20  34466.0
5 2022-03-04 09:15:25  34466.0
6 2022-03-04 09:15:30  34466.0
7 2022-03-04 09:15:35  34466.0
8 2022-03-04 09:15:40  34466.0
9 2022-03-04 09:15:45  34448.0
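
Since the question asks about filling the other columns in general, here is a small sketch showing that asfreq forward-fills every column at once. The Volume column and its values are made up for illustration and do not appear in the original data:

```python
import pandas as pd

# Hypothetical two-column frame; "Volume" is an assumption, not from the question.
df = pd.DataFrame({
    "DateTime": pd.to_datetime([
        "2022-03-04 09:15:00",
        "2022-03-04 09:15:10",
        "2022-03-04 09:15:20",
    ]),
    "Price": [34526.0, 34470.0, 34466.0],
    "Volume": [10, 12, 7],
})

filled = (
    df.set_index("DateTime")
      .asfreq("5s", method="ffill")  # regular 5-second grid, forward-filled
      .reset_index()
)
print(filled)
```

The index has to be sorted in ascending order for method="ffill" to work.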

Another option:

  1. Create a new DataFrame covering the desired date range

    df_2 = pd.DataFrame({
    "DateTime": pd.date_range(start=df.loc[0, "DateTime"], end=df.loc[len(df.index)-1, "DateTime"], freq="5s")
    })
    
  2. Merge the new and original DataFrames with an outer join

    df = pd.merge(df, df_2, how="outer").sort_values("DateTime")
    
  3. Forward-fill the null values with .ffill() (the older .fillna(method="ffill") spelling is deprecated in recent pandas versions)

    df.ffill()
    

Output:

DateTime    Price
0 2022-03-04 09:15:00  34526.0
1 2022-03-04 09:15:05  34487.0
2 2022-03-04 09:15:10  34470.0
5 2022-03-04 09:15:15  34470.0
3 2022-03-04 09:15:20  34466.0
6 2022-03-04 09:15:25  34466.0
7 2022-03-04 09:15:30  34466.0
8 2022-03-04 09:15:35  34466.0
9 2022-03-04 09:15:40  34466.0
4 2022-03-04 09:15:45  34448.0

Complete code:

df_2 = pd.DataFrame({
"DateTime": pd.date_range(start=df.loc[0, "DateTime"], end=df.loc[len(df.index)-1, "DateTime"], freq="5s")
})
df = pd.merge(df, df_2, how="outer").sort_values("DateTime")
df = df.ffill()  # forward fill; fillna(method="ffill") is deprecated in recent pandas
print(df)
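
For comparison, the three approaches from the answers (outer merge, resample, asfreq) can be run side by side on the sample data. This is a sketch under the assumption that DateTime is a sorted datetime column with no duplicates; the variable names grid, via_merge, via_resample and via_asfreq are illustrative:

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "DateTime": pd.to_datetime([
        "2022-03-04 09:15:00",
        "2022-03-04 09:15:05",
        "2022-03-04 09:15:10",
        "2022-03-04 09:15:20",
        "2022-03-04 09:15:45",
    ]),
    "Price": [34526.0, 34487.0, 34470.0, 34466.0, 34448.0],
})

# 1. Outer merge against a complete 5-second grid, then forward fill.
grid = pd.date_range(df["DateTime"].min(), df["DateTime"].max(), freq="5s")
via_merge = (
    pd.merge(pd.DataFrame({"DateTime": grid}), df, how="outer", on="DateTime")
      .sort_values("DateTime")
      .ffill()
      .reset_index(drop=True)
)

# 2. Resample onto 5-second bins and forward fill.
via_resample = df.set_index("DateTime").resample("5s").ffill().reset_index()

# 3. Reindex onto a fixed frequency with asfreq.
via_asfreq = df.set_index("DateTime").asfreq("5s", method="ffill").reset_index()

for name, out in [("merge", via_merge), ("resample", via_resample), ("asfreq", via_asfreq)]:
    print(name, len(out), out["Price"].tolist())
```

On this data all three produce the same ten rows, so the choice is mostly a matter of taste; resample and asfreq avoid building the helper DataFrame.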
