What is the fastest way to filter a pandas time series?

Right now I use boolean masking to filter the time series ts:

import time
from datetime import datetime
import numpy as np  # only needed for the 2nd method
import pandas as pd
import statistics

# create time series
idx = pd.date_range(start='2022-01-01', end='2023-01-01', freq="min")
ts = pd.Series(1, index=idx)
start_dt = datetime(2022, 1, 1, 0, 0, 0)
end_dt = datetime(2022, 1, 2, 0, 0, 0)
time_lst = []

# measure performance of boolean masking
for i in range(100):
    start = time.time()
    # 1st method
    mask = (ts.index > start_dt) & (ts.index <= end_dt)
    # 2nd method, nearly the same speed
    # mask = np.where((ts.index > start_dt) & (ts.index <= end_dt), True, False)
    time_lst.append(time.time() - start)

print(statistics.mean(time_lst))
filtered_ts = ts.loc[mask]

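I would like to know whether this is already the fastest way (about 0.003 seconds per run here) or whether there are other methods. I apply the mask thousands of times with different start_dt and end_dt values, and in total that adds up to a significant amount of time that I would like to reduce.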

Your solution is already quite fast:

%timeit ts[(ts.index > start_dt) & (ts.index <= end_dt)]
5.02 ms ± 413 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit ts[ts.index.to_series().between(start_dt, end_dt, inclusive='left')]
8.22 ms ± 160 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

However, if it is acceptable to change the solution so that both datetimes are included, using Series.loc is faster:

%timeit ts[(ts.index >= start_dt) & (ts.index <= end_dt)]
%timeit ts.loc[start_dt:end_dt]
138 µs ± 1.51 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
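
If the original half-open interval (start_dt excluded, end_dt included) has to be kept, one possible approach is to find the slice positions with a binary search on the index and then slice by position. The following is only a minimal sketch, not a benchmarked result: it assumes the index is sorted in ascending order, and the helper name slice_left_open is made up for illustration.

```python
import pandas as pd

# Same minute-frequency series as in the question.
idx = pd.date_range(start="2022-01-01", end="2023-01-01", freq="min")
ts = pd.Series(1, index=idx)


def slice_left_open(series, start_dt, end_dt):
    """Return rows with start_dt < index <= end_dt.

    Hypothetical helper; assumes series.index is a sorted DatetimeIndex.
    """
    # side="right" puts each bound just after any equal timestamp,
    # so start_dt itself is excluded and end_dt itself is included.
    lo = series.index.searchsorted(pd.Timestamp(start_dt), side="right")
    hi = series.index.searchsorted(pd.Timestamp(end_dt), side="right")
    return series.iloc[lo:hi]


start_dt = pd.Timestamp("2022-01-01")
end_dt = pd.Timestamp("2022-01-02")

# Reference result from the boolean mask used in the question.
by_mask = ts[(ts.index > start_dt) & (ts.index <= end_dt)]
# Same rows, selected by position instead of by a full-length mask.
by_position = slice_left_open(ts, start_dt, end_dt)
```

Like ts.loc[start_dt:end_dt], this avoids building a boolean array over the whole index, so it should scale better when the same series is filtered thousands of times. Timing it with %timeit on your own data is the only way to confirm the gain.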
