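I have a DataFrame with more than 250k rows and I want to compute a rolling regression slope. The code below does it, but it takes over a minute. Is there anything I can do to speed it up?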
import numpy as np
import pandas as pd
from datetime import datetime
from scipy.stats import linregress

# Some data
df = pd.DataFrame({'y': np.random.normal(0, 1, 250000)})

def compute_slope(y):
    # Regress the window's values on their positions 0..len(y)-1
    output = linregress(list(range(len(y))), y)
    return output.slope

start = datetime.now()
df['slopes'] = df['y'].rolling(window=15).apply(compute_slope)
print(f"Duration of rolling slopes = {datetime.now() - start}")
Out[12]: Duration of rolling slopes = 0:01:06.327182
Using np.polyfit and as_strided, you can do something like this:
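import numpy as np
from numpy.lib.stride_tricks import as_strided

window = 15
ys = df.y.to_numpy()
stride = ys.strides  # one-element tuple: bytes between consecutive values
# View ys as overlapping windows of shape (len(df)-window+1, window) without copying,
# then transpose so each column is one window and polyfit fits them all at once.
slopes, intercepts = np.polyfit(np.arange(window),
                                as_strided(ys, (len(df)-window+1, window),
                                           stride+stride).T,
                                deg=1)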
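Note that the slopes array has len(df) - window + 1 entries, whereas rolling(...).apply returns len(df) values with NaN in the first window - 1 rows. Here is a minimal sketch of lining the two up. It assumes NumPy 1.20 or later, so that sliding_window_view (a bounds-checked alternative to as_strided) is available; the helper name rolling_slopes is made up for illustration, and the data is a smaller random sample rather than the original 250k rows.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view


def rolling_slopes(y, window):
    """Hypothetical helper: rolling OLS slope of y against 0..window-1."""
    ys = np.asarray(y, dtype=float)
    # Shape (n - window + 1, window); a read-only view, no data is copied.
    windows = sliding_window_view(ys, window)
    # polyfit fits every column of a 2-D y at once, so pass windows as columns.
    slopes, _intercepts = np.polyfit(np.arange(window), windows.T, deg=1)
    # Pad the front with NaN so the output lines up with rolling().apply().
    out = np.full(ys.shape[0], np.nan)
    out[window - 1:] = slopes
    return out


rng = np.random.default_rng(0)
df = pd.DataFrame({"y": rng.normal(0, 1, 1000)})
df["slopes"] = rolling_slopes(df["y"], window=15)
```

The padded array can be assigned straight back to the DataFrame, so the column should be comparable row by row with the one produced by the original rolling().apply() call.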
Performance:
CPU times: user 148 ms, sys: 9.86 ms, total: 157 ms
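Because the x values are the same for every window (0 to window - 1), the least-squares slope reduces to a fixed weighted sum of the window's y values: slope = sum((x - mean(x)) * y) / sum((x - mean(x))**2). As an alternative that is not part of the answer above, this can be sketched with np.correlate. The function name rolling_slopes_closed_form is hypothetical and I have not timed it against the polyfit version.

```python
import numpy as np


def rolling_slopes_closed_form(y, window):
    """Hypothetical helper: rolling OLS slope via a fixed weight vector."""
    ys = np.asarray(y, dtype=float)
    x = np.arange(window)
    xc = x - x.mean()  # centred x, identical for every window
    weights = xc / (xc ** 2).sum()
    # 'valid' mode computes sum_k ys[i + k] * weights[k] for each full window.
    return np.correlate(ys, weights, mode="valid")


rng = np.random.default_rng(0)
ys = rng.normal(0, 1, 1000)
slopes = rolling_slopes_closed_form(ys, 15)
```

Like the polyfit result, this has len(ys) - window + 1 entries and only gives slopes, not intercepts.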