My DataFrame looks like this:
import numpy as np
import pandas as pd

df = pd.DataFrame({'date': [1, 2, 3, 4, 5, 6, 7, 8], 'price': [4.95, 5.04, 4.88, 4.22, 5.67, 5.89, 5.50, 5.12]})
pd.set_option('display.max_columns', None)
# Add lagged copies of price: price_lag1M ... price_lag6M
for lag in range(1, 7):
    df[f'price_lag{lag}M'] = df['price'].shift(lag)
print(df)
>>
   date  price  price_lag1M  price_lag2M  price_lag3M  price_lag4M  \
0     1   4.95          NaN          NaN          NaN          NaN
1     2   5.04         4.95          NaN          NaN          NaN
2     3   4.88         5.04         4.95          NaN          NaN
3     4   4.22         4.88         5.04         4.95          NaN
4     5   5.67         4.22         4.88         5.04         4.95
5     6   5.89         5.67         4.22         4.88         5.04
6     7   5.50         5.89         5.67         4.22         4.88
7     8   5.12         5.50         5.89         5.67         4.22

   price_lag5M  price_lag6M
0          NaN          NaN
1          NaN          NaN
2          NaN          NaN
3          NaN          NaN
4          NaN          NaN
5         4.95          NaN
6         5.04         4.95
7         4.88         5.04
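I want to compute the slope across the lags for each month. I have been using np.polyfit, and although it is quite fast, it gives me NaN for any row that contains at least one NaN.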
X = [1,2,3,4,5,6]
vars_to_consider = [f'price_lag{i}M' for i in range(1,7)]
Y = df.loc[:, vars_to_consider].values.T
df.loc[:, 'price_trend_6M'] = np.polyfit(X,Y,1)[0].round(4)
df = df.drop(vars_to_consider, axis=1)
print(df)
>>
   date  price  price_trend_6M
0     1   4.95             NaN
1     2   5.04             NaN
2     3   4.88             NaN
3     4   4.22             NaN
4     5   5.67             NaN
5     6   5.89             NaN
6     7   5.50         -0.1694
7     8   5.12         -0.1937
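I want to compute the slope from whatever non-null values are available, ignoring the nulls, and do so for all rows. For small data like this, I would do it this way: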
vars_to_consider = [f'price_lag{i}M' for i in range(1,7)]
for i in range(len(df)):
    Y = df.loc[i, vars_to_consider].values
    # Positions of the non-null lags in this row
    idx = np.where(~np.isnan(Y))[0]
    if len(idx) < 2:
        # A line cannot be fitted through fewer than two points
        df.loc[i, 'price_trend_6M'] = np.nan
    else:
        df.loc[i, 'price_trend_6M'] = np.polyfit(np.arange(len(idx)), Y[idx], 1)[0].round(4)
df = df.drop(vars_to_consider, axis=1)
print(df)
>>
   date  price  price_trend_6M
0     1   4.95             NaN
1     2   5.04             NaN
2     3   4.88         -0.0900
3     4   4.22          0.0350
4     5   5.67          0.2350
5     6   5.89         -0.0620
6     7   5.50         -0.1694
7     8   5.12         -0.1937
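However, the original DataFrame is about 300k rows long and has about 80 variables like "price" for which I want to compute the trend, so the second approach takes too long. Is there a faster way to get the second output?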
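Recognize that, because your maximum shift is 6 rows, np.polyfit will return nan only for the first six rows. You can keep using np.polyfit on the whole DataFrame and then simply loop over the first six rows to correct them. Because you know you will only iterate over a fixed, small number of rows, this will be much faster than iterating over all rows as you did in your second snippet.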
# Vectorized call for the entire DF
# Note that X needs to be an array for the mask in the loop below to work
X = np.array([1,2,3,4,5,6])
vars_to_consider = [f'price_lag{i}M' for i in range(1,7)]
Y = df.loc[:, vars_to_consider].values.T
df.loc[:, 'price_trend_6M'] = np.polyfit(X,Y,1)[0].round(4)

# Fix first six rows
for i, row in df.head(len(X)).iterrows():
    ydata = row.loc[vars_to_consider].values
    mask = ~np.isnan(ydata) # Don't need `np.where` if we use boolean indexing
    if mask.sum() >= 2: # If >= 2 points, make a polyfit
        df.loc[i, 'price_trend_6M'] = np.polyfit(X[mask],ydata[mask],1)[0].round(4)

df = df.drop(vars_to_consider, axis=1)
This gives the result you want:
   date  price  price_trend_6M
0     1   4.95             NaN
1     2   5.04             NaN
2     3   4.88         -0.0900
3     4   4.22          0.0350
4     5   5.67          0.2350
5     6   5.89         -0.0620
6     7   5.50         -0.1694
7     8   5.12         -0.1937
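If even the clean-up loop becomes a bottleneck (for example, with about 80 variables), the slope can also be computed in closed form for all rows at once, without calling np.polyfit at all. The following is only a minimal sketch under stated assumptions: nan_slopes is a hypothetical helper name (not part of NumPy or pandas), and it uses the textbook least-squares slope formula (n·Σxy − Σx·Σy) / (n·Σx² − (Σx)²), with NaN entries masked out of every sum.

```python
import numpy as np
import pandas as pd

def nan_slopes(Y, x=None):
    """Row-wise least-squares slope of Y (n_rows, n_lags), skipping NaNs.

    Hypothetical helper for illustration. If x is None, the non-null values
    of each row are numbered 0, 1, 2, ... in order, like np.arange(len(idx))
    in the question. Rows with fewer than two usable points get NaN.
    """
    Y = np.asarray(Y, dtype=float)
    valid = ~np.isnan(Y)
    if x is None:
        # Rank of each entry among the non-null entries of its row
        X = np.cumsum(valid, axis=1) - 1.0
    else:
        X = np.broadcast_to(np.asarray(x, dtype=float), Y.shape)
    # Zero out masked entries so they drop out of every sum
    Xv = np.where(valid, X, 0.0)
    Yv = np.where(valid, Y, 0.0)
    n = valid.sum(axis=1)
    sx, sy = Xv.sum(axis=1), Yv.sum(axis=1)
    sxy, sxx = (Xv * Yv).sum(axis=1), (Xv * Xv).sum(axis=1)
    denom = n * sxx - sx ** 2
    with np.errstate(divide="ignore", invalid="ignore"):
        slope = (n * sxy - sx * sy) / denom
    slope[(n < 2) | (denom == 0)] = np.nan
    return slope

df = pd.DataFrame({'date': [1, 2, 3, 4, 5, 6, 7, 8],
                   'price': [4.95, 5.04, 4.88, 4.22, 5.67, 5.89, 5.50, 5.12]})
# Build the lag matrix directly instead of adding and dropping columns
lags = np.column_stack([df['price'].shift(lag).to_numpy() for lag in range(1, 7)])
df['price_trend_6M'] = nan_slopes(lags).round(4)
print(df)
```

With x=None the helper mirrors the question's np.arange(len(idx)) numbering, while passing x=[1,2,3,4,5,6] mirrors the X[mask] choice used in the loop above. The two agree whenever the missing lags sit at the end of a row, as they do here, but they give different slopes if a NaN appears between two valid lags.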
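@Pranav's answer is good and solves the question as I posed it. My original data has multiple ids, each with multiple dates and prices, so the rows to fix are not always the first 6. Still, the rows whose slope has to be computed manually from the non-null values are far fewer than the total number of rows. This is what I ended up using: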
X = [1,2,3,4,5,6]
vars_to_consider = [f'price_lag{i}M' for i in range(1,7)]
Y = df.loc[:, vars_to_consider].values
df.loc[:, 'price_trend_6M'] = np.polyfit(X,Y.T,1)[0].round(4)

# Boolean mask of the non-null entries of Y
idx = ~np.isnan(Y)
# Flag rows with 2 to 5 non-null values, since these rows need mending
idx2 = (idx.sum(axis=1) >= 2) & (idx.sum(axis=1) <= 5)
# Loop over these rows only, and calculate slopes with non-null values
# (i is a row position, so df.loc[i, ...] assumes the default 0..n-1 index)
for i in np.where(idx2)[0]:
    y = Y[i][~np.isnan(Y[i])]
    x = np.arange(len(y))
    df.loc[i, 'price_trend_6M'] = np.polyfit(x,y,1)[0].round(4)

df = df.drop(vars_to_consider, axis=1)
print(df)
>>
   date  price  price_trend_6M
0     1   4.95             NaN
1     2   5.04             NaN
2     3   4.88         -0.0900
3     4   4.22          0.0350
4     5   5.67          0.2350
5     6   5.89         -0.0620
6     7   5.50         -0.1694
7     8   5.12         -0.1937
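For the multi-id case described above, the lags presumably have to be built per id, which is why the incomplete rows end up scattered through the frame instead of sitting at the top. The sketch below illustrates this under stated assumptions: the 'id' column, its values, and the split of the sample prices across two ids are all hypothetical, and only 3 lags are used to keep the example small.

```python
import numpy as np
import pandas as pd

# Hypothetical long-format data: the 'id' column and its values are assumptions
df = pd.DataFrame({
    'id':    ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
    'date':  [1, 2, 3, 4, 1, 2, 3, 4],
    'price': [4.95, 5.04, 4.88, 4.22, 5.67, 5.89, 5.50, 5.12],
})

# Sort so shifting follows date order within each id, and reset the index
# so that row labels equal row positions (needed for df.loc[i, ...] later)
df = df.sort_values(['id', 'date']).reset_index(drop=True)

n_lags = 3
lag_cols = [f'price_lag{lag}M' for lag in range(1, n_lags + 1)]
for lag, col in enumerate(lag_cols, start=1):
    # Shift within each id so one id's prices never leak into another's lags
    df[col] = df.groupby('id')['price'].shift(lag)

# Rows with at least 2 but fewer than n_lags non-null lags need mending
n_valid = df[lag_cols].notna().sum(axis=1)
needs_fix = (n_valid >= 2) & (n_valid < n_lags)
print(np.flatnonzero(needs_fix))
```

Here the rows to mend are the third row of each id, so the fixed "first six rows" loop no longer applies, while a mask-based selection like idx2 still finds them.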