Calculating the slope over the non-null points in a row of observations in Python



My DataFrame looks like this:

import numpy as np
import pandas as pd

df = pd.DataFrame({'date':[1,2,3,4,5,6,7,8], 'price':[4.95, 5.04, 4.88, 4.22, 5.67, 5.89, 5.50, 5.12]})
pd.set_option('display.max_columns', None)
# Build one column per lag: price_lag1M ... price_lag6M
for lag in range(1,7):
    df[f'price_lag{lag}M'] = df['price'].shift(lag)
print(df)
>>
date  price  price_lag1M  price_lag2M  price_lag3M  price_lag4M  
0      1   4.95          NaN          NaN          NaN          NaN   
1      2   5.04         4.95          NaN          NaN          NaN   
2      3   4.88         5.04         4.95          NaN          NaN   
3      4   4.22         4.88         5.04         4.95          NaN   
4      5   5.67         4.22         4.88         5.04         4.95   
5      6   5.89         5.67         4.22         4.88         5.04   
6      7   5.50         5.89         5.67         4.22         4.88   
7      8   5.12         5.50         5.89         5.67         4.22   
price_lag5M  price_lag6M  
0          NaN          NaN  
1          NaN          NaN  
2          NaN          NaN  
3          NaN          NaN  
4          NaN          NaN  
5         4.95          NaN  
6         5.04         4.95  
7         4.88         5.04  

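I want to calculate the slope across the lags for each month. I have been using np.polyfit. It is quite fast, but it gives me NaN for any row that contains at least one NaN.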

X = [1,2,3,4,5,6]
vars_to_consider = [f'price_lag{i}M' for i in range(1,7)]
Y = df.loc[:, vars_to_consider].values.T
df.loc[:, 'price_trend_6M'] = np.polyfit(X,Y,1)[0].round(4)
df = df.drop(vars_to_consider, axis=1)
print(df)
>>
date  price  price_trend_6M
0      1   4.95             NaN
1      2   5.04             NaN
2      3   4.88             NaN
3      4   4.22             NaN
4      5   5.67             NaN
5      6   5.89             NaN
6      7   5.50         -0.1694
7      8   5.12         -0.1937

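I want to calculate the slope over whatever non-null values a row has, ignoring the nulls, and do this for every row. For small data like this, I would do the following: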

vars_to_consider = [f'price_lag{i}M' for i in range(1,7)]
for i in range(len(df)):
    Y = df.loc[i, vars_to_consider].values
    idx = np.where(~np.isnan(Y))[0]
    if len(idx) < 2:
        # Fewer than two points: a slope is not defined
        df.loc[i, 'price_trend_6M'] = np.nan
    else:
        df.loc[i, 'price_trend_6M'] = np.polyfit(np.arange(len(idx)), Y[idx], 1)[0].round(4)
df = df.drop(vars_to_consider, axis=1)
print(df)
>>
date  price  price_trend_6M
0      1   4.95             NaN
1      2   5.04             NaN
2      3   4.88         -0.0900
3      4   4.22          0.0350
4      5   5.67          0.2350
5      6   5.89         -0.0620
6      7   5.50         -0.1694
7      8   5.12         -0.1937

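However, the original DataFrame is about 300k rows long and has around 80 variables like "price" for which I want to calculate the trend, so the second approach takes too long. Is there a faster way to get the second output?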

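Recognize that, since your maximum shift is 6 rows, np.polyfit will return NaN only for the first six rows. You can keep using np.polyfit on the entire DataFrame and then simply iterate over the first six rows to correct them. Because you know you will only ever iterate over a fixed, small number of rows, this will be much faster than iterating over all rows as you do in your second snippet.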

# Vectorized call for the entire DF
# Note that X needs to be an array for the mask in the loop below to work
X = np.array([1,2,3,4,5,6])
vars_to_consider = [f'price_lag{i}M' for i in range(1,7)]
Y = df.loc[:, vars_to_consider].values.T
df.loc[:, 'price_trend_6M'] = np.polyfit(X,Y,1)[0].round(4)
# Fix first six rows
for i, row in df.head(len(X)).iterrows():
    # Cast to float so np.isnan also works if the row came back as object dtype
    ydata = row.loc[vars_to_consider].values.astype(float)
    mask = ~np.isnan(ydata) # Don't need `np.where` if we use boolean indexing
    if mask.sum() >= 2: # If >= 2 points, make a polyfit
        df.loc[i, 'price_trend_6M'] = np.polyfit(X[mask],ydata[mask],1)[0].round(4)
df = df.drop(vars_to_consider, axis=1)

This gives the result you want:

date  price  price_trend_6M
0     1   4.95             NaN
1     2   5.04             NaN
2     3   4.88         -0.0900
3     4   4.22          0.0350
4     5   5.67          0.2350
5     6   5.89         -0.0620
6     7   5.50         -0.1694
7     8   5.12         -0.1937
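
As an aside, the remaining loop could in principle be dropped altogether. The following is only a sketch of a different technique, not what the answer above does: it computes the ordinary least-squares slope in closed form, sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2), for every row at once, with NaN entries masked out of each sum. The helper name `nan_slope` and the variables `demo` and `trend` are hypothetical and introduced only for illustration.

```python
import numpy as np
import pandas as pd


def nan_slope(Y, x):
    """Row-wise least-squares slope of Y against x, skipping NaN entries.

    Y : (n_rows, n_points) array, may contain NaN
    x : (n_points,) array of x positions shared by all rows
    Rows with fewer than two non-null points get NaN.
    """
    Y = np.asarray(Y, dtype=float)
    x = np.asarray(x, dtype=float)
    mask = ~np.isnan(Y)                      # True where a value is present
    n = mask.sum(axis=1)                     # non-null count per row

    with np.errstate(divide="ignore", invalid="ignore"):
        # Per-row means over the non-null points only
        xbar = np.where(mask, x, 0.0).sum(axis=1) / n
        ybar = np.where(mask, Y, 0.0).sum(axis=1) / n
        # Centered values, forced to 0 where the point is missing
        dx = np.where(mask, x - xbar[:, None], 0.0)
        dy = np.where(mask, Y - ybar[:, None], 0.0)
        slope = (dx * dy).sum(axis=1) / (dx * dx).sum(axis=1)

    slope[n < 2] = np.nan                    # a slope needs at least two points
    return slope


# Same example data as in the question
demo = pd.DataFrame({'date': [1, 2, 3, 4, 5, 6, 7, 8],
                     'price': [4.95, 5.04, 4.88, 4.22, 5.67, 5.89, 5.50, 5.12]})
# One column per lag, shape (8, 6)
Y = np.column_stack([demo['price'].shift(lag).to_numpy() for lag in range(1, 7)])
trend = nan_slope(Y, np.arange(1, 7)).round(4)
demo['price_trend_6M'] = trend
print(demo)
```

On the example data this should reproduce the slopes in the table above. Note that it uses the actual lag positions as x, like `X[mask]` in the loop. Whenever the non-null lags are the first k lags, as they are after `shift`, this gives the same slope as refitting against `np.arange(k)`, because a slope does not change when x is shifted by a constant.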

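@Pranav's answer is good and solves the problem as I asked it. My actual data has multiple ids, each with multiple dates and prices, so the affected rows are not always the first six. However, the rows whose slope has to be computed manually from the non-null values are far fewer than the total number of rows. This is what I ended up using: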

X = [1,2,3,4,5,6]
vars_to_consider = [f'price_lag{i}M' for i in range(1,7)]
Y = df.loc[:, vars_to_consider].values
df.loc[:, 'price_trend_6M'] = np.polyfit(X,Y.T,1)[0].round(4)
# Boolean mask of the positions where Y is not null
idx = ~np.isnan(Y)
# Find the rows with 2 to 5 non-null values, since these rows need mending
idx2 = (idx.sum(axis=1) >= 2) & (idx.sum(axis=1) <= 5)
# Run a for loop over these rows, and calculate slopes with non-null values
for i in np.where(idx2)[0]:
    y = Y[i][~np.isnan(Y[i])]
    x = np.arange(len(y))
    # i is a row position, so map it to its index label before using .loc
    df.loc[df.index[i], 'price_trend_6M'] = np.polyfit(x,y,1)[0].round(4)
df = df.drop(vars_to_consider, axis=1)
print(df)
>>
date  price  price_trend_6M
0      1   4.95             NaN
1      2   5.04             NaN
2      3   4.88         -0.0900
3      4   4.22          0.0350
4      5   5.67          0.2350
5      6   5.89         -0.0620
6      7   5.50         -0.1694
7      8   5.12         -0.1937
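
To illustrate why the incomplete rows are not just the first six once there are several ids, here is a minimal sketch under stated assumptions: the column name `id`, the toy values, and the shorter window of 3 lags are all hypothetical, and the data is assumed to be sorted by date within each id. The lags are built per id with `groupby(...).shift(...)`, and the rows that need mending are then located by counting non-null lags, in the same way as `idx2` above.

```python
import numpy as np
import pandas as pd

# Hypothetical multi-id data; 'id' and the values are assumptions for illustration
df_multi = pd.DataFrame({
    'id':    ['a'] * 4 + ['b'] * 4,
    'date':  [1, 2, 3, 4] * 2,
    'price': [4.95, 5.04, 4.88, 4.22, 5.67, 5.89, 5.50, 5.12],
})

n_lags = 3  # shorter window than the 6 lags used above, to keep the toy data small
lag_cols = [f'price_lag{lag}M' for lag in range(1, n_lags + 1)]
for lag, col in enumerate(lag_cols, start=1):
    # Shift within each id so that prices never leak from one id into the next
    df_multi[col] = df_multi.groupby('id')['price'].shift(lag)

# Count the non-null lags per row; rows with at least 2 but fewer than
# n_lags non-null values are the ones that need a separate fit
non_null = df_multi[lag_cols].notna().sum(axis=1).to_numpy()
needs_fix = (non_null >= 2) & (non_null < n_lags)
rows_to_fix = np.where(needs_fix)[0]
print(rows_to_fix)
```

Each id contributes its own few incomplete rows at the start of its history, so the number of rows to loop over grows with the number of ids rather than with the total number of rows.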
