Lagging a pandas variable by one year within groups



I want to create a lagged yearly return variable in pandas.

So far I have tried:

df_ret_lagged = df_ret.set_index(['year', 'cusip'])
df_ret_lagged['yearly_ret_lag'] = df_ret_lagged['year_ret'].shift(12)
df_ret_lagged.reset_index(inplace = True) 

However, this just shifts the yearly return down by 12 rows instead of grouping by year. The DataFrame below shows what the code does.

year    cusip        date       year_ret    yearly_ret_lag
0   1983    000165100   1983-09-01  0.183673    NaN
1   1983    000165100   1983-10-01  0.183673    NaN
2   1983    000165100   1983-11-01  0.183673    NaN
3   1983    000165100   1983-12-01  0.183673    NaN
4   1984    000165100   1984-01-01  -0.482758   NaN
5   1984    000165100   1984-02-01  -0.482758   NaN
6   1984    000165100   1984-03-01  -0.482758   NaN
7   1984    000165100   1984-04-01  -0.482758   NaN
8   1984    000165100   1984-05-01  -0.482758   NaN
9   1984    000165100   1984-06-01  -0.482758   NaN
10  1984    000165100   1984-07-01  -0.482758   NaN
11  1984    000165100   1984-08-01  -0.482758   NaN
12  1984    000165100   1984-09-01  -0.482758   0.183673
13  1984    000165100   1984-10-01  -0.482758   0.183673
14  1984    000165100   1984-11-01  -0.482758   0.183673
15  1984    000165100   1984-12-01  -0.482758   0.183673
16  1985    000165100   1985-01-01  1.700000    -0.482758
17  1985    000165100   1985-02-01  1.700000    -0.482758
18  1985    000165100   1985-03-01  1.700000    -0.482758
19  1985    000165100   1985-04-01  1.700000    -0.482758

Ideally, I would like the 1983 year_ret to fill every 1984 date, and so on. In addition, this has to be grouped by cusip (the company identifier).

Thanks!

I used a for loop:

for year in df['year'].unique()[1:]:  # all the years except the first
    df.loc[df['year'] == year, 'year_ret_lag'] = df.loc[df['year'] == year - 1, 'year_ret'].iloc[0]

df

year    cusip   date        year_ret    year_ret_lag
0   1983    165100  01/09/1983  0.183673    NaN
1   1983    165100  01/10/1983  0.183673    NaN
2   1983    165100  01/11/1983  0.183673    NaN
3   1983    165100  01/12/1983  0.183673    NaN
4   1984    165100  01/01/1984  -0.482758   0.183673
5   1984    165100  01/02/1984  -0.482758   0.183673
6   1984    165100  01/03/1984  -0.482758   0.183673
7   1984    165100  01/04/1984  -0.482758   0.183673
8   1984    165100  01/05/1984  -0.482758   0.183673
9   1984    165100  01/06/1984  -0.482758   0.183673
10  1984    165100  01/07/1984  -0.482758   0.183673
11  1984    165100  01/08/1984  -0.482758   0.183673
12  1984    165100  01/09/1984  -0.482758   0.183673
13  1984    165100  01/10/1984  -0.482758   0.183673
14  1984    165100  01/11/1984  -0.482758   0.183673
15  1984    165100  01/12/1984  -0.482758   0.183673
16  1985    165100  01/01/1985  1.700000    -0.482758
17  1985    165100  01/02/1985  1.700000    -0.482758
18  1985    165100  01/03/1985  1.700000    -0.482758
19  1985    165100  01/04/1985  1.700000    -0.482758
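
The loop above looks years up across the whole frame, so it only fits a single cusip. A minimal sketch of a per-cusip variant follows, assuming year_ret is constant within each (cusip, year) pair as in the question's data. The second cusip and the names `yearly` and `out` are made up for illustration.

```python
import pandas as pd

# Toy data: two cusips; the second cusip and its returns are hypothetical.
df = pd.DataFrame({
    "year":     [1983, 1983, 1984, 1984, 1985, 1984, 1984, 1985],
    "cusip":    ["000165100"] * 5 + ["999999999"] * 3,
    "year_ret": [0.183673, 0.183673, -0.482758, -0.482758, 1.7,
                 0.05, 0.05, -0.10],
})

# One row per (cusip, year); assumes year_ret is constant within each pair.
yearly = df.drop_duplicates(["cusip", "year"])[["cusip", "year", "year_ret"]]

# Relabel each return with the following year so it lines up as a lag.
yearly = yearly.assign(year=yearly["year"] + 1).rename(
    columns={"year_ret": "year_ret_lag"}
)

# A left merge keeps every original row; rows with no prior year get NaN.
out = df.merge(yearly, on=["cusip", "year"], how="left")
print(out)
```

Because the lookup is keyed on year + 1, a cusip that skips a calendar year gets NaN for the following year rather than a stale value.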

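I think this may be what you want. Note that it relies on the DataFrame being sorted and structured correctly beforehand (for example, an entry for every month).

Sort everything by cusip and date before shifting, then wipe the values that were shifted in from a different cusip by overwriting them with NaN. After that you can use .bfill() (the current spelling of .fillna(method='bfill')) to fill those gaps from the next valid value.

import numpy as np

df_new = df_ret.sort_values(['cusip', 'date'])
df_new['yearly_ret_lag'] = df_new['year_ret'].shift(12)
# Blank out rows whose value 12 rows back belongs to a different cusip
df_new.loc[df_new['cusip'] != df_new['cusip'].shift(12), 'yearly_ret_lag'] = np.nan
# Back-fill the blanked rows from the next valid value
df_new['yearly_ret_lag'] = df_new['yearly_ret_lag'].bfill()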
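
Since shift(12) only means "one year back" when every cusip has exactly one row per month with no gaps, it may be worth checking that first. A rough sketch, assuming `date` is a datetime column. The toy data, the second cusip, and the names `month_index`, `gap`, `bad` and `has_gap` are illustrative only.

```python
import pandas as pd

# Toy data: the first cusip is missing 1983-11; the second is hypothetical.
df_ret = pd.DataFrame({
    "cusip": ["000165100"] * 4 + ["999999999"] * 3,
    "date": pd.to_datetime([
        "1983-09-01", "1983-10-01", "1983-12-01", "1984-01-01",
        "1983-11-01", "1983-12-01", "1984-01-01",
    ]),
})

df_sorted = df_ret.sort_values(["cusip", "date"])

# Running month count, so consecutive months differ by exactly 1.
month_index = df_sorted["date"].dt.year * 12 + df_sorted["date"].dt.month

# Step from the previous row within the same cusip (NaN on each first row).
gap = month_index.groupby(df_sorted["cusip"]).diff()

# A step other than 1 means a missing or duplicated month.
bad = gap.notna() & (gap != 1)
has_gap = bad.groupby(df_sorted["cusip"]).any()
print(has_gap)
```

Any cusip flagged here would need its missing months filled in (or a different lag method) before a 12-row shift can be trusted.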

Another solution without a loop:

Build the df:

import numpy as np
import pandas as pd

# Month-start dates from 1983-09 to 1985-12 (28 rows)
dates = pd.date_range("1983-09-01", "1985-12-31", freq="MS")
df = pd.DataFrame(index=dates, columns=["Year", "cusip", "year_ret"])
df['Year'] = df.index
df['Year'] = df['Year'].dt.strftime(date_format='%Y')
df['cusip'] = '01234'
# 4 months of 1983, then 12 months each of 1984 and 1985
df['year_ret'] = [0.183673] * 4 + [-0.482758] * 12 + [1.700000] * 12

And the code:

# Flag the first row of each year (assumes a single cusip, sorted by date)
_condition_1 = df.Year != df.Year.shift(1)
# On those rows, take the previous row's return, i.e. last year's value
df['lag'] = np.where(_condition_1, df['year_ret'].shift(1), np.nan)
# Forward-fill so every row of the year carries the lagged return
df = df.ffill()
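
The snippet above treats the frame as one series, which is fine for a single cusip. Here is a sketch of how the same idea might carry over to several cusips, assuming rows are sorted by date within each cusip. The second cusip, its returns, and the names `prev_year`, `prev_ret` and `year_changed` are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy data: two cusips, already sorted by date within each cusip.
df = pd.DataFrame({
    "Year":  ["1983", "1983", "1984", "1984", "1985", "1984", "1984", "1985"],
    "cusip": ["01234"] * 5 + ["56789"] * 3,
    "year_ret": [0.183673, 0.183673, -0.482758, -0.482758, 1.7,
                 0.05, 0.05, -0.10],
})

grouped = df.groupby("cusip")
# Previous row's year and return, looked up within each cusip only.
prev_year = grouped["Year"].shift(1)
prev_ret = grouped["year_ret"].shift(1)

# Keep the previous return only on the first row of each new year;
# the first row of every cusip has no previous row and stays NaN.
year_changed = prev_year.notna() & (df["Year"] != prev_year)
df["lag"] = np.where(year_changed, prev_ret, np.nan)

# Forward-fill inside each cusip so values never leak across companies.
df["lag"] = df.groupby("cusip")["lag"].ffill()
print(df)
```

Note that this picks up the previous year present in the data, so if a cusip skips a calendar year the lag comes from the last available year rather than exactly one year back.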
