大熊猫部分数据的选择问题



我有以下数据库,这些数据库是用panda从csv文件中提取的:

df1=pd.read_csv(path,parse_dates=True) 

df1的打印给出:

control      Avg_return  
2019-09-07          True            0    
2019-06-06          True            0
2019-02-19          True            0
2019-01-17          True            0
2018-12-20          True            0
2018-11-27          True            0
2018-10-12          True            0
...              ...            ...

在我加载2 csv文件之后

df2=pd.read_csv(path,parse_dates=True)

df2的打印给出:

return
2010-01-01          NaN
2010-04-01     0.010920
2010-05-01    -0.004404
2010-06-01    -0.025209
2010-07-01    -0.023280
...            ...

我的代码的目的是:

  1. 从df1中获取日期
  2. 从第1点的日期中减去6天
  3. 从第1点的日期中减去244天
  4. 在df2中获取这两个日期的所有回报
  5. 计算这些回报的平均值,并将其存储在Avg_return中

我做了这个:

for i in range(0,df1_row):                   
#I go through my data df1 
if (control.iloc[i]==True):                        
#I check if control_1 is true
date_1=df1.index[i]-pd.to_timedelta(6, unit='d')    
# I remove 6 days from my date
date_2=df1.index[i]-pd.to_timedelta(244, unit='d')  
# I remove 244 days from my date
df1.loc[i,"Average_return"] = df2[[date_1:date_2],["return"]].mean()
# I want to make the mean of the return between my date-6 days and my date-244 days

不幸的是,它给了我这个错误:

df1.loc[i,"Average_return"] = df2[[date1:date2],["return"]].mean()
^
SyntaxError: invalid syntax

有人能帮我吗?:(

下面看起来有点难看,但我认为它有效:(

伪df:

import numpy as np
import pandas as pd    
cols = ['date', 'control', 'Avg_return']
data = [
[pd.to_datetime('2019-09-07'), True, 0], 
[pd.to_datetime('2019-06-06'), True, 0]
]
df1 = pd.DataFrame(data, columns=cols)
cols2 = ['date', 'return']
data2 = [
[pd.to_datetime('2010-01-01'), np.nan], 
[pd.to_datetime('2010-04-01'), 0.010920], 
[pd.to_datetime('2019-09-01'), 1]
]
df2 = pd.DataFrame(data2, columns=cols2)

草拟解决方案:

df1['date_minus_6'] = df1['date'] - dt.timedelta(days=6)
df1['date_minus_244'] = df1['date'] - dt.timedelta(days=244)
for i in range(0, df1.shape[0]):
for j in range(0, df2.shape[0]):
if df2['date'].iloc[j] == df1['date_minus_6'].iloc[i]:
df1['Avg_return'].iloc[i] = (
df1['Avg_return'].iloc[i] + df2['return'].iloc[j]
).mean()
elif df2['date'].iloc[j] == df1['date_minus_244'].iloc[i]:
df1['Avg_return'].iloc[i] = (
df1['Avg_return'].iloc[i] + df2['return'].iloc[j]
).mean()

输出:

date    control Avg_return  date_minus_6    date_minus_244
0   2019-09-07  True    1.0 2019-09-01  2019-01-06
1   2019-06-06  True    0.0 2019-05-31  2018-10-05
import csv
import pandas as pd
df1=pd.read_csv('dsf1.csv',parse_dates=True)
df2=pd.read_csv('dsf2.csv',parse_dates=True)
df1.columns = ['date', 'control', 'return']
df2.columns = ['date', 'return']
df1['date'] = pd.to_datetime(df1['date'])
df2['date'] = pd.to_datetime(df2['date'])
for i in range(0, df1.shape[0]):
if df1['control'][i] == True:
date_1 = df1['date'][0] - pd.to_timedelta(6, unit='d')
date_2 = df2['date'][0] - pd.to_timedelta(244, unit='d')
#I'm not sure if average_return has the correct condition, but adjust as you see fit
df1.loc[i, 'average_return'] = (df1[df1['date'] > date_1]['return'] -  df2[df2['date'] > date_2]['return']).mean()
print df1

这是一种不同的方法,不需要在所有行上循环:

# make sure your index is a datetime index
df1.index = pd.to_datetime(df1.index)    
df1['date_1'] = df1.index - pd.to_timedelta(6, unit='d') 
df1['date_2'] = df1.index  - pd.to_timedelta(244, unit='d') 
df1['Average_return'] = df1.apply(lambda r: df2.loc[r['date_1']: r['date_2'], 'return'].mean(), axis=1)