添加Time Until Holiday,引用另一个数据帧



我有一个日期为的大数据帧

df = pd.read_json('{"dt":{"0":"2016-05-12","1":"2016-06-01","2":"2016-05-02","3":"2016-05-11","4":"2016-05-30","5":"2016-05-19","6":"2016-05-25","7":"2016-05-09","8":"2016-05-03","9":"2016-05-18"}}')
df
+------------+
|     dt     |
+------------+
| 2016-05-12 |
| 2016-06-01 |
| 2016-05-02 |
| 2016-05-11 |
| 2016-05-30 |
| 2016-05-19 |
| 2016-05-25 |
| 2016-05-09 |
| 2016-05-03 |
| 2016-05-18 |
+------------+

我还有一个包含特定假期的参考表。事实上,这些节日跨越了好几年,所以名字并不是唯一的,而是日期和名字的组合

dfHolidays = pd.read_json('{"event_dt":{"0":"2015-04-05","1":"2016-03-27","2":"2015-01-01","3":"2015-02-17","4":"2015-02-18","5":"2015-04-03","6":"2015-04-05","7":"2015-04-21","8":"2015-05-01","9":"2015-06-04","10":"2015-09-07","11":"2015-10-12","12":"2015-11-02","13":"2015-11-15","14":"2015-12-25","15":"2016-01-01","16":"2016-02-09","17":"2016-02-10","18":"2016-03-25","19":"2016-03-27","20":"2016-04-21","21":"2016-05-01","22":"2016-05-26","23":"2016-09-07","24":"2016-10-12","25":"2016-11-02","26":"2016-11-15","27":"2016-12-25"},"event_name":{"0":"Easter","1":"Easter","2":"Ano novo","3":"Carnaval","4":"Quarta-feira de cinzas (In\u00edcio da Quaresma)","5":"Sexta-feira Santa","6":"P\u00e1scoa","7":"Tiradentes","8":"Dia Mundial do Trabalho","9":"Corpus Christi","10":"Independ\u00eancia do Brasil","11":"Nossa Senhora Aparecida","12":"Finados","13":"Proclama\u00e7\u00e3o da Rep\u00fablica","14":"Natal","15":"Ano novo","16":"Carnaval","17":"Quarta-feira de cinzas (In\u00edcio da Quaresma)","18":"Sexta-feira Santa","19":"P\u00e1scoa","20":"Tiradentes","21":"Dia Mundial do Trabalho","22":"Corpus Christi","23":"Independ\u00eancia do Brasil","24":"Nossa Senhora Aparecida","25":"Finados","26":"Proclama\u00e7\u00e3o da Rep\u00fablica","27":"Natal"}}')
dfHolidays
+------------+--------------+
| holiday_dt | holiday_name |
+------------+--------------+
| 2015-04-05 | Easter       |
| 2016-03-27 | Easter       |
| 2015-01-01 | Ano novo     |
| 2015-02-17 | Carnaval     |
+------------+--------------+

我想为主数据帧df添加功能

  • 天直到下一个星期
  • 自上一个星期以来的天数

此外,不仅是下一个假期,而且每个特定的假期都列出了

  • 天_直到下一次_没有新的,天_从上一次_无新的
  • 下一次学习前的天数,上一次学习后的天数

您可以使用pd.merge_asof获取未来或过去最近的日期来创建所有这些列。本质上,您需要更改合并中使用的dfHolidays的方向和/或支持:

准备数据

df = df.sort_values('dt')
dfHolidays = dfHolidays.sort_values('event_dt')
df['dt'] = pd.to_datetime(df['dt'])
dfHolidays['event_dt'] = pd.to_datetime(dfHolidays['event_dt'])

合并

def calc_days(df, dfHolidays, direction):
""" Returns Series of integer # of days between events. Absolute value """
s = pd.merge_asof(df, dfHolidays, left_on='dt', right_on='event_dt',
direction=direction)
s = (s['event_dt'] - s['dt']).dt.days.abs()
return s
df['until_next'] = calc_days(df, dfHolidays, 'forward')
df['since_prev'] = calc_days(df, dfHolidays, 'backward')
df['until_next_Ano'] = calc_days(df, dfHolidays[dfHolidays['event_name'].eq('Ano novo')], 'forward')
df['since_prev_Ano'] = calc_days(df, dfHolidays[dfHolidays['event_name'].eq('Ano novo')], 'backward')
#...
print(df)
dt  until_next  since_prev  until_next_Ano  since_prev_Ano
2 2016-05-02          17           8             NaN             129
8 2016-05-03         100           4             NaN             150
7 2016-05-09           1          24             NaN             145
3 2016-05-11          15          10             NaN             131
0 2016-05-12          24           1             NaN             122
9 2016-05-18          98           6             NaN             152
5 2016-05-19           8          17             NaN             138
6 2016-05-25           7          18             NaN             139
4 2016-05-30          14          11             NaN             132
1 2016-06-01          23           2             NaN             123

由于您提供的dfHoliday与df中的日期相比没有未来的Ano-novo日期,因此未来将设置为NaN,因为没有合并:

dfHolidays[dfHolidays['event_name'].eq('Ano novo')]
#     event_dt event_name
#2  2015-01-01   Ano novo
#15 2016-01-01   Ano novo

确保您使用的是pandas 1.1.0版本,并且试试这个

dfHolidays = pd.read_json('{"event_dt":{"0":"2015-04-05","1":"2016-03-27","2":"2015-01-01","3":"2015-02-17","4":"2015-02-18","5":"2015-04-03","6":"2015-04-05","7":"2015-04-21","8":"2015-05-01","9":"2015-06-04","10":"2015-09-07","11":"2015-10-12","12":"2015-11-02","13":"2015-11-15","14":"2015-12-25","15":"2016-01-01","16":"2016-02-09","17":"2016-02-10","18":"2016-03-25","19":"2016-03-27","20":"2016-04-21","21":"2016-05-01","22":"2016-05-26","23":"2016-09-07","24":"2016-10-12","25":"2016-11-02","26":"2016-11-15","27":"2016-12-25"},"event_name":{"0":"Easter","1":"Easter","2":"Ano novo","3":"Carnaval","4":"Quarta-feira de cinzas (In\u00edcio da Quaresma)","5":"Sexta-feira Santa","6":"P\u00e1scoa","7":"Tiradentes","8":"Dia Mundial do Trabalho","9":"Corpus Christi","10":"Independ\u00eancia do Brasil","11":"Nossa Senhora Aparecida","12":"Finados","13":"Proclama\u00e7\u00e3o da Rep\u00fablica","14":"Natal","15":"Ano novo","16":"Carnaval","17":"Quarta-feira de cinzas (In\u00edcio da Quaresma)","18":"Sexta-feira Santa","19":"P\u00e1scoa","20":"Tiradentes","21":"Dia Mundial do Trabalho","22":"Corpus Christi","23":"Independ\u00eancia do Brasil","24":"Nossa Senhora Aparecida","25":"Finados","26":"Proclama\u00e7\u00e3o da Rep\u00fablica","27":"Natal"}}')
dfHolidays.event_dt = dfHolidays.event_dt.apply(lambda d: dt.datetime.strptime(d,'%Y-%m-%d'))
df = pd.read_json('{"dt":{"0":"2016-05-12","1":"2016-06-01","2":"2016-05-02","3":"2016-05-11","4":"2016-05-30","5":"2016-05-19","6":"2016-05-25","7":"2016-05-09","8":"2016-05-03","9":"2016-05-18"}}')
df.dt = df.dt.apply(lambda d: dt.datetime.strptime(d,'%Y-%m-%d'))
df[['days_until_next_holiday', 'days_since_last_holiday']] = None
df['days_until_next_' + dfHolidays.event_name.values]=None
df['days_since_last_' + dfHolidays.event_name.values]=None
for i, r in df.iterrows():
df.loc[i, 'days_until_next_' + dfHolidays.event_name] = (dfHolidays.event_dt - r['dt']).apply(lambda x: x.days if x.days>0 else None).to_list()
df.loc[i, 'days_since_last_' + dfHolidays.event_name] = (r['dt'] - dfHolidays.event_dt).apply(lambda x: x.days if x.days>0 else None).to_list()
df.loc[i, 'days_until_next_holiday'] = df.loc[i, 'days_until_next_' + dfHolidays.event_name].min()
df.loc[i, 'days_since_last_holiday'] = df.loc[i, 'days_since_last_' + dfHolidays.event_name].min()
result=pd.DataFrame()
for c in df.columns:
if not df[c].isna().all():
result[c] = df[c]
df = result
print(df)

最新更新