我目前有一个pandas数据帧,其中每行都有一个日期范围,我想计算这个范围内符合某些条件的天数:
Item | Date Start | Date End
----------------------
A | 02/01/2019 | 03/02/2019
B | 04/02/2019 | 08/02/2019
例如:2019年1月内的天数或2019年的天数:
Item | Date Start | Date End | Days in Jan-2019 | Days in 2019 | Days in Q1 - 2019
------------------------------------------------------------------------------------
A | 02/01/2019 | 03/02/2019 | 30 | 33 | 33
B | 04/04/2019 | 08/04/2019 | 0 | 5 | 0
理想情况下,我不想为范围内的每个日期都创建一行来进行计算,并且可以保持行结构不变,但无法确定如何进行计算,或者最有效的方法是什么。
谢谢!
这里有必要为每对创建范围,如果只有几个条件可能,则使用sum
和条件:
df['Date Start'] = pd.to_datetime(df['Date Start'], dayfirst=True)
df['Date End'] = pd.to_datetime(df['Date End'], dayfirst=True)
s = df.apply(lambda x: pd.date_range(x['Date Start'], x['Date End']), axis=1)
df['Days in Jan-2019'] = s.apply(lambda x: ((x.year == 2019) & (x.month == 1)).sum())
df['Days in 2019'] = s.apply(lambda x: (x.year == 2019).sum())
df['Days in Q1 2019'] = s.apply(lambda x: ((x.year == 2019) & (x.quarter == 1)).sum())
print (df)
Item Date Start Date End Days in Jan-2019 Days in 2019 Days in Q1 2019
0 A 2019-01-02 2019-02-03 30 33 33
1 B 2019-02-04 2019-02-08 0 5 5
另一个想法是使用DataFrame.explode
来压平日期范围,并按级别值聚合sum
:
df['Date Start'] = pd.to_datetime(df['Date Start'], dayfirst=True)
df['Date End'] = pd.to_datetime(df['Date End'], dayfirst=True)
df['r'] = df.apply(lambda x: pd.date_range(x['Date Start'], x['Date End']), axis=1)
df1 = df.explode('r')
df1['Days in Jan-2019'] = (df1['r'].dt.year == 2019) & (df1['r'].dt.month == 1)
df1['Days in 2019'] = df1['r'].dt.year == 2019
df1['Days in Q1 2019'] = (df1['r'].dt.year == 2019) & (df1['r'].dt.quarter == 1)
df = df.drop('r', axis=1).join(df1.sum(level=0))
print (df)
Item Date Start Date End Days in Jan-2019 Days in 2019 Days in Q1 2019
0 A 2019-01-02 2019-02-03 30 33 33
1 B 2019-02-04 2019-02-08 0 5 5