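I have a DataFrame, and I want to collapse a sequence of dates into a start date and an end date for each value of another column.
Here is an example: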
import pandas as pd

Input = pd.DataFrame({
    'date': ['2020-10-10', '2020-10-11', '2020-10-12', '2020-10-13', '2020-10-14'],
    'groupby': ['AAA', 'AAA', 'AAA', 'BBB', 'BBB'],
})
I need the output to look like this:
Output = pd.DataFrame({
    'StartDate': ['2020-10-10', '2020-10-13'],
    'EndDate': ['2020-10-12', '2020-10-14'],
    'groupby': ['AAA', 'BBB'],
})
Is this what you are looking for?
Output = pd.DataFrame()
# first/last row of each group; assumes the rows are already sorted by date
Output['StartDate'] = Input.groupby('groupby')['date'].first()
Output['EndDate'] = Input.groupby('groupby')['date'].last()
Output
          StartDate     EndDate
groupby
AAA      2020-10-10  2020-10-12
BBB      2020-10-13  2020-10-14
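One caveat about first() and last(): they return the first and last row of each group in the frame's current order, so the result above is only correct when the rows are already sorted by date. Here is a minimal sketch that sorts before grouping. It assumes a shuffled copy of the sample data; the names shuffled, ordered, grouped and result are made up for illustration.

```python
import pandas as pd

# Hypothetical shuffled version of the sample data (not from the question).
shuffled = pd.DataFrame({
    'date': ['2020-10-12', '2020-10-10', '2020-10-14', '2020-10-11', '2020-10-13'],
    'groupby': ['AAA', 'AAA', 'BBB', 'AAA', 'BBB'],
})

# ISO-formatted date strings sort chronologically, so sorting the strings works here.
ordered = shuffled.sort_values('date')

# first()/last() now pick the earliest and latest date in each group.
grouped = ordered.groupby('groupby')['date']
result = pd.DataFrame({'StartDate': grouped.first(), 'EndDate': grouped.last()})
print(result)
```

If sorting is not wanted, min() and max() on the same column should give the same start and end dates regardless of row order.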
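Edit: this version fixes the consecutive-dates problem, where the same group value can occur in more than one run of rows. (The output shown below evidently comes from an input in which AAA appears in two separate runs, not from the sample Input above.)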
Output = pd.DataFrame()
# give each run of consecutive identical 'groupby' values its own id, then group by that id
grp = Input.groupby((Input['groupby'] != Input['groupby'].shift()).cumsum())
Output['StartDate'] = grp['date'].first().reset_index(drop=True)
Output['EndDate'] = grp['date'].last().reset_index(drop=True)
Output['groupby'] = grp['groupby'].first().reset_index(drop=True)
Output
    StartDate     EndDate groupby
0  2020-10-10  2020-10-10     AAA
1  2020-10-11  2020-10-12     BBB
2  2020-10-14  2020-10-14     AAA
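The run-based grouping can be checked end to end with a small example. The input below is an assumption: the edited answer does not show its input, so this is a hypothetical frame chosen to be consistent with the output shown (AAA appears in two separate runs). The names runs_input, run_id and runs_output are illustrative.

```python
import pandas as pd

# Hypothetical input (assumed, not from the original post): AAA occurs in two separate runs.
runs_input = pd.DataFrame({
    'date': ['2020-10-10', '2020-10-11', '2020-10-12', '2020-10-14'],
    'groupby': ['AAA', 'BBB', 'BBB', 'AAA'],
})

# A new run starts wherever the group value differs from the previous row;
# the cumulative sum of those change flags gives every run its own id.
changed = runs_input['groupby'] != runs_input['groupby'].shift()
run_id = changed.cumsum().rename('run_id')

# Named aggregation builds the three output columns in one pass.
runs_output = (
    runs_input.groupby(run_id)
    .agg(StartDate=('date', 'first'),
         EndDate=('date', 'last'),
         groupby=('groupby', 'first'))
    .reset_index(drop=True)
)
print(runs_output)
```

Grouping on the run id rather than on the 'groupby' column itself is what keeps the two AAA runs apart; a plain groupby('groupby') would merge them into one row.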
You can use groupby and agg to find the min and max of the dates, then rename the columns to get the desired result.
import pandas as pd

df = pd.DataFrame({
    'date': ['2020-10-10', '2020-10-11', '2020-10-12', '2020-10-13', '2020-10-14'],
    'groupby': ['AAA', 'AAA', 'AAA', 'BBB', 'BBB'],
})
print(df)

# min/max date per group, then flatten the MultiIndex columns and rename them
df_result = (df.groupby('groupby').agg({'date': ['min', 'max']})
               .reset_index()
               .droplevel(0, axis=1)
               .rename(columns={'': 'groupby', 'min': 'StartDate', 'max': 'EndDate'}))
print(df_result)
The output is:

Original dataframe:
         date groupby
0  2020-10-10     AAA
1  2020-10-11     AAA
2  2020-10-12     AAA
3  2020-10-13     BBB
4  2020-10-14     BBB

New dataframe with start and end dates:
  groupby   StartDate     EndDate
0     AAA  2020-10-10  2020-10-12
1     BBB  2020-10-13  2020-10-14
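As an alternative sketch, pandas named aggregation (available since pandas 0.25) can produce the final column names directly, which avoids the droplevel and rename steps. This version also converts the strings to datetimes with pd.to_datetime first; that conversion is an added assumption, since the answer above keeps the dates as strings. The name df_named is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'date': ['2020-10-10', '2020-10-11', '2020-10-12', '2020-10-13', '2020-10-14'],
    'groupby': ['AAA', 'AAA', 'AAA', 'BBB', 'BBB'],
})

# Assumption: work with real datetimes so min/max compare dates, not strings.
df['date'] = pd.to_datetime(df['date'])

# Each keyword names an output column; the tuple is (source column, aggregation).
df_named = (
    df.groupby('groupby', as_index=False)
      .agg(StartDate=('date', 'min'), EndDate=('date', 'max'))
)
print(df_named)
```

With as_index=False the group key stays an ordinary column, so the result already has the groupby, StartDate and EndDate columns asked for in the question.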