Suppose this is my dataset:
Name date
A 15-01-01
A 15-01-31
A 16-02-02
A 16-03-04
B 17-04-05
B 17-05-08
B 17-07-09
C 18-01-02
C 18-02-03
I want to add each person's first and last date as new columns.
This is the result I want:
Name date startdate enddate
A 15-01-01 15-01-01 16-03-04
A 15-01-31 15-01-01 16-03-04
A 16-02-02 15-01-01 16-03-04
A 16-03-04 15-01-01 16-03-04
B 17-04-05 17-04-05 17-07-09
B 17-05-08 17-04-05 17-07-09
B 17-07-09 17-04-05 17-07-09
C 18-01-02 18-01-02 18-02-03
C 18-02-03 18-01-02 18-02-03
Is there a way to do this? I would appreciate it if you could let me know.
Thanks for reading.
Use GroupBy.transform with GroupBy.first or GroupBy.last and assign the results to new columns:
g = df.groupby('Name')['date']
df = df.assign(startdate = g.transform('first'), enddate = g.transform('last'))
print (df)
Name date startdate enddate
0 A 15-01-01 15-01-01 16-03-04
1 A 15-01-31 15-01-01 16-03-04
2 A 16-02-02 15-01-01 16-03-04
3 A 16-03-04 15-01-01 16-03-04
4 B 17-04-05 17-04-05 17-07-09
5 B 17-05-08 17-04-05 17-07-09
6 B 17-07-09 17-04-05 17-07-09
7 C 18-01-02 18-01-02 18-02-03
8 C 18-02-03 18-01-02 18-02-03
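Note that 'first' and 'last' pick rows by position within each group, not by value, so they only match the earliest and latest date when the rows are already ordered. A minimal sketch of sorting beforehand, using a small made-up frame (not the asker's data) and assuming zero-padded yy-mm-dd strings from the same century, which sort chronologically as plain text:

```python
import pandas as pd

# Hypothetical sample with the rows deliberately out of order.
df = pd.DataFrame({
    "Name": ["A", "A", "B", "B"],
    "date": ["16-03-04", "15-01-01", "17-07-09", "17-04-05"],
})

# Without sorting, 'first' is simply the first row seen in each group.
unsorted_start = df.groupby("Name")["date"].transform("first")

# Sort by Name and date so that first/last coincide with earliest/latest.
df_sorted = df.sort_values(["Name", "date"])
g = df_sorted.groupby("Name")["date"]
df_sorted = df_sorted.assign(
    startdate=g.transform("first"),
    enddate=g.transform("last"),
)
print(df_sorted)
```

If the data is already sorted per person, as in the question, the sort step is unnecessary.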
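Or, if you need the minimum and maximum dates regardless of row order, convert the column to datetimes and use GroupBy.min and GroupBy.max: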
df['date'] = pd.to_datetime(df['date'], format='%y-%m-%d')
g = df.groupby('Name')['date']
df = df.assign(startdate = g.transform('min'), enddate = g.transform('max'))
print (df)
Name date startdate enddate
0 A 2015-01-01 2015-01-01 2016-03-04
1 A 2015-01-31 2015-01-01 2016-03-04
2 A 2016-02-02 2015-01-01 2016-03-04
3 A 2016-03-04 2015-01-01 2016-03-04
4 B 2017-04-05 2017-04-05 2017-07-09
5 B 2017-05-08 2017-04-05 2017-07-09
6 B 2017-07-09 2017-04-05 2017-07-09
7 C 2018-01-02 2018-01-02 2018-02-03
8 C 2018-02-03 2018-01-02 2018-02-03
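The conversion changes how the dates are displayed (2015-01-01 instead of 15-01-01). If the original yy-mm-dd text is wanted in the new columns, one option is to parse only for the comparison and format the results back with Series.dt.strftime. A small sketch on made-up rows; the variable names are only for illustration:

```python
import pandas as pd

# Hypothetical sample; the first two rows are out of chronological order.
df = pd.DataFrame({
    "Name": ["A", "A", "B"],
    "date": ["15-01-31", "15-01-01", "17-04-05"],
})

# Parse into a separate Series so the original 'date' column keeps its text form.
parsed = pd.to_datetime(df["date"], format="%y-%m-%d")
g = parsed.groupby(df["Name"])

# min/max compare real timestamps; strftime turns them back into yy-mm-dd strings.
out = df.assign(
    startdate=g.transform("min").dt.strftime("%y-%m-%d"),
    enddate=g.transform("max").dt.strftime("%y-%m-%d"),
)
print(out)
```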
An alternative with GroupBy.agg and DataFrame.join:
df1 = df.groupby('Name').agg(startdate = ('date','first'), enddate = ('date','last'))
df = df.join(df1, on='Name')
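To see what the join does: the aggregation yields one row per Name, and joining on 'Name' broadcasts that row back to every matching row of the original frame. A self-contained sketch on shortened sample data (an assumption, not the full dataset from the question), compared against the transform approach:

```python
import pandas as pd

# Shortened, hypothetical version of the sample data.
df = pd.DataFrame({
    "Name": ["A", "A", "B", "B", "C"],
    "date": ["15-01-01", "16-03-04", "17-04-05", "17-07-09", "18-01-02"],
})

# One row per Name, built with named aggregation (needs pandas 0.25 or newer).
summary = df.groupby("Name").agg(
    startdate=("date", "first"),
    enddate=("date", "last"),
)

# Match each row's 'Name' against the index of the summary table.
joined = df.join(summary, on="Name")

# The transform-based version for comparison.
g = df.groupby("Name")["date"]
via_transform = df.assign(
    startdate=g.transform("first"),
    enddate=g.transform("last"),
)
print(joined)
```

The intermediate summary table is handy if the per-person start and end dates are also needed on their own.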
There is a way to do this with an explicit loop. I'm not completely sure about it, but it should work.
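import pandas as pd

df['date'] = pd.to_datetime(df['date'], format='%y-%m-%d')
start_date = {}  # earliest date seen so far, keyed by Name
end_date = {}    # latest date seen so far, keyed by Name
for name, d in zip(df['Name'], df['date']):
    if name not in start_date:
        # first row for this Name initialises both bounds
        start_date[name] = d
        end_date[name] = d
    elif d > end_date[name]:
        end_date[name] = d
    elif d < start_date[name]:
        start_date[name] = d
# look up each row's Name to fill the new columns
df['startdate'] = df['Name'].map(start_date)
df['enddate'] = df['Name'].map(end_date)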