我想在数据帧中添加一列,其中包含基于时间戳的值(注释(,每天分组。我按照下面的例子做的,但是。。。还有其他的";pandonic";方法也许是一艘班轮,或者至少离它很近?
示例数据帧(实际数据有更多的日期和更多不同的值(:
import pandas as pd
data = {"Values": ["absd","abse", "dara", "absd","abse", "dara"],
"Date": ["2022-05-25","2022-05-25","2022-05-25", "2022-05-26","2022-05-26","2022-05-26"],
"Timestamp": ["2022-05-25 08:00:00", "2022-05-25 11:30:00", "2022-05-25 20:25:00",
"2022-05-26 09:00:00", "2022-05-26 13:40:00", "2022-05-26 19:15:00"]}
df = pd.DataFrame(data)
df.Timestamp = pd.to_datetime(df.Timestamp, format='%Y-%m-%d %H:%M:%S')
df.Date = pd.to_datetime(df.Date, format='%Y-%m-%d')
df-out:
Values Date Timestamp
0 absd 2022-05-25 2022-05-25 08:00:00
1 abse 2022-05-25 2022-05-25 11:30:00
2 dara 2022-05-25 2022-05-25 20:25:00
3 absd 2022-05-26 2022-05-26 09:00:00
4 abse 2022-05-26 2022-05-26 13:40:00
5 dara 2022-05-26 2022-05-26 19:15:00
我想要的最终结果是:
Values Date Period Datetime
0 absd 2022-05-25 Start 2022-05-25 08:00:00
1 abse 2022-05-25 Start 2022-05-25 08:00:00
2 dara 2022-05-25 Start 2022-05-25 08:00:00
3 dara 2022-05-25 Mid 2022-05-25 11:30:00
4 abse 2022-05-25 Mid 2022-05-25 11:30:00
5 absd 2022-05-25 Mid 2022-05-25 11:30:00
6 dara 2022-05-25 End 2022-05-25 20:25:00
7 abse 2022-05-25 End 2022-05-25 20:25:00
8 absd 2022-05-25 End 2022-05-25 20:25:00
9 dara 2022-05-26 Start 2022-05-26 09:00:00
10 abse 2022-05-26 Start 2022-05-26 09:00:00
11 absd 2022-05-26 Start 2022-05-26 09:00:00
12 absd 2022-05-26 Mid 2022-05-26 13:40:00
13 abse 2022-05-26 Mid 2022-05-26 13:40:00
14 dara 2022-05-26 Mid 2022-05-26 13:40:00
15 absd 2022-05-26 End 2022-05-26 19:15:00
16 abse 2022-05-26 End 2022-05-26 19:15:00
17 dara 2022-05-26 End 2022-05-26 19:15:00
我的工作方法如下:
df["Start"] = df["Timestamp"].groupby(df["Date"]).transform("min")
df["End"] = df["Timestamp"].groupby(df["Date"]).transform("max")
df["Mid"] = df["Timestamp"].groupby(df["Date"]).transform("median")
df1 = df.melt(id_vars = ["Values","Date"],
var_name="Period",value_name="Datetime").sort_values("Datetime")
df1 = df1[df1.Period != "Timestamp"].reset_index(drop=True)
从最终结果数据帧来看,似乎需要所有列的组合(好吧,Values
列和(‘Date’,Timestamp’(列的组合(。
一个选项是完整的pyjanitor:
# pip install pyjanitor
import pandas as pd
import janitor
(df
.assign(Period = ['Start', 'Mid', 'End'] * 2)
.complete(('Date', 'Timestamp', 'Period'), 'Values')
)
Values Date Timestamp Period
0 absd 2022-05-25 2022-05-25 08:00:00 Start
1 abse 2022-05-25 2022-05-25 08:00:00 Start
2 dara 2022-05-25 2022-05-25 08:00:00 Start
3 absd 2022-05-25 2022-05-25 11:30:00 Mid
4 abse 2022-05-25 2022-05-25 11:30:00 Mid
5 dara 2022-05-25 2022-05-25 11:30:00 Mid
6 absd 2022-05-25 2022-05-25 20:25:00 End
7 abse 2022-05-25 2022-05-25 20:25:00 End
8 dara 2022-05-25 2022-05-25 20:25:00 End
9 absd 2022-05-26 2022-05-26 09:00:00 Start
10 abse 2022-05-26 2022-05-26 09:00:00 Start
11 dara 2022-05-26 2022-05-26 09:00:00 Start
12 absd 2022-05-26 2022-05-26 13:40:00 Mid
13 abse 2022-05-26 2022-05-26 13:40:00 Mid
14 dara 2022-05-26 2022-05-26 13:40:00 Mid
15 absd 2022-05-26 2022-05-26 19:15:00 End
16 abse 2022-05-26 2022-05-26 19:15:00 End
17 dara 2022-05-26 2022-05-26 19:15:00 End
仅使用熊猫:
(
df['Timestamp'].groupby(df['Date']).agg(['min','median','max']).merge(df, on='Date')
.melt(id_vars=['Values','Date'], var_name='Period', value_name='Datetime')
.query('Period!="Timestamp"')
.sort_values('Datetime')
)
输出:
Values Date Period Datetime
0 absd 2022-05-25 min 2022-05-25 08:00:00
1 abse 2022-05-25 min 2022-05-25 08:00:00
2 dara 2022-05-25 min 2022-05-25 08:00:00
7 abse 2022-05-25 median 2022-05-25 11:30:00
6 absd 2022-05-25 median 2022-05-25 11:30:00
8 dara 2022-05-25 median 2022-05-25 11:30:00
12 absd 2022-05-25 max 2022-05-25 20:25:00
13 abse 2022-05-25 max 2022-05-25 20:25:00
14 dara 2022-05-25 max 2022-05-25 20:25:00
4 abse 2022-05-26 min 2022-05-26 09:00:00
3 absd 2022-05-26 min 2022-05-26 09:00:00
5 dara 2022-05-26 min 2022-05-26 09:00:00
9 absd 2022-05-26 median 2022-05-26 13:40:00
10 abse 2022-05-26 median 2022-05-26 13:40:00
11 dara 2022-05-26 median 2022-05-26 13:40:00
16 abse 2022-05-26 max 2022-05-26 19:15:00
15 absd 2022-05-26 max 2022-05-26 19:15:00
17 dara 2022-05-26 max 2022-05-26 19:15:00
另一种仅限熊猫的方法:
out = (df.groupby('Date')
.agg({'Timestamp':['min', 'median', 'max'], 'Values':list})
.explode(('Values', 'list'))
.droplevel(0, axis=1)
.rename(columns={'list':'Values'})
.reset_index()
.melt(['Values', 'Date'], var_name='Period', value_name='Datetime')
.sort_values('Datetime', ignore_index=True))
print(out)
输出:
Values Date Period Datetime
0 absd 2022-05-25 min 2022-05-25 08:00:00
1 abse 2022-05-25 min 2022-05-25 08:00:00
2 dara 2022-05-25 min 2022-05-25 08:00:00
3 abse 2022-05-25 median 2022-05-25 11:30:00
4 absd 2022-05-25 median 2022-05-25 11:30:00
5 dara 2022-05-25 median 2022-05-25 11:30:00
6 absd 2022-05-25 max 2022-05-25 20:25:00
7 abse 2022-05-25 max 2022-05-25 20:25:00
8 dara 2022-05-25 max 2022-05-25 20:25:00
9 abse 2022-05-26 min 2022-05-26 09:00:00
10 absd 2022-05-26 min 2022-05-26 09:00:00
11 dara 2022-05-26 min 2022-05-26 09:00:00
12 absd 2022-05-26 median 2022-05-26 13:40:00
13 abse 2022-05-26 median 2022-05-26 13:40:00
14 dara 2022-05-26 median 2022-05-26 13:40:00
15 abse 2022-05-26 max 2022-05-26 19:15:00
16 absd 2022-05-26 max 2022-05-26 19:15:00
17 dara 2022-05-26 max 2022-05-26 19:15:00