熊猫一组一组地融化掉



我想在数据帧中添加一列,其中包含基于时间戳的值(注释(,每天分组。我按照下面的例子做的,但是。。。还有其他的";pandonic";方法也许是一艘班轮,或者至少离它很近?

示例数据帧(实际数据有更多的日期和更多不同的值(:

import pandas as pd
data = {"Values": ["absd","abse", "dara", "absd","abse", "dara"],
"Date": ["2022-05-25","2022-05-25","2022-05-25", "2022-05-26","2022-05-26","2022-05-26"],
"Timestamp": ["2022-05-25 08:00:00", "2022-05-25 11:30:00", "2022-05-25 20:25:00", 
"2022-05-26 09:00:00", "2022-05-26 13:40:00", "2022-05-26 19:15:00"]}
df = pd.DataFrame(data)
df.Timestamp = pd.to_datetime(df.Timestamp, format='%Y-%m-%d %H:%M:%S')
df.Date = pd.to_datetime(df.Date, format='%Y-%m-%d')

df-out:

Values       Date           Timestamp
0   absd 2022-05-25 2022-05-25 08:00:00
1   abse 2022-05-25 2022-05-25 11:30:00
2   dara 2022-05-25 2022-05-25 20:25:00
3   absd 2022-05-26 2022-05-26 09:00:00
4   abse 2022-05-26 2022-05-26 13:40:00
5   dara 2022-05-26 2022-05-26 19:15:00

我想要的最终结果是:

Values       Date Period            Datetime
0    absd 2022-05-25  Start 2022-05-25 08:00:00
1    abse 2022-05-25  Start 2022-05-25 08:00:00
2    dara 2022-05-25  Start 2022-05-25 08:00:00
3    dara 2022-05-25    Mid 2022-05-25 11:30:00
4    abse 2022-05-25    Mid 2022-05-25 11:30:00
5    absd 2022-05-25    Mid 2022-05-25 11:30:00
6    dara 2022-05-25    End 2022-05-25 20:25:00
7    abse 2022-05-25    End 2022-05-25 20:25:00
8    absd 2022-05-25    End 2022-05-25 20:25:00
9    dara 2022-05-26  Start 2022-05-26 09:00:00
10   abse 2022-05-26  Start 2022-05-26 09:00:00
11   absd 2022-05-26  Start 2022-05-26 09:00:00
12   absd 2022-05-26    Mid 2022-05-26 13:40:00
13   abse 2022-05-26    Mid 2022-05-26 13:40:00
14   dara 2022-05-26    Mid 2022-05-26 13:40:00
15   absd 2022-05-26    End 2022-05-26 19:15:00
16   abse 2022-05-26    End 2022-05-26 19:15:00
17   dara 2022-05-26    End 2022-05-26 19:15:00

我的工作方法如下:

df["Start"] = df["Timestamp"].groupby(df["Date"]).transform("min")
df["End"] = df["Timestamp"].groupby(df["Date"]).transform("max")
df["Mid"] = df["Timestamp"].groupby(df["Date"]).transform("median")
df1 = df.melt(id_vars = ["Values","Date"],
var_name="Period",value_name="Datetime").sort_values("Datetime")
df1 = df1[df1.Period != "Timestamp"].reset_index(drop=True)

从最终结果数据帧来看,似乎需要所有列的组合(好吧,Values列和(‘Date’,Timestamp’(列的组合(。

一个选项是完整的pyjanitor:

# pip install pyjanitor
import pandas as pd
import janitor
(df
.assign(Period = ['Start', 'Mid', 'End'] * 2)
.complete(('Date', 'Timestamp', 'Period'), 'Values')
)
Values       Date           Timestamp Period
0    absd 2022-05-25 2022-05-25 08:00:00  Start
1    abse 2022-05-25 2022-05-25 08:00:00  Start
2    dara 2022-05-25 2022-05-25 08:00:00  Start
3    absd 2022-05-25 2022-05-25 11:30:00    Mid
4    abse 2022-05-25 2022-05-25 11:30:00    Mid
5    dara 2022-05-25 2022-05-25 11:30:00    Mid
6    absd 2022-05-25 2022-05-25 20:25:00    End
7    abse 2022-05-25 2022-05-25 20:25:00    End
8    dara 2022-05-25 2022-05-25 20:25:00    End
9    absd 2022-05-26 2022-05-26 09:00:00  Start
10   abse 2022-05-26 2022-05-26 09:00:00  Start
11   dara 2022-05-26 2022-05-26 09:00:00  Start
12   absd 2022-05-26 2022-05-26 13:40:00    Mid
13   abse 2022-05-26 2022-05-26 13:40:00    Mid
14   dara 2022-05-26 2022-05-26 13:40:00    Mid
15   absd 2022-05-26 2022-05-26 19:15:00    End
16   abse 2022-05-26 2022-05-26 19:15:00    End
17   dara 2022-05-26 2022-05-26 19:15:00    End

仅使用熊猫:

(
df['Timestamp'].groupby(df['Date']).agg(['min','median','max']).merge(df, on='Date')
.melt(id_vars=['Values','Date'], var_name='Period', value_name='Datetime')
.query('Period!="Timestamp"')
.sort_values('Datetime')
)

输出:

Values       Date  Period            Datetime
0    absd 2022-05-25     min 2022-05-25 08:00:00
1    abse 2022-05-25     min 2022-05-25 08:00:00
2    dara 2022-05-25     min 2022-05-25 08:00:00
7    abse 2022-05-25  median 2022-05-25 11:30:00
6    absd 2022-05-25  median 2022-05-25 11:30:00
8    dara 2022-05-25  median 2022-05-25 11:30:00
12   absd 2022-05-25     max 2022-05-25 20:25:00
13   abse 2022-05-25     max 2022-05-25 20:25:00
14   dara 2022-05-25     max 2022-05-25 20:25:00
4    abse 2022-05-26     min 2022-05-26 09:00:00
3    absd 2022-05-26     min 2022-05-26 09:00:00
5    dara 2022-05-26     min 2022-05-26 09:00:00
9    absd 2022-05-26  median 2022-05-26 13:40:00
10   abse 2022-05-26  median 2022-05-26 13:40:00
11   dara 2022-05-26  median 2022-05-26 13:40:00
16   abse 2022-05-26     max 2022-05-26 19:15:00
15   absd 2022-05-26     max 2022-05-26 19:15:00
17   dara 2022-05-26     max 2022-05-26 19:15:00

另一种仅限熊猫的方法:

out = (df.groupby('Date')
.agg({'Timestamp':['min', 'median', 'max'], 'Values':list})
.explode(('Values', 'list'))
.droplevel(0, axis=1)
.rename(columns={'list':'Values'})
.reset_index()
.melt(['Values', 'Date'], var_name='Period', value_name='Datetime')
.sort_values('Datetime', ignore_index=True))
print(out)

输出:

Values       Date  Period            Datetime
0    absd 2022-05-25     min 2022-05-25 08:00:00
1    abse 2022-05-25     min 2022-05-25 08:00:00
2    dara 2022-05-25     min 2022-05-25 08:00:00
3    abse 2022-05-25  median 2022-05-25 11:30:00
4    absd 2022-05-25  median 2022-05-25 11:30:00
5    dara 2022-05-25  median 2022-05-25 11:30:00
6    absd 2022-05-25     max 2022-05-25 20:25:00
7    abse 2022-05-25     max 2022-05-25 20:25:00
8    dara 2022-05-25     max 2022-05-25 20:25:00
9    abse 2022-05-26     min 2022-05-26 09:00:00
10   absd 2022-05-26     min 2022-05-26 09:00:00
11   dara 2022-05-26     min 2022-05-26 09:00:00
12   absd 2022-05-26  median 2022-05-26 13:40:00
13   abse 2022-05-26  median 2022-05-26 13:40:00
14   dara 2022-05-26  median 2022-05-26 13:40:00
15   abse 2022-05-26     max 2022-05-26 19:15:00
16   absd 2022-05-26     max 2022-05-26 19:15:00
17   dara 2022-05-26     max 2022-05-26 19:15:00

最新更新