我有下面的表
时间戳 | 名称 | 标记主题 | |
---|---|---|---|
2022-01-01 00:00:02.969 | Chris | 70DK | |
2022-01-01 00:00:04.467 | Chris | 75 | DK |
2022-01-01 00:00:05.965 | 标记 | 80DK | |
2022-01-01 00:00:08.962 | 古巴 | 60 | DK|
2022-01-01 00:00:10.461 | 古巴 | 58 | 丹麦 |
您可以尝试以下操作:
rule = "10min"
result = (
df.set_index("TimeStamp").groupby(["Name", "Subject"])
.resample(rule)
.agg(Min=("Marks", "min"), Max=("Marks", "max"), STD=("Marks", "std"))
.unstack(0)
.swaplevel(0, 1).reset_index()
)
- 首先将
TimeStamp
设置为索引,并按Subject
和Name
进行分组,以获得正确的块 - 然后CCD_ 4以给定的频率CCD_
- 然后通过使用具有命名元组的
.agg()
来聚合所需的统计信息 - 取消对第一个索引级别(
Name
(的打包以将其放入列中 - 最后重置索引时,交换剩余的索引级别以获得正确的顺序
给定样本的结果:
TimeStamp Subject Min Max STD
Name Chris Cuban Mark Chris Cuban Mark Chris Cuban Mark
0 2022-01-01 DK 70 58 80 75 60 80 3.535534 1.414214 NaN
如果您希望列与预期输出完全相同,那么您可以添加以下
result = result[
list(result.columns[:2]) + sorted(result.columns[2:], key=lambda c: c[1])
]
result.columns = [f"{lev1}_{lev0}" if lev1 else lev0 for lev0, lev1 in result.columns]
获取
TimeStamp Subject Chris_Min Chris_Max ... Cuban_STD Mark_Min Mark_Max Mark_STD
0 2022-01-01 DK 70 75 ... 1.414214 80 80 NaN
如果您收到TypeError: aggregate() missing 1 required positional argument...
错误(注释已消失(,则可能是您使用的是无法处理命名元组的旧Pandas版本。您可以尝试以下方法:
rule = "10min"
result = (
df.set_index("TimeStamp").groupby(["Name", "Subject"])
.resample(rule)
.agg({"Marks": ["min", "max", "std"]})
.droplevel(0, axis=1)
.unstack(0)
.swaplevel(0, 1).reset_index()
)
...
您的表是pandas数据帧吗?如果是熊猫数据帧,可以使用重采样:
# only if timestamp is not the index yet:
df = df.set_index('TimeStamp')
# the important part, you can use any function in agg or some str for simple
# functions like mean:
df = df.resample('10Min').agg('max','min')
# only if you had to set index to timestamp and want to go back to normal index:
df = df.reset_index()
编辑以获取函数中的第二个表:
# choose aggregation function
agg_functions = ['min', 'max', 'std']
# set_index on time column, resample
resampled_df = df.set_index('TimeStamp').resample('10Min').agg(agg_functions)
# flatten multiindex
resampled_df.columns = resampled_df.columns.map('_'.join)
# drop time column
resampled_df = resampled_df.reset_index(drop=True)
# concatenate with original df
pd.concat([df, resampled_df], axis=1)