如何用Python将数据处理成20分钟的聚合

我有下面的表

标记7080DK

	时间戳	名称	主题
2022-01-01 00:00:02.969	Chris	DK
2022-01-01 00:00:04.467	Chris	75	DK
2022-01-01 00:00:05.965	标记	DK
2022-01-01 00:00:08.962	古巴	60
2022-01-01 00:00:10.461	古巴	58	丹麦

您可以尝试以下操作：

rule = "10min"
result = (
df.set_index("TimeStamp").groupby(["Name", "Subject"])
.resample(rule)
.agg(Min=("Marks", "min"), Max=("Marks", "max"), STD=("Marks", "std"))
.unstack(0)
.swaplevel(0, 1).reset_index()
)

首先将TimeStamp设置为索引，并按Subject和Name进行分组，以获得正确的块
然后CCD_ 4以给定的频率CCD_
然后通过使用具有命名元组的.agg()来聚合所需的统计信息
取消对第一个索引级别(Name(的打包以将其放入列中
最后重置索引时，交换剩余的索引级别以获得正确的顺序

给定样本的结果：

TimeStamp Subject   Min              Max                  STD               
Name                    Chris Cuban Mark Chris Cuban Mark     Chris     Cuban Mark
0    2022-01-01      DK    70    58   80    75    60   80  3.535534  1.414214  NaN

如果您希望列与预期输出完全相同，那么您可以添加以下

result = result[
list(result.columns[:2]) + sorted(result.columns[2:], key=lambda c: c[1])
]
result.columns = [f"{lev1}_{lev0}" if lev1 else lev0 for lev0, lev1 in result.columns]

获取

TimeStamp Subject  Chris_Min  Chris_Max  ...  Cuban_STD  Mark_Min  Mark_Max  Mark_STD
0 2022-01-01      DK         70         75  ...   1.414214        80        80       NaN

如果您收到TypeError: aggregate() missing 1 required positional argument...错误(注释已消失(，则可能是您使用的是无法处理命名元组的旧Pandas版本。您可以尝试以下方法：

rule = "10min"
result = (
df.set_index("TimeStamp").groupby(["Name", "Subject"])
.resample(rule)
.agg({"Marks": ["min", "max", "std"]})
.droplevel(0, axis=1)
.unstack(0)
.swaplevel(0, 1).reset_index()
)
...

您的表是pandas数据帧吗？如果是熊猫数据帧，可以使用重采样：

# only if timestamp is not the index yet:
df = df.set_index('TimeStamp') 
# the important part, you can use any function in agg or some str for simple 
# functions like mean:
df = df.resample('10Min').agg('max','min') 
# only if you had to set index to timestamp and want to go back to normal index:
df = df.reset_index()

编辑以获取函数中的第二个表：

# choose aggregation function
agg_functions = ['min', 'max', 'std']
# set_index on time column, resample
resampled_df = df.set_index('TimeStamp').resample('10Min').agg(agg_functions)
# flatten multiindex
resampled_df.columns =  resampled_df.columns.map('_'.join)
# drop time column 
resampled_df = resampled_df.reset_index(drop=True)
# concatenate with original df
pd.concat([df, resampled_df], axis=1)

相关内容

最新更新

热门标签：