Pandas: getting a daily describe of a DataFrame

I have a DataFrame that looks like this:

        provider    timestamp                   vehicle_id
id
103107  a           2019-09-11 20:05:47+02:00   x
1192195 b           2019-09-11 00:02:46+02:00   y
434508  c           2019-09-11 00:32:39+02:00   z
530388  c           2019-09-11 08:12:56+02:00   z
1773721 b           2019-09-11 20:02:55+02:00   w
...

I would like to get some statistics about the number of distinct vehicle_ids per day. I have this, which would let me build a describe by hand:

df.groupby(['provider', df['timestamp'].dt.strftime('%Y-%m-%d')])[['vehicle_id']].nunique()

                        vehicle_id
provider    timestamp
a           2019-09-11  1224
            2019-09-12  1054
b           2019-09-11  2859
            2019-09-12  2761
            2019-09-17  700

How can I arrange the data so that I get the min/max/mean across providers for each day? I'm a bit lost, and any help is much appreciated.

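Try groupby().agg(), where new_df is the result of the groupby in your question:

# a list (not a set) keeps the output columns in a fixed order
new_df.groupby('timestamp').vehicle_id.agg(['min', 'max', 'mean'])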

Note: since you only care about one column of the original data, you can select a Series rather than a DataFrame in the first groupby, i.e.

# note the number of [] around 'vehicle_id'
new_df = (df.groupby(['provider',
                      df['timestamp'].dt.strftime('%Y-%m-%d')])
            ['vehicle_id'].nunique()
         )

Then new_df is a Series named vehicle_id, and the next command is simply

# note the difference before .agg
new_df.groupby('timestamp').agg(['min', 'max', 'mean'])
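
To make the two steps concrete, here is a minimal self-contained sketch. The eight-row DataFrame is made-up illustration data (not the asker's), and `level='timestamp'` is used only to make explicit that the second groupby works on the index level produced by the first one.

```python
import pandas as pd

# Hypothetical sample data: two providers over two days
rows = [
    ("a", "2019-09-11 08:00", "v1"),
    ("a", "2019-09-11 09:00", "v2"),
    ("a", "2019-09-12 08:00", "v1"),
    ("b", "2019-09-11 10:00", "v3"),
    ("b", "2019-09-12 10:00", "v3"),
    ("b", "2019-09-12 11:00", "v4"),
    ("b", "2019-09-12 12:00", "v5"),
    ("b", "2019-09-12 13:00", "v5"),  # repeated vehicle, counted once by nunique
]
df = pd.DataFrame(rows, columns=["provider", "timestamp", "vehicle_id"])
df["timestamp"] = pd.to_datetime(df["timestamp"])

# Step 1: unique vehicles per provider and day (a Series with a 2-level index)
day = df["timestamp"].dt.strftime("%Y-%m-%d")
new_df = df.groupby(["provider", day])["vehicle_id"].nunique()

# Step 2: aggregate those counts across providers for each day
daily_stats = new_df.groupby(level="timestamp").agg(["min", "max", "mean"])
print(daily_stats)
```

Each row of `daily_stats` is one day, and each column summarises the per-provider counts for that day.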

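Try this:

# grouped_df is the result of the first groupby (unique vehicle_id counts per provider and day)
aggregations = ['mean', 'min', 'max', 'std']
result = grouped_df.groupby('timestamp')['vehicle_id'].agg(aggregations)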

Note: you may need to flatten the column index first:

grouped_df.columns = [col[1] if col[1] != '' else col[0] for col in grouped_df.columns]
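
For context, flattening only matters if grouped_df ended up with two-level column labels, for example after an agg with a dict of lists followed by reset_index(). The frame below is a hand-built stand-in for that situation (the column tuples are an assumption, not taken from the question), just to show what the comprehension does.

```python
import pandas as pd

# Hypothetical frame with two-level column labels
grouped_df = pd.DataFrame(
    [["a", "2019-09-11", 1], ["b", "2019-09-11", 2]],
    columns=pd.MultiIndex.from_tuples(
        [("provider", ""), ("timestamp", ""), ("vehicle_id", "nunique")]
    ),
)

# Keep the second level when it is non-empty, otherwise fall back to the first
grouped_df.columns = [col[1] if col[1] != "" else col[0] for col in grouped_df.columns]
print(list(grouped_df.columns))
```

Note that with labels like these the count column comes out named after the aggregation ('nunique' here), so the column selected in the following groupby would need to use that name.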

If I understand your question correctly, all you need to do is:

(df.groupby(['provider', df['timestamp'].dt.strftime('%Y-%m-%d')])[['vehicle_id']].nunique()
   .groupby('timestamp')['vehicle_id'].describe())

The first groupby gives you a DataFrame with the number of unique vehicle_ids per provider and day. For the sample data provided, it is:

                     vehicle_id
provider timestamp             
a        2019-09-11           1
b        2019-09-11           2
c        2019-09-11           1

The second groupby then computes the statistics for each day. So the result will be

            count      mean      std  min  25%  50%  75%  max
timestamp                                                    
2019-09-11    3.0  1.333333  0.57735  1.0  1.0  1.0  1.5  2.0
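
For anyone who wants to run this end to end, here is a sketch that rebuilds the five sample rows from the question (the construction of df is my own reconstruction of that printout) and applies the chained groupby.

```python
import pandas as pd

# The five sample rows shown in the question
df = pd.DataFrame(
    {
        "provider": ["a", "b", "c", "c", "b"],
        "timestamp": pd.to_datetime(
            [
                "2019-09-11 20:05:47+02:00",
                "2019-09-11 00:02:46+02:00",
                "2019-09-11 00:32:39+02:00",
                "2019-09-11 08:12:56+02:00",
                "2019-09-11 20:02:55+02:00",
            ]
        ),
        "vehicle_id": ["x", "y", "z", "z", "w"],
    },
    index=pd.Index([103107, 1192195, 434508, 530388, 1773721], name="id"),
)

# Unique vehicles per provider and day, then describe() across providers per day
day = df["timestamp"].dt.strftime("%Y-%m-%d")
counts = df.groupby(["provider", day])[["vehicle_id"]].nunique()
stats = counts.groupby("timestamp")["vehicle_id"].describe()
print(stats)
```

With more than one day in the data, stats gets one row per day.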
