Pandas: how to count the frequency of a pattern in a DataFrame

I have this sample DataFrame:

ID,Action,Station
01,P,S1
01,R,S2
01,P,S1
01,R,S2
02,P,S2
02,R,S1
02,P,S2
02,R,S1
03,P,S2
03,R,S1
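
To make the question reproducible, here is a minimal sketch that loads the sample above into a DataFrame. The names `CSV_TEXT` and `df` are only illustrative, and it assumes the data is available as CSV text.

```python
import io

import pandas as pd

# Sample data copied from the question.
CSV_TEXT = """ID,Action,Station
01,P,S1
01,R,S2
01,P,S1
01,R,S2
02,P,S2
02,R,S1
02,P,S2
02,R,S1
03,P,S2
03,R,S1
"""

# With default settings read_csv parses the ID column as integers,
# so "01" becomes 1. Pass dtype={'ID': str} to keep the leading zero.
df = pd.read_csv(io.StringIO(CSV_TEXT))
print(df)
```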

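My goal is to count the occurrences of a pattern in the Action and Station columns, namely ordered pairs such as (P, R) together with their corresponding Station values, so that the resulting DataFrame looks like this: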

S1,S2,2
S2,S1,3

So the pattern to find is the (P, R) tuple within each ID (ID values can be repeated), and the tuples should be counted by their Station values.

My attempt so far groups by Station and ID and takes the size of each group:

g = df.groupby(['Station','ID'])['Action'].size()

which gives:

Station  ID
S1       1     2
         2     2
         3     1
S2       1     2
         2     2
         3     1
Name: Action, dtype: int64

But I still can't figure out how to handle the (P, R) tuples and their frequencies.

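Define a counter for the row pairs within each ID. Then bring each P and R together by merging the frame with itself, with Action mapped P -> R and R -> P in one of the two frames. Drop duplicates, because the second row of each pair is redundant, and then take the group sizes.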

This only works because, within each ID, P and R always occur in pairs, one row right after the other.

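# Number consecutive row pairs within each ID: rows 0-1 -> 0, rows 2-3 -> 1, ...
df['idx'] = df.groupby('ID').cumcount() // 2

# Self-merge against a copy with Action swapped, so each P row meets its R row.
m = (df.merge(df.assign(Action=df.Action.map({'P': 'R', 'R': 'P'})),
              on=['ID', 'idx', 'Action'], suffixes=['_P', '_R'])
       .drop_duplicates(['ID', 'idx']))
m.groupby(['Station_P', 'Station_R']).size()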

Station_P  Station_R
S1         S2           2
S2         S1           3
dtype: int64

For reference, m looks like this:

   ID Action Station_P  idx Station_R
0   1      P        S1    0        S2
2   1      P        S1    1        S2
4   2      P        S2    0        S1
6   2      P        S2    1        S1
8   3      P        S2    0        S1
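
The caveat above (P and R must come in adjacent pairs within each ID) can be checked before relying on the merge. The following is a minimal sketch of such a check; the helper name `pairs_are_well_formed` is hypothetical and not part of pandas, and the frame is rebuilt inline so the snippet runs on its own.

```python
import pandas as pd

# Sample data from the question (ID already parsed as integers).
df = pd.DataFrame({
    'ID': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3],
    'Action': list('PRPRPRPRPR'),
    'Station': ['S1', 'S2', 'S1', 'S2', 'S2', 'S1', 'S2', 'S1', 'S2', 'S1'],
})


def pairs_are_well_formed(frame: pd.DataFrame) -> bool:
    """Hypothetical helper: True if every ID alternates P, R, P, R, ... and ends on R."""
    # Position of each row within its ID: 0, 1, 2, ...
    pos = frame.groupby('ID').cumcount()
    # Even positions should hold 'P', odd positions 'R'.
    expected = pos.mod(2).map({0: 'P', 1: 'R'})
    alternates = frame['Action'].eq(expected).all()
    # Every ID needs an even number of rows, otherwise a P is left unpaired.
    even_sizes = frame.groupby('ID').size().mod(2).eq(0).all()
    return bool(alternates and even_sizes)


print(pairs_are_well_formed(df))
```

If the check fails, the `cumcount() // 2` counter would pair unrelated rows, so the data would need cleaning or a different pairing rule first.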

One approach is to group the P, R pairs using cumsum() and then use value_counts:

(df.assign(order=df.Action.eq('P')
                   .groupby(df['ID'])  # this might not be necessary
                   .cumsum())
   .groupby(['ID', 'order'])
   .Station.agg(tuple)
   .groupby('ID').value_counts()
)

Output:

ID  Station 
1   (S1, S2)    2
2   (S2, S1)    2
3   (S2, S1)    1
Name: Station, dtype: int64
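
The output above is broken down per ID, whereas the question asks for totals across all IDs. As a sketch of one way to get there with the same cumsum() idea, the final per-ID grouping can be dropped so that identical tuples are counted over the whole frame. The names `order`, `pairs` and `totals` are illustrative, and the snippet rebuilds the sample frame so it runs on its own.

```python
import pandas as pd

# Sample data from the question (ID already parsed as integers).
df = pd.DataFrame({
    'ID': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3],
    'Action': list('PRPRPRPRPR'),
    'Station': ['S1', 'S2', 'S1', 'S2', 'S2', 'S1', 'S2', 'S1', 'S2', 'S1'],
})

# Number the pairs within each ID: every 'P' starts a new pair.
order = df['Action'].eq('P').groupby(df['ID']).cumsum()

# Collect the Station values of each (ID, pair) into a tuple.
pairs = df.assign(order=order).groupby(['ID', 'order'])['Station'].agg(tuple)

# Count identical tuples across all IDs instead of per ID.
totals = pairs.value_counts()
print(totals)
```

On the sample data this counts (S1, S2) twice and (S2, S1) three times, which matches the totals requested in the question. Like the other approaches, it assumes each P is followed by its R within the same ID.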
