How to filter and group a pandas DataFrame to get counts for combinations of two columns

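Sorry, I could not fit the whole question concisely into the title, and please excuse my English. I will illustrate my problem with an example.

Suppose I have this dataset: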

import numpy as np
import pandas as pd

# np.array coerces every value to a string, so ID, Name, and slot are stored as strings here.
dff = pd.DataFrame(np.array([["2020-11-13", 0, 3, 4], ["2020-10-11", 1, 3, 4], ["2020-11-13", 2, 1, 4],
    ["2020-11-14", 0, 3, 4], ["2020-11-13", 1, 5, 4],
    ["2020-11-14", 2, 2, 4], ["2020-11-12", 1, 1, 4], ["2020-11-14", 1, 2, 4], ["2020-11-15", 2, 5, 4],
    ["2020-11-11", 0, 1, 2], ["2020-11-15", 1, 1, 2],
    ["2020-11-18", 1, 2, 4], ["2020-11-17", 0, 1, 2], ["2020-11-20", 0, 3, 4]]), columns=["Timestamp", "ID", "Name", "slot"])

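I want a count for each combination of Name and slot, but multiple time-series rows for the same ID should be counted only once. For example, if I simply group by Name and slot, I get: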

dff.groupby(['Name', "slot"]).Timestamp.count().reset_index(name="count")

Name slot count
1   2   3
1   4   2
2   4   3
3   4   4
5   4   2

However, the combination Name == 1 and slot == 2 occurs twice for ID == 0, so I want its count to be 2 rather than 3.

This is the table I would ideally like to get:

Name slot count
1   2   2
1   4   2
2   4   2
3   4   2
5   4   2
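
To see which rows inflate the plain count, one option is to count rows per (Name, slot, ID). The sketch below rebuilds the sample data from a plain list of tuples, which is an assumption made here so that ID, Name, and slot stay integers; the names `rows`, `per_id`, and `repeated` are illustrative only.

```python
import pandas as pd

# Same sample rows as in the question, built from tuples so the numeric columns stay integers.
rows = [
    ("2020-11-13", 0, 3, 4), ("2020-10-11", 1, 3, 4), ("2020-11-13", 2, 1, 4),
    ("2020-11-14", 0, 3, 4), ("2020-11-13", 1, 5, 4), ("2020-11-14", 2, 2, 4),
    ("2020-11-12", 1, 1, 4), ("2020-11-14", 1, 2, 4), ("2020-11-15", 2, 5, 4),
    ("2020-11-11", 0, 1, 2), ("2020-11-15", 1, 1, 2), ("2020-11-18", 1, 2, 4),
    ("2020-11-17", 0, 1, 2), ("2020-11-20", 0, 3, 4),
]
dff = pd.DataFrame(rows, columns=["Timestamp", "ID", "Name", "slot"])

# Number of rows for each (Name, slot, ID) triple.
per_id = dff.groupby(["Name", "slot", "ID"]).size().reset_index(name="rows")

# Triples with more than one row are the IDs that repeat within a Name/slot combination.
repeated = per_id[per_id["rows"] > 1]
print(repeated)
```

Each triple listed in `repeated` should contribute 1 to the desired count, however many timestamps it has.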

I tried:

filter_one = dff.groupby(['ID']).Timestamp.transform("min")
dff1 = dff.loc[dff.Timestamp == filter_one]
dff1.groupby(['Name', "slot"]).Timestamp.count().reset_index(name="count")

But this gives me:

Name slot count
1   2   1
1   4   1
3   4   1
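
A likely reason for the short result is that filtering on the earliest timestamp per ID keeps only one row per ID in the whole frame. The sketch below checks this on the sample data, rebuilt from a plain list of tuples (an assumption made here; `rows`, `earliest`, and `kept` are illustrative names).

```python
import pandas as pd

# Same sample rows as in the question.
rows = [
    ("2020-11-13", 0, 3, 4), ("2020-10-11", 1, 3, 4), ("2020-11-13", 2, 1, 4),
    ("2020-11-14", 0, 3, 4), ("2020-11-13", 1, 5, 4), ("2020-11-14", 2, 2, 4),
    ("2020-11-12", 1, 1, 4), ("2020-11-14", 1, 2, 4), ("2020-11-15", 2, 5, 4),
    ("2020-11-11", 0, 1, 2), ("2020-11-15", 1, 1, 2), ("2020-11-18", 1, 2, 4),
    ("2020-11-17", 0, 1, 2), ("2020-11-20", 0, 3, 4),
]
dff = pd.DataFrame(rows, columns=["Timestamp", "ID", "Name", "slot"])

# Earliest timestamp of each ID, broadcast back onto every row of that ID.
# ISO-formatted date strings sort correctly as plain strings.
earliest = dff.groupby("ID")["Timestamp"].transform("min")

# Only the row holding each ID's earliest timestamp survives the filter.
kept = dff[dff["Timestamp"] == earliest]
print(kept)
```

With three distinct IDs, only three rows remain, so the later groupby on Name and slot cannot produce more than three rows in total.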

It also does not work if I drop the duplicate IDs.

Counting the distinct IDs within each Name and slot group with nunique gives the desired result:

x = dff.groupby(["Name", "slot"]).ID.nunique().reset_index(name="count")
print(x)

This prints:

Name slot  count
0    1    2      2
1    1    4      2
2    2    4      2
3    3    4      2
4    5    4      2
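
An alternative that should give the same counts is to drop duplicates on the (ID, Name, slot) triple rather than on ID alone, and then count the rows per group. This is a sketch on the sample data rebuilt from a plain list of tuples (an assumption made here; `rows`, `dedup`, and `counts` are illustrative names).

```python
import pandas as pd

# Same sample rows as in the question.
rows = [
    ("2020-11-13", 0, 3, 4), ("2020-10-11", 1, 3, 4), ("2020-11-13", 2, 1, 4),
    ("2020-11-14", 0, 3, 4), ("2020-11-13", 1, 5, 4), ("2020-11-14", 2, 2, 4),
    ("2020-11-12", 1, 1, 4), ("2020-11-14", 1, 2, 4), ("2020-11-15", 2, 5, 4),
    ("2020-11-11", 0, 1, 2), ("2020-11-15", 1, 1, 2), ("2020-11-18", 1, 2, 4),
    ("2020-11-17", 0, 1, 2), ("2020-11-20", 0, 3, 4),
]
dff = pd.DataFrame(rows, columns=["Timestamp", "ID", "Name", "slot"])

# Keep one row per (ID, Name, slot), ignoring the repeated timestamps.
dedup = dff.drop_duplicates(subset=["ID", "Name", "slot"])

# Each remaining row is one distinct ID within its (Name, slot) group.
counts = dedup.groupby(["Name", "slot"]).size().reset_index(name="count")
print(counts)
```

Dropping duplicates on ID alone keeps a single row per ID across the whole frame, which would explain why that attempt loses most of the combinations.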
