Why are columns not mentioned in groupby and sum dropped?

I have this dataframe:

InvoiceID   PaymentDate          TotalRevenue   Discount     Discount_Revenue
0   72A04E22    2020-07-03 17:25:13   1650000.0      0.0          1650000.0
1   54FCFCB9    2021-03-17 14:26:08   5500000.0      0.0          5500000.0
...

After the following aggregation, the PaymentDate column is dropped:

df.groupby(by=['InvoiceID'])[['TotalRevenue','Discount','Discount_Revenue']].sum().reset_index(drop=True, inplace=True)

How can I keep the columns that are not mentioned in the groupby or in the aggregation function?

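When you do a groupby with sum, you are aggregating data: from the multiple rows that share the same InvoiceID you keep only one row, whose values are the sums over those rows.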

Suppose this is the dataframe, with the same invoice appearing twice:

InvoiceID          PaymentDate  TotalRevenue  Discount  Discount_Revenue
0  72A04E22  2020-07-03 17:25:13     1650000.0       0.0         1650000.0
1  54FCFCB9  2021-03-17 14:26:08     5500000.0       0.0         5500000.0
2  54FCFCB9  2021-03-17 14:26:08     5500000.0       1.0         5500000.0

Then you can see the effect of summing, for example on Discount:

>>> df.groupby('InvoiceID')['Discount'].sum()
InvoiceID
54FCFCB9    1.0
72A04E22    0.0
Name: Discount, dtype: float64

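To answer your question specifically: the PaymentDate column is dropped because it is not among the columns you selected, so you never specified how to aggregate it.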

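For columns that make no sense to add up, such as PaymentDate, you need to define another aggregation function. Do you want to keep the first payment date? The last one?
Note that InvoiceID does not disappear in the example above; you deliberately drop it in your code by using .reset_index(drop=True).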
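To make the reset_index point concrete, here is a minimal sketch. The toy dataframe is an assumption that mirrors the three example rows above, and the variable names (summed, dropped, kept, scratch, result) are made up for illustration.

```python
import pandas as pd

# Toy data mirroring the example rows above (an assumption, for illustration only).
df = pd.DataFrame({
    "InvoiceID": ["72A04E22", "54FCFCB9", "54FCFCB9"],
    "PaymentDate": pd.to_datetime([
        "2020-07-03 17:25:13", "2021-03-17 14:26:08", "2021-03-17 14:26:08",
    ]),
    "TotalRevenue": [1650000.0, 5500000.0, 5500000.0],
    "Discount": [0.0, 0.0, 1.0],
    "Discount_Revenue": [1650000.0, 5500000.0, 5500000.0],
})

# After the groupby, InvoiceID is the index of the result.
summed = df.groupby("InvoiceID")[["TotalRevenue", "Discount", "Discount_Revenue"]].sum()

# drop=True throws the index (InvoiceID) away instead of making it a column.
dropped = summed.reset_index(drop=True)

# Without drop=True, InvoiceID comes back as a regular column.
kept = summed.reset_index()

# inplace=True modifies the object and returns None, so chaining it
# at the end of an expression leaves you with nothing to assign.
scratch = summed.copy()
result = scratch.reset_index(drop=True, inplace=True)

print(list(dropped.columns))
print(list(kept.columns))
print(result)
```

So the chained call in the question both discards InvoiceID and evaluates to None; assigning the result of a plain .reset_index() is probably what was intended.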

Suppose we choose to keep the last payment date, and use reset_index without drop=True to keep InvoiceID. Then we have:

>>> invoice_groups = df.groupby('InvoiceID')
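>>> # numeric_only=True keeps PaymentDate out of the sum so the join does not clash
>>> invoices = invoice_groups.sum(numeric_only=True).join(invoice_groups['PaymentDate'].max()).reset_index()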
>>> invoices
InvoiceID  TotalRevenue  Discount  Discount_Revenue         PaymentDate
0  54FCFCB9    11000000.0       1.0        11000000.0 2021-03-17 14:26:08
1  72A04E22     1650000.0       0.0         1650000.0 2020-07-03 17:25:13

That gives you all the columns, each aggregated in some way (sum or max) from the rows of the original dataframe.
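As an alternative sketch, the same per-column choices can be expressed in a single agg call by passing a dictionary that maps each column to its aggregation. The toy dataframe below is an assumption that mirrors the example rows above.

```python
import pandas as pd

# Toy data mirroring the example rows above (an assumption, for illustration only).
df = pd.DataFrame({
    "InvoiceID": ["72A04E22", "54FCFCB9", "54FCFCB9"],
    "PaymentDate": pd.to_datetime([
        "2020-07-03 17:25:13", "2021-03-17 14:26:08", "2021-03-17 14:26:08",
    ]),
    "TotalRevenue": [1650000.0, 5500000.0, 5500000.0],
    "Discount": [0.0, 0.0, 1.0],
    "Discount_Revenue": [1650000.0, 5500000.0, 5500000.0],
})

# One aggregation per column: sum the amounts, keep the latest payment date.
invoices = (
    df.groupby("InvoiceID")
      .agg({
          "TotalRevenue": "sum",
          "Discount": "sum",
          "Discount_Revenue": "sum",
          "PaymentDate": "max",
      })
      .reset_index()  # no drop=True, so InvoiceID is kept as a column
)

print(invoices)
```

Swapping "max" for "min", "first" or "last" changes which payment date is kept for each invoice.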
