Get the maximum value for each group when using groupby() on two columns (Python)



I am working with a CSV in a format like the one below (the full CSV is ~500 x ~600,000, so most columns are omitted here):

       Sales  market_id  product_id
0         38   10001516     1132679
1         49   10001516     1138767
2          6   10001516     1132679
     ...        ...         ...
9969  245732    1002123     1383020
9970  247093    1006821     1383020

and so on, which I read in like this: df0 = pd.read_csv('all_final_decomps2_small.csv', low_memory=False, encoding='iso8859_15')

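I am trying to find the product_id with the highest sales for each market_id. To do that, I need to sum the sales per product_id within each market, because the same product_id and market_id pair can appear on multiple rows.

I have tried this, which produces the total sales of each product in each market: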

df_sales=df0[['Sales','market_id','product_id']] 
df_sales.groupby(['market_id', 'product_id'])['Sales'].sum()

Like so (shortened):

market_id  product_id
1006174    1132679             2789
           1382460             4586
           1382691               49
           1383020        269138089
1006638    1132679          5143156
           1382460           387250
           1383020        204456809
10002899   1132679              630
           1382464              220

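Using:

df_sales.groupby(['market_id', 'product_id'])['Sales'].sum().max()

returns only the single largest of the sums and nothing else, so in this case it returns 269138089. I would like to get something like this instead: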

market_id  product_id      max_sales
1006174    1383020        269138089
1006638    1383020        204456809
10002899   1132679              630

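I have tried a lot of different things, but I cannot get anything to work for this example, so I would appreciate any help (and I apologise if this seems to have been asked before).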

I am using: Python 3.6.1 :: Anaconda 4.4.0 (64-bit)

Use idxmax within groupby

Setup

import pandas as pd
from io import StringIO
txt = """market_id  product_id         Sales
1006174    1132679             2789
1006174    1382460             4586
1006174    1382691               49
1006174    1383020        269138089
1006638    1132679          5143156
1006638    1382460           387250
1006638    1383020        204456809
10002899   1132679              630
10002899   1382464              220"""

# sep=r'\s+' splits on whitespace; selecting ['Sales'] yields a Series indexed by (market_id, product_id)
sales = pd.read_csv(StringIO(txt), sep=r'\s+', index_col=[0, 1])['Sales']

Solution

sales.loc[sales.groupby(level=0).idxmax()]
market_id  product_id
1006174    1383020       269138089
1006638    1383020       204456809
10002899   1132679             630
Name: Sales, dtype: int64

sales.loc[sales.groupby(level=0).idxmax()].reset_index(name='max_sales')
   market_id  product_id  max_sales
0    1006174     1383020  269138089
1    1006638     1383020  204456809
2   10002899     1132679        630
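
The setup above starts from totals that are already summed. As a minimal sketch of the same idea starting from raw rows, where a (market_id, product_id) pair can repeat, the sum and the idxmax step can be chained. The data below is made up for illustration, and the names `totals`, `best` and `result` are not from the original post.

```python
import pandas as pd

# Hypothetical raw rows: the same (market_id, product_id) pair can repeat.
df_sales = pd.DataFrame({
    'Sales':      [38, 49, 6, 100, 20],
    'market_id':  [10001516, 10001516, 10001516, 1002123, 1002123],
    'product_id': [1132679, 1138767, 1132679, 1383020, 1132679],
})

# Total sales per (market_id, product_id) pair.
totals = df_sales.groupby(['market_id', 'product_id'])['Sales'].sum()

# For each market, idxmax returns the (market_id, product_id) label
# of the largest total; .loc then picks exactly those entries.
best = totals.loc[totals.groupby(level='market_id').idxmax()]

# Turn the MultiIndex back into columns and name the value column.
result = best.reset_index(name='max_sales')
print(result)
```

Here product 1132679 appears twice in market 10001516 (38 + 6 = 44), which is still below the 49 of product 1138767, so the latter is the row kept for that market.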

I somehow managed to get this working. I am not sure it is the best way, but it works for my data:

df0=pd.read_csv('test.csv', low_memory=False, encoding='iso8859_15')
#Rank all items in each market by total sales
df_sales=df0[['Sales', 'market_id', 'product_id']] # int, int, int
# groups sales by market and product and sums product sales
gr_sales = df_sales.groupby(['market_id', 'product_id'], as_index = False).sum()
# gets the product sales in each market and sorts in order of decreasing sales
gr_sales = gr_sales.groupby('market_id').apply(pd.DataFrame.sort_values, 'Sales', ascending = False)
# Takes the per-market maximum of each column (note: max() treats product_id and Sales independently)
max_sales = gr_sales.groupby('market_id').max() 

This gives me:

In[621]: max_sales
Out[621]: 
    market_id  product_id       Sales
0     1006174     1383020   269138089
1     1006638     1383020  1330070614
2     1006678     1383020    58548417
3     1006684     1383020   215858049
4     1006692     1383020    21799689
5     1006732     1383020    58548417
6     1006733     1383020    58548417
7     1006739     1383020   215858049
8     1006819     1383020   605951504
9     1006820     1383020    59083807
10    1006821     1383020    25116872
11    1050511     1382672     6201692
12    1050512     1382672     5468317
13   10001493     1383020    21799689
14   10001516     1383020   204456809
15   10002899     1383020    62413425
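
One thing worth checking with this approach: `groupby(...).max()` takes the maximum of each column separately, so the product_id it reports is the largest product_id in the market, which is not necessarily the product with the largest Sales. The sketch below uses made-up numbers (the frame `gr` and the names `colwise` and `top` are hypothetical) to show the difference, along with one possible alternative based on sort_values and drop_duplicates that keeps the whole top-selling row.

```python
import pandas as pd

# Hypothetical per-product totals; in market 1 the top seller
# has the SMALLER product_id.
gr = pd.DataFrame({
    'market_id':  [1, 1, 2, 2],
    'product_id': [10, 20, 10, 20],
    'Sales':      [500, 5, 7, 70],
})

# Column-wise maximum: product_id and Sales are maximised separately,
# so market 1 pairs product_id 20 with Sales 500.
colwise = gr.groupby('market_id', as_index=False).max()

# Alternative: sort by decreasing Sales and keep the first row
# seen for each market, so product_id and Sales stay together.
top = (gr.sort_values('Sales', ascending=False)
         .drop_duplicates('market_id')
         .sort_values('market_id')
         .reset_index(drop=True))

print(colwise)
print(top)
```

If the top seller in every market also happens to have the largest product_id, both results agree, which may be why the issue does not show up in every data set.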

and (shortened example):

In[624]: gr_sales
Out[624]: 
               market_id  product_id       Sales
market_id                                       
1006174   11     1006174     1383020   269138089
          9      1006174     1382672     5070111
          5      1006174     1382536     2442639
          7      1006174     1382602     1108361
          6      1006174     1382557      158488
          8      1006174     1382651       17214
          1      1006174     1382460        4586
          0      1006174     1132679        2789
          3      1006174     1382490         799
          2      1006174     1382464         105
          10     1006174     1382691          49
          4      1006174     1382522          16
1006638   28     1006638     1383020  1330070614
          25     1006638     1382672   109679596
          12     1006638     1132679     5143156
          17     1006638     1382536     4885278
          22     1006638     1382620     2668948
          21     1006638     1382602     2216722
          18     1006638     1382538      992228
          13     1006638     1382460      387250
          19     1006638     1382557      316976
          23     1006638     1382651       39616
          26     1006638     1382674       22388
          20     1006638     1382573        7412
          15     1006638     1382490        1598
          14     1006638     1382464         758
          24     1006638     1382665         120
          27     1006638     1382691          98
          16     1006638     1382522          32
1006678   32     1006678     1383020    58548417
                 ...         ...         ...
[117 rows x 3 columns]

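I am not sure how to remove the arbitrary index from the gr_sales output (having it in the middle like that is a bit annoying), or how to remove the arbitrary index from the max_sales table.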
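
On the extra index: one possible way to avoid it, sketched here with made-up data, is to sort on both columns directly instead of using groupby with apply, and then discard the old row labels with reset_index(drop=True). The frame `gr` and the name `ranked` are hypothetical.

```python
import pandas as pd

# Hypothetical per-product totals, in no particular order.
gr = pd.DataFrame({
    'market_id':  [2, 1, 1, 2],
    'product_id': [10, 10, 20, 20],
    'Sales':      [7, 500, 5, 70],
})

# Sort by market, then by decreasing Sales within each market.
# No groupby/apply is involved, so no extra index level is created;
# reset_index(drop=True) replaces the old row labels with 0..n-1.
ranked = (gr.sort_values(['market_id', 'Sales'], ascending=[True, False])
            .reset_index(drop=True))
print(ranked)
```

The same reset_index(drop=True) call can be applied to any intermediate result whose row labels are no longer meaningful.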
