Using a CSV in a format like this (the full CSV is ~500 x ~600,000, hence the missing columns):
Sales market_id product_id
0 38 10001516 1132679
1 49 10001516 1138767
2 6 10001516 1132679
... ... ...
9969 245732 1002123 1383020
9970 247093 1006821 1383020
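and so on, read in like this:
df0 = pd.read_csv('all_final_decomps2_small.csv', low_memory=False, encoding='iso8859_15')
I am trying to find the product_id with the highest sales for each market_id. To do that, I need to sum Sales per product_id within each market_id, because the same product_id and market_id can appear in multiple rows.
I have tried this, which produces the total for each product in each market: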
df_sales=df0[['Sales','market_id','product_id']]
df_sales.groupby(['market_id', 'product_id'])['Sales'].sum()
like so (shortened):
market_id product_id
1006174 1132679 2789
1382460 4586
1382691 49
1383020 269138089
1006638 1132679 5143156
1382460 387250
1383020 204456809
10002899 1132679 630
1382464 220
Using:
df_sales.groupby(['market_id', 'product_id'])['Sales'].sum().max()
returns only the largest of the sums and nothing else, so in this case it returns 269138089. I would like to get something like this instead:
market_id product_id max_sales
1006174 1383020 269138089
1006638 1383020 204456809
10002899 1132679 630
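I have tried a lot of different things but can't seem to get anything to work for this example, so I would appreciate any help (and I apologise if this appears to have been asked before).
I am using: Python 3.6.1 :: Anaconda 4.4.0 (64-bit)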
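To make the symptom concrete, here is a minimal sketch of why the attempt above collapses to one number. The small DataFrame is a hypothetical stand-in built from a few values in the shortened output, not the real file. Calling .max() on the summed Series reduces over every row, while grouping again by the market_id index level gives one maximum per market but still drops the product_id.

```python
import pandas as pd

# Hypothetical miniature of the data, using a few values from the shortened output above
df_sales = pd.DataFrame({
    "market_id":  [1006174, 1006174, 1006638, 1006638, 10002899, 10002899],
    "product_id": [1132679, 1383020, 1132679, 1383020, 1132679, 1382464],
    "Sales":      [2789, 269138089, 5143156, 204456809, 630, 220],
})

# Total sales per (market_id, product_id); the result is a Series with a two-level index
totals = df_sales.groupby(["market_id", "product_id"])["Sales"].sum()

# A plain .max() reduces over the whole Series and returns a single scalar
overall_max = totals.max()

# Grouping by the market_id level gives one maximum per market,
# but the product_id that produced it is no longer part of the result
per_market_max = totals.groupby(level="market_id").max()

print(overall_max)
print(per_market_max)
```

So the missing piece is not the maximum value itself but the index label at which it occurs.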
Use idxmax in a groupby
Setup
import pandas as pd
from io import StringIO
txt = """market_id product_id Sales
1006174 1132679 2789
1006174 1382460 4586
1006174 1382691 49
1006174 1383020 269138089
1006638 1132679 5143156
1006638 1382460 387250
1006638 1383020 204456809
10002899 1132679 630
10002899 1382464 220"""
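# sep=r'\s+' splits on whitespace; squeeze('columns') turns the one-column frame into a Series
# (the delim_whitespace and squeeze keyword arguments are deprecated or removed in recent pandas)
sales = pd.read_csv(StringIO(txt), sep=r'\s+', index_col=[0, 1]).squeeze('columns')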
Solution
sales.loc[sales.groupby(level=0).idxmax()]
market_id product_id
1006174 1383020 269138089
1006638 1383020 204456809
10002899 1132679 630
Name: Sales, dtype: int64
Or
sales.loc[sales.groupby(level=0).idxmax()].reset_index(name='max_sales')
market_id product_id max_sales
0 1006174 1383020 269138089
1 1006638 1383020 204456809
2 10002899 1132679 630
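For readers starting from the raw, unaggregated rows, the same idea can be sketched end to end. The data below is made up for illustration, and the names raw, totals, best_labels and result are not from the original post. The key point is that idxmax on a Series with a two-level index returns the full (market_id, product_id) label of each group's maximum, which .loc can then look up.

```python
import pandas as pd

# Made-up raw rows for illustration: the same (market_id, product_id) pair may repeat
raw = pd.DataFrame({
    "market_id":  [1, 1, 1, 2, 2, 2],
    "product_id": [10, 10, 20, 10, 20, 20],
    "Sales":      [5, 7, 9, 4, 1, 2],
})

# Step 1: total sales per (market_id, product_id)
totals = raw.groupby(["market_id", "product_id"])["Sales"].sum()

# Step 2: for each market, idxmax gives the (market_id, product_id) label of the largest total
best_labels = totals.groupby(level="market_id").idxmax()

# Step 3: look those labels up and flatten the result into a regular DataFrame;
# tolist() just passes .loc a plain list of tuples
result = totals.loc[best_labels.tolist()].reset_index(name="max_sales")

print(result)
```

If two products tie for the top total in a market, idxmax keeps only the first one it encounters.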
Somehow managed to get this. I'm not sure it is the best way, but it works for my data:
df0=pd.read_csv('test.csv', low_memory=False, encoding='iso8859_15')
#Rank all items in each market by total sales
df_sales=df0[['Sales', 'market_id', 'product_id']] # int, int, int
# groups sales by market and product and sums product sales
gr_sales = df_sales.groupby(['market_id', 'product_id'], as_index = False).sum()
# gets the product sales in each market and sorts in order of decreasing sales
gr_sales = gr_sales.groupby('market_id').apply(pd.DataFrame.sort_values, 'Sales', ascending = False)
# Finds the product id with the maximum sales in each market
max_sales = gr_sales.groupby('market_id').max()
which gives me:
In[621]: max_sales
Out[621]:
market_id product_id Sales
0 1006174 1383020 269138089
1 1006638 1383020 1330070614
2 1006678 1383020 58548417
3 1006684 1383020 215858049
4 1006692 1383020 21799689
5 1006732 1383020 58548417
6 1006733 1383020 58548417
7 1006739 1383020 215858049
8 1006819 1383020 605951504
9 1006820 1383020 59083807
10 1006821 1383020 25116872
11 1050511 1382672 6201692
12 1050512 1382672 5468317
13 10001493 1383020 21799689
14 10001516 1383020 204456809
15 10002899 1383020 62413425
and (shortened example):
In[624]: gr_sales
Out[624]:
market_id product_id Sales
market_id
1006174 11 1006174 1383020 269138089
9 1006174 1382672 5070111
5 1006174 1382536 2442639
7 1006174 1382602 1108361
6 1006174 1382557 158488
8 1006174 1382651 17214
1 1006174 1382460 4586
0 1006174 1132679 2789
3 1006174 1382490 799
2 1006174 1382464 105
10 1006174 1382691 49
4 1006174 1382522 16
1006638 28 1006638 1383020 1330070614
25 1006638 1382672 109679596
12 1006638 1132679 5143156
17 1006638 1382536 4885278
22 1006638 1382620 2668948
21 1006638 1382602 2216722
18 1006638 1382538 992228
13 1006638 1382460 387250
19 1006638 1382557 316976
23 1006638 1382651 39616
26 1006638 1382674 22388
20 1006638 1382573 7412
15 1006638 1382490 1598
14 1006638 1382464 758
24 1006638 1382665 120
27 1006638 1382691 98
16 1006638 1382522 32
1006678 32 1006678 1383020 58548417
... ... ...
[117 rows x 3 columns]
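I'm not sure how to remove the arbitrary index from the gr_sales output (having it in the middle like that is a bit annoying), or from the max_sales table.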
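On the index question, one possible approach is to skip groupby(...).apply entirely and sort the flat table instead, so no extra index level is ever created. The sketch below uses made-up data and is only an illustration under that assumption. It also shows a point worth double-checking in the code above: groupby(...).max() takes the maximum of each column separately, so the product_id and Sales it reports for a market can come from different rows.

```python
import pandas as pd

# Made-up data for illustration; in market 1 the best seller has the smaller product_id
df_sales = pd.DataFrame({
    "market_id":  [1, 1, 1, 2, 2],
    "product_id": [10, 10, 20, 10, 20],
    "Sales":      [50, 60, 30, 5, 8],
})

# as_index=False keeps market_id and product_id as ordinary columns
gr_sales = df_sales.groupby(["market_id", "product_id"], as_index=False)["Sales"].sum()

# Sort so each market's best seller comes first, then renumber the rows 0..n-1
gr_sales = (gr_sales
            .sort_values(["market_id", "Sales"], ascending=[True, False])
            .reset_index(drop=True))

# Keep only the first (highest-selling) row of each market
max_sales = gr_sales.drop_duplicates("market_id").reset_index(drop=True)

# For comparison: .max() works column by column, so it pairs the largest
# product_id with the largest Sales even when they belong to different rows
columnwise = gr_sales.groupby("market_id", as_index=False).max()

print(max_sales)
print(columnwise)
```

Here max_sales reports product 10 for market 1, while the column-wise version reports product 20 with product 10's total.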