Python pandas: groupby and eliminate duplicates



I want to buy my products from as few stores as possible. How can I do that? I have a list of stores, each carrying certain products.

import numpy as np
import pandas as pd

wanted_Products = pd.DataFrame({'p':[1,2,3,4,5,6,7]})
stores = pd.DataFrame({'Store': np.repeat(np.arange(1,5),4),
                       'Product': [1,2,3,5,0,2,3,4,0,6,7,8,0,1,2,6]})
# return 1 if the Product is wanted
stores['Wanted'] = stores.Product.isin(wanted_Products.p).values.astype(int)
    Store  Product  Wanted
0       1        1       1
1       1        2       1
2       1        3       1
3       1        5       1
4       2        0       0
5       2        2       1
6       2        3       1
7       2        4       1
8       3        0       0
9       3        6       1
10      3        7       1
11      3        8       0
12      4        0       0
13      4        1       1
14      4        2       1
15      4        6       1
# Group products per store and calculate how many wanted products are in a store
w = stores.groupby('Store', as_index=False).agg(list)
# Each cell of w['Wanted'] is a list of 0/1 flags, so its sum is the count of wanted products
w['Number_wanted'] = w['Wanted'].apply(sum)
   Store       Product        Wanted  Number_wanted  ?Products_wanted?
0      1  [1, 2, 3, 5]  [1, 1, 1, 1]              4          [1,2,3,5]
1      2  [0, 2, 3, 4]  [0, 1, 1, 1]              3            [2,3,4]
2      3  [0, 6, 7, 8]  [0, 1, 1, 0]              2              [6,7]
3      4  [0, 1, 2, 6]  [0, 1, 1, 1]              3            [1,2,6]

How can I get my wanted products, without the unwanted ones, in a new column (Products_wanted)? When I use isin() I only get True/False (or 1/0 if I use astype(int)), not the actual product numbers.

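One way to achieve this is to keep track of all the wanted products that are available in a store, get them, and then mark those products as "taken" so that you don't pick the same ones in the next store.

So, initially you have wanted_Products = [1,2,3,4,5,6,7]. Since store 1 carries [1, 2, 3, 5], you pick those and return them as the products to get from store 1. You then mark all of them as "taken" by replacing those values in wanted_Products with some other value, such as -1 (or any other value you like that indicates they have been taken).

Now wanted_Products = [-1,-1,-1,4,-1,6,7]. The products marked -1 are taken, so from the next store you can only get products among [4,6,7]. Repeating the same logic for all stores gives you the products to get from each one, without any duplicates: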

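def get_products(possible, wanted):
    # Positions of the still-wanted products that this store carries
    i = np.where(np.isin(wanted, possible))
    available = wanted[i]
    # Mark them as taken so that later stores skip them
    wanted[i] = -1
    return available

w = stores.groupby('Store', as_index=False).agg(list)
# The same array is passed to every call, so the "taken" marks carry over from store to store
w['Products to get'] = w.Product.apply(get_products, args=(np.array(wanted_Products),))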

Output:

>>> w
   Store       Product Products to get
0      1  [1, 2, 3, 5]    [1, 2, 3, 5]
1      2  [0, 2, 3, 4]             [4]
2      3  [0, 6, 7, 8]          [6, 7]
3      4  [0, 1, 2, 6]              []
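
As a side note, the Products_wanted column sketched in the question (all wanted products per store, before any de-duplication) can be built by filtering each store's list rather than calling isin() on the whole column. This is only a minimal sketch: it assumes wanted_Products is the plain list used in the walkthrough above, and the names keep_wanted, wanted_set and per_store are made up for this example.

```python
import numpy as np
import pandas as pd

stores = pd.DataFrame({'Store': np.repeat(np.arange(1, 5), 4),
                       'Product': [1, 2, 3, 5, 0, 2, 3, 4, 0, 6, 7, 8, 0, 1, 2, 6]})
wanted_Products = [1, 2, 3, 4, 5, 6, 7]   # assumption: a plain list, not the DataFrame
wanted_set = set(wanted_Products)

def keep_wanted(products):
    # Hypothetical helper: keep only the wanted products, in the store's own order
    return [int(p) for p in products if p in wanted_set]

per_store = stores.groupby('Store', as_index=False).agg(list)
per_store['Products_wanted'] = per_store['Product'].apply(keep_wanted)
# The count follows directly from the filtered list
per_store['Number_wanted'] = per_store['Products_wanted'].apply(len)
print(per_store)
```

Because the filtering happens per list, the column holds the actual product numbers instead of boolean flags.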

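To honor your optimization criterion (always buy from the store that carries the most products on your list), the per-store product lists need to be re-sorted on every iteration: each time you decide to get a set of products from a given store, the remaining lists have to be cleaned up (removing the products already bought) and re-ordered by length.

As a technical note, I convert your lists to sets. Since you don't want duplicates this is safe to do, and it gives us set operations: intersection (to check which wanted products a given store carries) and difference (to remove the bought products from the wanted list).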
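
To illustrate those two set operations on their own, here is a small sketch using the first two stores from the question. The variable names are chosen just for this example.

```python
wanted = {1, 2, 3, 4, 5, 6, 7}
store_1 = set([1, 2, 3, 5])   # products carried by store 1
store_2 = set([0, 2, 3, 4])   # products carried by store 2

# Intersection: which wanted products does store 1 carry?
get = store_1 & wanted               # {1, 2, 3, 5}

# Difference: remove the bought products from the wanted set
remaining = wanted - get             # {4, 6, 7}

# Store 2 can now only contribute what is still wanted
from_store_2 = store_2 & remaining   # {4}
```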

The code isn't very elegant, but I've included plenty of comments:

import numpy as np
import pandas as pd

stores = pd.DataFrame({'Store': np.repeat(np.arange(1,5),4),
                       'Product': [1,2,3,5,0,2,3,4,0,6,7,8,0,1,2,6]})
# stores = pd.DataFrame({'Store': np.repeat(np.arange(1,5),4),
#                        'Product': [0,2,7,6,0,2,4,8,1,2,7,6,1,2,3,5]})
w = stores.groupby('Store', as_index=False).agg(list)
# Object column so that each cell can hold a set; NaN means "buy nothing here"
w['Products to get'] = np.nan
w['Products to get'] = w['Products to get'].astype('object')
wanted_Products = [1,2,3,4,5,6,7]
wanted = set(wanted_Products)
# Work on a copy so that assignments to tmp never touch w
tmp = w[['Store', 'Product']].copy()
# Stop when everything is bought or no stores are left to check
while len(wanted) > 0 and len(tmp) > 0:
    # Remove unwanted products (set intersection)
    tmp['Product'] = tmp.Product.apply(lambda x: set(x) & wanted)

    # Sort on length of product sets
    tmp['lengths'] = tmp.Product.apply(len)
    tmp = tmp.sort_values(by='lengths', ascending=False).drop(columns='lengths')
    # Get products from this store, remove them from wanted set
    get = tmp.loc[tmp.index[0], 'Product'] & wanted
    if not get:
        # No remaining store carries anything that is still wanted
        break
    wanted -= get
    # Update Products to get for this store
    row = w[w['Store'] == tmp.loc[tmp.index[0], 'Store']]
    w.at[row.index[0], 'Products to get'] = get
    # Remove the largest product set, work on the remaining ones
    tmp = tmp.iloc[1:].copy()

Here is the output:

In [71]: w
Out[71]: 
   Store       Product Products to get
0      1  [1, 2, 3, 5]    {1, 2, 3, 5}
1      2  [0, 2, 3, 4]             {4}
2      3  [0, 6, 7, 8]          {6, 7}
3      4  [0, 1, 2, 6]             NaN

With stores 3 and 4 carrying more of the wanted products, it still works:

stores = pd.DataFrame({'Store': np.repeat(np.arange(1,5),4),
                       'Product': [0,2,7,6,0,2,4,8,1,2,7,6,1,2,3,5]})

The output is:

In [76]: w
Out[76]: 
   Store       Product Products to get
0      1  [0, 2, 7, 6]             NaN
1      2  [0, 2, 4, 8]             {4}
2      3  [1, 2, 7, 6]    {1, 2, 6, 7}
3      4  [1, 2, 3, 5]          {3, 5}
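
The same greedy idea can also be sketched without the temporary DataFrame, using a plain dict of sets. This is an illustrative sketch rather than a drop-in replacement: the function name greedy_store_plan and the dict input shape are assumptions made for this example. A greedy choice like this is a heuristic, so it does not guarantee the smallest possible number of stores in every case.

```python
def greedy_store_plan(store_products, wanted):
    """Repeatedly pick the store that covers the most remaining wanted products.

    store_products: dict mapping store id -> iterable of products (assumed input shape)
    Returns (plan, remaining): plan maps store id -> set of products to buy there,
    remaining is the set of wanted products that no store carries.
    """
    remaining = set(wanted)
    candidates = {store: set(products) for store, products in store_products.items()}
    plan = {}
    while remaining and candidates:
        # Largest overlap wins; on ties the first store in dict order is chosen
        best = max(candidates, key=lambda s: len(candidates[s] & remaining))
        get = candidates.pop(best) & remaining
        if not get:
            break  # no remaining store carries anything that is still wanted
        plan[best] = get
        remaining -= get
    return plan, remaining


wanted_Products = [1, 2, 3, 4, 5, 6, 7]

stores_a = {1: [1, 2, 3, 5], 2: [0, 2, 3, 4], 3: [0, 6, 7, 8], 4: [0, 1, 2, 6]}
plan_a, missing_a = greedy_store_plan(stores_a, wanted_Products)
# plan_a == {1: {1, 2, 3, 5}, 3: {6, 7}, 2: {4}}, missing_a == set()

stores_b = {1: [0, 2, 7, 6], 2: [0, 2, 4, 8], 3: [1, 2, 7, 6], 4: [1, 2, 3, 5]}
plan_b, missing_b = greedy_store_plan(stores_b, wanted_Products)
# plan_b == {3: {1, 2, 6, 7}, 4: {3, 5}, 2: {4}}, missing_b == set()
```

Stores that are absent from the returned plan correspond to the NaN rows in the outputs above.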
