Python pandas groupby and eliminating duplicates

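I want to visit as few stores as possible to buy my products. How can I do that? I have a list of stores, each carrying specific products.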

import numpy as np
import pandas as pd

wanted_Products = pd.DataFrame({'p': [1, 2, 3, 4, 5, 6, 7]})
stores = pd.DataFrame({'Store': np.repeat(np.arange(1, 5), 4),
                       'Product': [1, 2, 3, 5, 0, 2, 3, 4, 0, 6, 7, 8, 0, 1, 2, 6]})
# return 1 if the Product is wanted
stores['Wanted'] = stores.Product.isin(wanted_Products.p).astype(int)
    Store  Product  Wanted
0       1        1       1
1       1        2       1
2       1        3       1
3       1        5       1
4       2        0       0
5       2        2       1
6       2        3       1
7       2        4       1
8       3        0       0
9       3        6       1
10      3        7       1
11      3        8       0
12      4        0       0
13      4        1       1
14      4        2       1
15      4        6       1
# Group products per store and calculate how many wanted products are in a store
w = stores.groupby('Store', as_index=False).agg(list)
w['Number_wanted'] = w['Wanted'].apply(sum)
   Store       Product        Wanted  Number_wanted  ?Products_wanted?
0      1  [1, 2, 3, 5]  [1, 1, 1, 1]              4          [1,2,3,5]
1      2  [0, 2, 3, 4]  [0, 1, 1, 1]              3            [2,3,4]
2      3  [0, 6, 7, 8]  [0, 1, 1, 0]              2              [6,7]
3      4  [0, 1, 2, 6]  [0, 1, 1, 1]              3            [1,2,6]

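How can I get the products I want, without the unwanted ones, in a new column (Products_wanted)? When I use isin() I only get True/False (or 1/0 if I add astype(int)) instead of the actual product numbers.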

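One way to do this is to keep track of which wanted products each store has, take them, and then mark those products as "taken" so that you don't pick the same ones again at the next store.

So initially you have wanted_Products = [1,2,3,4,5,6,7]. Since store 1 carries [1, 2, 3, 5], you select those and return them as the products to get from store 1, then mark them all as "taken" by replacing those values in wanted_Products with some other value, such as -1 (or any other value you like that indicates they have been taken).

Now wanted_Products = [-1,-1,-1,4,-1,6,7]. The products marked -1 have been taken, so only [4,6,7] can still be picked up at the following stores. Repeating the same logic for every store gives you the products to get from each one, without any duplicates: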

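def get_products(possible, wanted):
    # Mask of the wanted products that this store carries
    mask = np.isin(wanted, possible)
    available = wanted[mask]
    # Mark them as taken so that later stores skip them
    wanted[mask] = -1
    return available

w = stores.groupby('Store', as_index=False).agg(list)
remaining = wanted_Products['p'].to_numpy(copy=True)
w['Products to get'] = w.Product.apply(get_products, args=(remaining,))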

Output:

>>> w
   Store       Product Products to get
0      1  [1, 2, 3, 5]    [1, 2, 3, 5]
1      2  [0, 2, 3, 4]             [4]
2      3  [0, 6, 7, 8]          [6, 7]
3      4  [0, 1, 2, 6]              []
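
For the column the question literally asks about (every wanted product a store carries, before any de-duplication across stores), one possible sketch is to filter the rows with isin() first and group afterwards, so the actual product numbers are kept instead of booleans. The names wanted_products and products_wanted are only illustrative, and the data is the sample from the question.

```python
import numpy as np
import pandas as pd

wanted_products = [1, 2, 3, 4, 5, 6, 7]
stores = pd.DataFrame({'Store': np.repeat(np.arange(1, 5), 4),
                       'Product': [1, 2, 3, 5, 0, 2, 3, 4, 0, 6, 7, 8, 0, 1, 2, 6]})

# Keep only the rows whose product is wanted, then collect them per store.
products_wanted = (stores[stores['Product'].isin(wanted_products)]
                   .groupby('Store')['Product']
                   .agg(list))

w = stores.groupby('Store', as_index=False).agg(list)
# Align on the store number; a store with no wanted product would get NaN here.
w['Products_wanted'] = w['Store'].map(products_wanted)
print(w)
```

With the sample data, store 2 ends up with [2, 3, 4] and store 3 with [6, 7], matching the ?Products_wanted? column sketched in the question.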

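To honor your optimization criterion (always buy from the store that carries the most products on your list), the per-store product lists have to be re-sorted on every iteration: each time you decide to get a set of products from a given store, the remaining lists must be cleaned up (purchased products removed) and re-ordered by length.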

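As a technical note, I convert your lists to sets. You don't want duplicates anyway, so this is safe, and it gives us set operations: intersection (to check which wanted products a given store has) and difference (to remove purchased products from the wanted list).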
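
As a small illustration of those two set operations, here is a sketch using stores 1 and 2 from the question (the variable names store_1, store_2, from_store_1 and from_store_2 are only for this example):

```python
wanted = {1, 2, 3, 4, 5, 6, 7}
store_1 = set([1, 2, 3, 5])
store_2 = set([0, 2, 3, 4])

# Intersection: which wanted products does store 1 carry?
from_store_1 = store_1 & wanted
# Difference: remove the purchased products from the wanted set.
wanted -= from_store_1

# Store 2 carries 2, 3 and 4, but only 4 is still wanted at this point.
from_store_2 = store_2 & wanted
wanted -= from_store_2

print(from_store_1, from_store_2, wanted)
```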

The full code isn't very elegant, but I've included plenty of comments:

import numpy as np
import pandas as pd

stores = pd.DataFrame({'Store': np.repeat(np.arange(1, 5), 4),
                       'Product': [1, 2, 3, 5, 0, 2, 3, 4, 0, 6, 7, 8, 0, 1, 2, 6]})
# stores = pd.DataFrame({'Store': np.repeat(np.arange(1, 5), 4),
#                        'Product': [0, 2, 7, 6, 0, 2, 4, 8, 1, 2, 7, 6, 1, 2, 3, 5]})
w = stores.groupby('Store', as_index=False).agg(list)
w['Products to get'] = np.nan
w['Products to get'] = w['Products to get'].astype('object')
wanted_Products = [1, 2, 3, 4, 5, 6, 7]
wanted = set(wanted_Products)
# Work on a copy so that w itself is not modified
tmp = w[['Store', 'Product']].copy()
# Stop when everything is bought or no store is left to check
while len(wanted) > 0 and not tmp.empty:
    # Remove unwanted products (set intersection)
    tmp['Product'] = tmp.Product.apply(lambda x: set(x) & wanted)

    # Sort on length of product sets
    tmp['lengths'] = tmp.Product.apply(len)
    tmp = tmp.sort_values(by='lengths', ascending=False).drop(columns='lengths')
    # Get products from this store, remove them from wanted set
    get = tmp.loc[tmp.index[0], 'Product'] & wanted
    wanted -= get
    # Update Products to get for this store
    row = w[w['Store'] == tmp.loc[tmp.index[0], 'Store']]
    w.at[row.index[0], 'Products to get'] = get
    # Remove the largest product set, work on the remaining ones
    tmp = tmp.iloc[1:].copy()

This is the output:

In [71]: w
Out[71]: 
   Store       Product Products to get
0      1  [1, 2, 3, 5]    {1, 2, 3, 5}
1      2  [0, 2, 3, 4]             {4}
2      3  [0, 6, 7, 8]          {6, 7}
3      4  [0, 1, 2, 6]             NaN

It still works when stores 3 and 4 carry more of the wanted products:

stores = pd.DataFrame({'Store': np.repeat(np.arange(1, 5), 4),
                       'Product': [0, 2, 7, 6, 0, 2, 4, 8, 1, 2, 7, 6, 1, 2, 3, 5]})

The output is:

In [76]: w
Out[76]: 
   Store       Product Products to get
0      1  [0, 2, 7, 6]             NaN
1      2  [0, 2, 4, 8]             {4}
2      3  [1, 2, 7, 6]    {1, 2, 6, 7}
3      4  [1, 2, 3, 5]          {3, 5}
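
The same greedy idea can also be sketched as a small standalone function on plain dicts and sets, which may be easier to test than the DataFrame loop. The function name greedy_store_plan and the store_products dict are hypothetical, introduced only for this example. The function also returns whatever could not be found in any store. Keep in mind that "largest store first" is a heuristic, so it is not guaranteed to give the smallest possible number of stores in every case.

```python
def greedy_store_plan(store_products, wanted):
    """Pick, again and again, the store covering the most remaining products."""
    remaining = set(wanted)
    candidates = {store: set(products) for store, products in store_products.items()}
    plan = {}
    while remaining and candidates:
        # Store with the largest overlap with what is still wanted
        # (on a tie, the store listed first wins).
        best = max(candidates, key=lambda s: len(candidates[s] & remaining))
        picked = candidates.pop(best) & remaining
        if not picked:
            # No remaining store carries anything we still need.
            break
        plan[best] = picked
        remaining -= picked
    return plan, remaining


# Sample data from the question, written as a plain dict.
store_products = {
    1: [1, 2, 3, 5],
    2: [0, 2, 3, 4],
    3: [0, 6, 7, 8],
    4: [0, 1, 2, 6],
}
plan, missing = greedy_store_plan(store_products, [1, 2, 3, 4, 5, 6, 7])
# plan: store 1 -> {1, 2, 3, 5}, store 3 -> {6, 7}, store 2 -> {4}; nothing missing
print(plan, missing)
```

Starting from the DataFrame, a dict like this could be built with stores.groupby('Store')['Product'].agg(list).to_dict().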
