Pandas: tricky groupby with multiple groups from lists in a single column



I'm stumped. Imagine a DataFrame with a column that holds several product category "tags" per row, stored as lists, like so.

import pandas as pd
raw = {
    'Products' : ['Rock On Leather Journal',
                  'Beats Earbuds In Ear Timer',
                  'Garmin 25mm Wristwatch'
                 ],
    'Product Cost': [55,163,200],
    'Product Category' : [['Music','Journals','Paper'], 
                          ['Headphones','Music', 'Clocks'], 
                          ['Watches','Clocks']]
}
data = pd.DataFrame(raw)

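What is the best way to count how many products fall within each category, and to average the cost for each category? For example,

Music: count - ?, avg price - ?

Normally this is a straightforward job for map or groupby, but the lists in the category column add a tricky twist.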

Source DF:

In [21]: data
Out[21]:
              Product Category  Product Cost                    Products
0     [Music, Journals, Paper]            55     Rock On Leather Journal
1  [Headphones, Music, Clocks]           163  Beats Earbuds In Ear Timer
2            [Watches, Clocks]           200      Garmin 25mm Wristwatch

First, let's flatten it into the following DF, with one row per product and category pair:

In [22]: import numpy as np
    ...: lst_col = 'Product Category'
    ...:
    ...: x = pd.DataFrame({
    ...:     col:np.repeat(data[col].values, data[lst_col].str.len())
    ...:     for col in data.columns.difference([lst_col])
    ...: }).assign(**{lst_col:np.concatenate(data[lst_col].values)})[data.columns.tolist()]
    ...:
In [23]: x
Out[23]:
  Product Category  Product Cost                    Products
0            Music            55     Rock On Leather Journal
1         Journals            55     Rock On Leather Journal
2            Paper            55     Rock On Leather Journal
3       Headphones           163  Beats Earbuds In Ear Timer
4            Music           163  Beats Earbuds In Ear Timer
5           Clocks           163  Beats Earbuds In Ear Timer
6          Watches           200      Garmin 25mm Wristwatch
7           Clocks           200      Garmin 25mm Wristwatch

Now we can easily get the "count of how many Products within each Category, and to average the costs for each category":

In [25]: x.groupby('Product Category')['Product Cost'].agg(['size', 'mean']).reset_index()
Out[25]:
  Product Category  size   mean
0           Clocks     2  181.5
1       Headphones     1  163.0
2         Journals     1   55.0
3            Music     2  109.0
4            Paper     1   55.0
5          Watches     1  200.0
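
As an aside, newer pandas releases (0.25 and later) ship DataFrame.explode, which performs the same flattening in one call. This is a different technique from the np.repeat approach above, shown only as a minimal sketch on the question's sample data:

```python
import pandas as pd

data = pd.DataFrame({
    'Products': ['Rock On Leather Journal',
                 'Beats Earbuds In Ear Timer',
                 'Garmin 25mm Wristwatch'],
    'Product Cost': [55, 163, 200],
    'Product Category': [['Music', 'Journals', 'Paper'],
                         ['Headphones', 'Music', 'Clocks'],
                         ['Watches', 'Clocks']],
})

# explode() emits one row per list element and repeats the other columns
flat = data.explode('Product Category')

# Same aggregation as above: row count and mean cost per category
summary = (flat.groupby('Product Category')['Product Cost']
               .agg(['size', 'mean'])
               .reset_index())
print(summary)
```

Note that explode keeps the original index values (0, 0, 0, 1, 1, 1, 2, 2), so call reset_index(drop=True) on the flattened frame if a unique index is needed.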

Some explanation:

The number of list elements in each row:

In [7]: data[lst_col].str.len()
Out[7]:
0    3
1    3
2    2
Name: Product Category, dtype: int64

Using this information, we can repeat every non-list column once per list element, as follows:

In [3]: x = pd.DataFrame({
   ...:     col:np.repeat(data[col].values, data[lst_col].str.len())
   ...:     for col in data.columns.difference([lst_col])
   ...: })
In [4]: x
Out[4]:
   Product Cost                    Products
0            55     Rock On Leather Journal
1            55     Rock On Leather Journal
2            55     Rock On Leather Journal
3           163  Beats Earbuds In Ear Timer
4           163  Beats Earbuds In Ear Timer
5           163  Beats Earbuds In Ear Timer
6           200      Garmin 25mm Wristwatch
7           200      Garmin 25mm Wristwatch
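
To see what np.repeat does in isolation, here is a small sketch with the cost values and list lengths from this example typed in by hand. When the second argument is a sequence, each element is repeated by its own count:

```python
import numpy as np

values = np.array([55, 163, 200])
counts = [3, 3, 2]  # list lengths per row, as returned by .str.len()

# Each value is repeated by the count in the same position
repeated = np.repeat(values, counts)
print(repeated.tolist())  # [55, 55, 55, 163, 163, 163, 200, 200]
```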

Now we can add the flattened list column:

In [8]: np.concatenate(data[lst_col].values)
Out[8]:
array(['Music', 'Journals', 'Paper', 'Headphones', 'Music', 'Clocks', 'Watches', 'Clocks'],
      dtype='<U10')
In [5]: x.assign(**{lst_col:np.concatenate(data[lst_col].values)})
Out[5]:
   Product Cost                    Products Product Category
0            55     Rock On Leather Journal            Music
1            55     Rock On Leather Journal         Journals
2            55     Rock On Leather Journal            Paper
3           163  Beats Earbuds In Ear Timer       Headphones
4           163  Beats Earbuds In Ear Timer            Music
5           163  Beats Earbuds In Ear Timer           Clocks
6           200      Garmin 25mm Wristwatch          Watches
7           200      Garmin 25mm Wristwatch           Clocks

Finally, we simply select the columns in their original order:

In [6]: x.assign(**{lst_col:np.concatenate(data[lst_col].values)})[data.columns.tolist()]
Out[6]:
  Product Category  Product Cost                    Products
0            Music            55     Rock On Leather Journal
1         Journals            55     Rock On Leather Journal
2            Paper            55     Rock On Leather Journal
3       Headphones           163  Beats Earbuds In Ear Timer
4            Music           163  Beats Earbuds In Ear Timer
5           Clocks           163  Beats Earbuds In Ear Timer
6          Watches           200      Garmin 25mm Wristwatch
7           Clocks           200      Garmin 25mm Wristwatch
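
The steps above can be wrapped into a small reusable function. The name flatten_list_column is hypothetical (it is not a pandas API), and the sketch assumes every cell in the list column really is a list; other values such as NaN are not handled:

```python
import numpy as np
import pandas as pd

def flatten_list_column(df, lst_col):
    """Return a new frame with one row per element of the lists in lst_col."""
    lengths = df[lst_col].str.len()
    # Repeat every non-list column once per list element
    flat = pd.DataFrame({
        col: np.repeat(df[col].values, lengths)
        for col in df.columns.difference([lst_col])
    })
    # Add the flattened list column, then restore the original column order
    flat[lst_col] = np.concatenate(df[lst_col].values)
    return flat[df.columns.tolist()]

data = pd.DataFrame({
    'Products': ['Rock On Leather Journal',
                 'Beats Earbuds In Ear Timer',
                 'Garmin 25mm Wristwatch'],
    'Product Cost': [55, 163, 200],
    'Product Category': [['Music', 'Journals', 'Paper'],
                         ['Headphones', 'Music', 'Clocks'],
                         ['Watches', 'Clocks']],
})

flat = flatten_list_column(data, 'Product Category')
print(flat)
```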

It depends on your situation. If your data is about this size, you might want to create a boolean column for each category, for example:

from itertools import chain

unique_products = set(chain(*data['Product Category']))
for product in unique_products:
    # True where the row's category list contains this tag
    data['product_{}'.format(product)] = data['Product Category'].apply(lambda x: product in x)

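Alternatively, if you have a large number of products, skip the stored columns and keep using data['Product Category'].apply(lambda x: product in x) where you need it. You can also run similar checks with isin([product_one, product_two]), but isin matches scalar values, so it fits a flattened column rather than a column of lists.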

Once you have the synthetic columns, you can use them for the grouping and aggregation.
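
As a rough sketch of how the boolean approach can answer the original question, the loop below builds one mask per category and uses it to count rows and average costs. It uses the masks directly instead of storing them as columns, and the names categories, mask, and summary are illustrative only:

```python
from itertools import chain

import pandas as pd

data = pd.DataFrame({
    'Products': ['Rock On Leather Journal',
                 'Beats Earbuds In Ear Timer',
                 'Garmin 25mm Wristwatch'],
    'Product Cost': [55, 163, 200],
    'Product Category': [['Music', 'Journals', 'Paper'],
                         ['Headphones', 'Music', 'Clocks'],
                         ['Watches', 'Clocks']],
})

# Every distinct tag that appears in any row, sorted for stable output
categories = sorted(set(chain.from_iterable(data['Product Category'])))

rows = []
for category in categories:
    # Boolean mask: True where the row's list contains this category
    mask = data['Product Category'].apply(lambda tags: category in tags)
    rows.append({
        'Product Category': category,
        'count': int(mask.sum()),
        'mean': data.loc[mask, 'Product Cost'].mean(),
    })

summary = pd.DataFrame(rows)
print(summary)
```

This scans the list column once per category, so it is likely to be slower than flattening when there are many distinct categories.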
