Stumped. Imagine a data frame with a column that holds multiple product-category "tags" per row, stored as lists, like so:
import pandas as pd
raw = {
'Products' : ['Rock On Leather Journal',
'Beats Earbuds In Ear Timer',
'Garmin 25mm Wristwatch'
],
'Product Cost': [55,163,200],
'Product Category' : [['Music','Journals','Paper'],
['Headphones','Music', 'Clocks'],
['Watches','Clocks']]
}
data = pd.DataFrame(raw)
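What is the best way to get a count of how many products fall within each category, and the average cost for each category? For example,
Music: count - ?, avg price - ?
Normally this is a straightforward process using map or groupby, but the presence of lists in the category column adds a tricky twist.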
Source DF:
In [21]: data
Out[21]:
Product Category Product Cost Products
0 [Music, Journals, Paper] 55 Rock On Leather Journal
1 [Headphones, Music, Clocks] 163 Beats Earbuds In Ear Timer
2 [Watches, Clocks] 200 Garmin 25mm Wristwatch
First, let's flatten it into the following DF:
In [22]: import numpy as np
...:
...: lst_col = 'Product Category'
...:
...: x = pd.DataFrame({
...: col:np.repeat(data[col].values, data[lst_col].str.len())
...: for col in data.columns.difference([lst_col])
...: }).assign(**{lst_col:np.concatenate(data[lst_col].values)})[data.columns.tolist()]
...:
In [23]: x
Out[23]:
Product Category Product Cost Products
0 Music 55 Rock On Leather Journal
1 Journals 55 Rock On Leather Journal
2 Paper 55 Rock On Leather Journal
3 Headphones 163 Beats Earbuds In Ear Timer
4 Music 163 Beats Earbuds In Ear Timer
5 Clocks 163 Beats Earbuds In Ear Timer
6 Watches 200 Garmin 25mm Wristwatch
7 Clocks 200 Garmin 25mm Wristwatch
Now we can easily "count how many Products are within each Category, and average the costs for each category":
In [25]: x.groupby('Product Category')['Product Cost'].agg(['size', 'mean']).reset_index()
Out[25]:
Product Category size mean
0 Clocks 2 181.5
1 Headphones 1 163.0
2 Journals 1 55.0
3 Music 2 109.0
4 Paper 1 55.0
5 Watches 1 200.0
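As a side note, pandas 0.25 and later ship DataFrame.explode, which does the flattening step in a single call. The following is only a sketch of that alternative, not the approach shown above; it rebuilds the sample frame from the question so it can be run on its own.

```python
import pandas as pd

data = pd.DataFrame({
    'Products': ['Rock On Leather Journal',
                 'Beats Earbuds In Ear Timer',
                 'Garmin 25mm Wristwatch'],
    'Product Cost': [55, 163, 200],
    'Product Category': [['Music', 'Journals', 'Paper'],
                         ['Headphones', 'Music', 'Clocks'],
                         ['Watches', 'Clocks']],
})

# explode() emits one row per list element and repeats the other columns
flat = data.explode('Product Category')

# Same aggregation as above: row count and mean cost per category
summary = (flat.groupby('Product Category')['Product Cost']
               .agg(['size', 'mean'])
               .reset_index())
print(summary)
```

Note that explode keeps the original index labels, so the flattened frame has repeated index values unless you call reset_index(drop=True) on it.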
Some explanation:
The number of list elements in each row:
In [7]: data[lst_col].str.len()
Out[7]:
0 3
1 3
2 2
Name: Product Category, dtype: int64
Using this information, we can repeat every non-list column as follows:
In [3]: x = pd.DataFrame({
...: col:np.repeat(data[col].values, data[lst_col].str.len())
...: for col in data.columns.difference([lst_col])
...: })
In [4]: x
Out[4]:
Product Cost Products
0 55 Rock On Leather Journal
1 55 Rock On Leather Journal
2 55 Rock On Leather Journal
3 163 Beats Earbuds In Ear Timer
4 163 Beats Earbuds In Ear Timer
5 163 Beats Earbuds In Ear Timer
6 200 Garmin 25mm Wristwatch
7 200 Garmin 25mm Wristwatch
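To see why this works, here is np.repeat in isolation. When the second argument is a sequence, each value is repeated by its own count. The arrays below are just the cost column and the list lengths from the example, typed in by hand.

```python
import numpy as np

values = np.array([55, 163, 200])   # one entry per original row
counts = [3, 3, 2]                  # list lengths, as returned by str.len()

# Each value is repeated by its own count, preserving row order
repeated = np.repeat(values, counts)
print(repeated.tolist())
```

The result has sum(counts) entries, which is exactly the number of elements in the flattened list column.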
Now we can add the flattened list column:
In [8]: np.concatenate(data[lst_col].values)
Out[8]:
array(['Music', 'Journals', 'Paper', 'Headphones', 'Music', 'Clocks', 'Watches', 'Clocks'],
dtype='<U10')
In [5]: x.assign(**{lst_col:np.concatenate(data[lst_col].values)})
Out[5]:
Product Cost Products Product Category
0 55 Rock On Leather Journal Music
1 55 Rock On Leather Journal Journals
2 55 Rock On Leather Journal Paper
3 163 Beats Earbuds In Ear Timer Headphones
4 163 Beats Earbuds In Ear Timer Music
5 163 Beats Earbuds In Ear Timer Clocks
6 200 Garmin 25mm Wristwatch Watches
7 200 Garmin 25mm Wristwatch Clocks
Finally, we simply select the columns in their original order:
In [6]: x.assign(**{lst_col:np.concatenate(data[lst_col].values)})[data.columns.tolist()]
Out[6]:
Product Category Product Cost Products
0 Music 55 Rock On Leather Journal
1 Journals 55 Rock On Leather Journal
2 Paper 55 Rock On Leather Journal
3 Headphones 163 Beats Earbuds In Ear Timer
4 Music 163 Beats Earbuds In Ear Timer
5 Clocks 163 Beats Earbuds In Ear Timer
6 Watches 200 Garmin 25mm Wristwatch
7 Clocks 200 Garmin 25mm Wristwatch
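If you need this more than once, the steps can be wrapped in a small helper. The name flatten_list_column is made up for this sketch, and it assumes every cell in the list column is a non-empty list; rows with an empty list would simply disappear from the result.

```python
import numpy as np
import pandas as pd

def flatten_list_column(df, lst_col):
    """Return a new frame with one row per element of df[lst_col].

    Hypothetical helper; assumes every cell of lst_col is a non-empty list.
    """
    lengths = df[lst_col].str.len()                 # elements per row
    other_cols = df.columns.difference([lst_col])   # all non-list columns
    flat = pd.DataFrame({col: np.repeat(df[col].values, lengths)
                         for col in other_cols})
    flat[lst_col] = np.concatenate(df[lst_col].values)
    return flat[df.columns.tolist()]                # restore column order

data = pd.DataFrame({
    'Products': ['Rock On Leather Journal',
                 'Beats Earbuds In Ear Timer',
                 'Garmin 25mm Wristwatch'],
    'Product Cost': [55, 163, 200],
    'Product Category': [['Music', 'Journals', 'Paper'],
                         ['Headphones', 'Music', 'Clocks'],
                         ['Watches', 'Clocks']],
})

flat = flatten_list_column(data, 'Product Category')
print(flat)
```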
It depends on your situation. If your data is about this size, you may want to create a boolean column for each category, for example:
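from itertools import chain

# Collect every distinct category that appears in any row's list
unique_products = set(chain(*data['Product Category']))
for product in unique_products:
    # True where the row's category list contains this category
    data['product_{}'.format(product)] = data['Product Category'].apply(lambda x: product in x)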
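Or, if you have a large number of products, skip the extra columns and just use data['Product Category'].apply(lambda x: product in x) where you need it. On a column of plain scalar values, you can also use data.Products.isin([product_one, product_two]) to perform a similar check.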
Once you have the synthetic columns, you can use them for merging.
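For what it's worth, the same membership test can also answer the original question directly. The sketch below is one possible way to do it, not necessarily what was intended above: it builds a boolean mask per category and uses it to count rows and average the cost. The variable names are made up for the example.

```python
from itertools import chain
import pandas as pd

data = pd.DataFrame({
    'Products': ['Rock On Leather Journal',
                 'Beats Earbuds In Ear Timer',
                 'Garmin 25mm Wristwatch'],
    'Product Cost': [55, 163, 200],
    'Product Category': [['Music', 'Journals', 'Paper'],
                         ['Headphones', 'Music', 'Clocks'],
                         ['Watches', 'Clocks']],
})

# Every distinct category across all rows, sorted for stable output
categories = sorted(set(chain.from_iterable(data['Product Category'])))

rows = []
for category in categories:
    # Boolean mask: True where the row's list contains this category
    mask = data['Product Category'].apply(lambda cats: category in cats)
    rows.append({'Product Category': category,
                 'size': int(mask.sum()),
                 'mean': data.loc[mask, 'Product Cost'].mean()})

summary = pd.DataFrame(rows)
print(summary)
```

This scans the list column once per category, so it is likely to be slower than flattening when there are many categories.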