统计数据帧列中多个元素的出现次数




以下是我正在处理的数据集:

{'type': {0: 'TV Show', 1: 'Movie', 2: 'Movie', 3: 'Movie', 4: 'Movie'},
'title': {0: '3%', 1: '7:19', 2: '23:59', 3: '9', 4: '21'},
'director': {0: nan,
1: 'Jorge Michel Grau',
2: 'Gilbert Chan',
3: 'Shane Acker',
4: 'Robert Luketic'},
'cast': {0: 'João Miguel, Bianca Comparato, Michel Gomes, Rodolfo Valente, Vaneza Oliveira, Rafael Lozano, Viviane Porto, Mel Fronckowiak, Sergio Mamberti, Zezé Motta, Celso Frateschi',
1: 'Demián Bichir, Héctor Bonilla, Oscar Serrano, Azalia Ortiz, Octavio Michel, Carmen Beato',
2: 'Tedd Chan, Stella Chung, Henley Hii, Lawrence Koh, Tommy Kuan, Josh Lai, Mark Lee, Susan Leong, Benjamin Lim',
3: 'Elijah Wood, John C. Reilly, Jennifer Connelly, Christopher Plummer, Crispin Glover, Martin Landau, Fred Tatasciore, Alan Oppenheimer, Tom Kane',
4: 'Jim Sturgess, Kevin Spacey, Kate Bosworth, Aaron Yoo, Liza Lapira, Jacob Pitts, Laurence Fishburne, Jack McGee, Josh Gad, Sam Golzari, Helen Carey, Jack Gilpin'}}

我想做的是提取每个演员在电影/电视节目中的参与次数。我创建了几个函数来从数据中分离参与者的名称:

def quick_extract(x) :
z = re.split(',', x)
for n, i in enumerate(z):
z[n] = z[n].strip(' ')
return z
def unique(data, column) :
x = data[column].fillna('missing')
s = []
for l in x.apply(lambda z: quick_extract(z)) :
s+=list(l)
s = [i for i in s if len(i)>0] 
return set(s)
unique(data, 'cast') # there are 32882 distinct actors in the dataset

为了计算参与者的参与次数,我使用了如下的for循环:

for l in unique(data, 'cast') : 
print(l, data['cast'].fillna('missing').apply(lambda z: l in z).sum())

但是由于它的迭代次数为32882次,所以它花费了太多时间。迭代方法通常需要花费大量时间,在这种情况下,有其他有效的方法可以花费更少的时间吗?

您的整个方法效率非常低。只需使用collections.Counter对象。假设你有一个类似的数据帧

In [8]: df
Out[8]:
type  title           director                                               cast
0  TV Show     3%                NaN  João Miguel, Bianca Comparato, Michel Gomes, R...
1    Movie   7:19  Jorge Michel Grau  Demián Bichir, Héctor Bonilla, Oscar Serrano, ...
2    Movie  23:59       Gilbert Chan  Tedd Chan, Stella Chung, Henley Hii, Lawrence ...
3    Movie      9        Shane Acker  Elijah Wood, John C. Reilly, Jennifer Connelly...
4    Movie     21     Robert Luketic  Jim Sturgess, Kevin Spacey, Kate Bosworth, Aar...

然后你可以做一些类似的事情:

In [9]: from collections import Counter
In [10]: Counter(
...:     actor.strip()
...:     for cast in df['cast'].fillna('missing')
...:     for actor in cast.split(',')
...: )
Out[10]:
Counter({'João Miguel': 1,
'Bianca Comparato': 1,
'Michel Gomes': 1,
'Rodolfo Valente': 1,
'Vaneza Oliveira': 1,
'Rafael Lozano': 1,
'Viviane Porto': 1,
'Mel Fronckowiak': 1,
'Sergio Mamberti': 1,
'Zezé Motta': 1,
'Celso Frateschi': 1,
'Demián Bichir': 1,
'Héctor Bonilla': 1,
'Oscar Serrano': 1,
'Azalia Ortiz': 1,
'Octavio Michel': 1,
'Carmen Beato': 1,
'Tedd Chan': 1,
'Stella Chung': 1,
'Henley Hii': 1,
'Lawrence Koh': 1,
'Tommy Kuan': 1,
'Josh Lai': 1,
'Mark Lee': 1,
'Susan Leong': 1,
'Benjamin Lim': 1,
'Elijah Wood': 1,
'John C. Reilly': 1,
'Jennifer Connelly': 1,
'Christopher Plummer': 1,
'Crispin Glover': 1,
'Martin Landau': 1,
'Fred Tatasciore': 1,
'Alan Oppenheimer': 1,
'Tom Kane': 1,
'Jim Sturgess': 1,
'Kevin Spacey': 1,
'Kate Bosworth': 1,
'Aaron Yoo': 1,
'Liza Lapira': 1,
'Jacob Pitts': 1,
'Laurence Fishburne': 1,
'Jack McGee': 1,
'Josh Gad': 1,
'Sam Golzari': 1,
'Helen Carey': 1,
'Jack Gilpin': 1})

只是从时间上比较两种不同的方法:

In [17]: %timeit df['cast'].str.split(', ').explode('cast').value_counts()
667 µs ± 16.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [18]: %%timeit
...: Counter(
...:     actor.strip()
...:     for cast in df['cast'].fillna('missing')
...:     for actor in cast.split(',')
...: )
...:
...:
146 µs ± 2.76 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

或者也许是一个更公平的比较,没有fillna,并假设您可以在逗号和空格', ':上干净地拆分

In [28]: %%timeit
...: Counter(
...:     actor
...:     for cast in df['cast']
...:     for actor in cast.split(', ')
...: )
...:
...:
...:
12.9 µs ± 543 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

您可以通过分解和分组来获取计数。

df['cast'] = df['cast'].apply(lambda x: x.split(', '))
df = df.explode('cast')
print (df.groupby('cast')['title'].count())

正如@juanpa.arrivilaga所说,您可以将其合并为一行:

print (df['cast'].str.split(', ').explode('cast').value_counts())

完整代码:

import pandas as pd
import numpy as np
d = {'type': {0: 'TV Show', 1: 'Movie', 2: 'Movie', 3: 'Movie', 4: 'Movie'},
'title': {0: '3%', 1: '7:19', 2: '23:59', 3: '9', 4: '21'},
'director': {0: np.nan,
1: 'Jorge Michel Grau',
2: 'Gilbert Chan',
3: 'Shane Acker',
4: 'Robert Luketic'},
'cast': {0: 'João Miguel, Bianca Comparato, Michel Gomes, Rodolfo Valente, Vaneza Oliveira, Rafael Lozano, Viviane Porto, Mel Fronckowiak, Sergio Mamberti, Zezé Motta, Celso Frateschi',
1: 'Demián Bichir, Héctor Bonilla, Oscar Serrano, Azalia Ortiz, Octavio Michel, Carmen Beato',
2: 'Tedd Chan, Stella Chung, Henley Hii, Lawrence Koh, Tommy Kuan, Josh Lai, Mark Lee, Susan Leong, Benjamin Lim',
3: 'Elijah Wood, John C. Reilly, Jennifer Connelly, Christopher Plummer, Crispin Glover, Martin Landau, Fred Tatasciore, Alan Oppenheimer, Tom Kane',
4: 'Jim Sturgess, Kevin Spacey, Kate Bosworth, Aaron Yoo, Liza Lapira, Jacob Pitts, Laurence Fishburne, Jack McGee, Josh Gad, Sam Golzari, Helen Carey, Jack Gilpin'}}
df = pd.DataFrame(d)
df['cast'] = df['cast'].apply(lambda x: x.split(', '))
df = df.explode('cast')
print (df.groupby('cast')['title'].count())

其输出为:

cast
Aaron Yoo              1
Alan Oppenheimer       1
Azalia Ortiz           1
Benjamin Lim           1
Bianca Comparato       1
Carmen Beato           1
Celso Frateschi        1
Christopher Plummer    1
Crispin Glover         1
Demián Bichir          1
Elijah Wood            1
Fred Tatasciore        1
Helen Carey            1
Henley Hii             1
Héctor Bonilla         1
Jack Gilpin            1
Jack McGee             1
Jacob Pitts            1
Jennifer Connelly      1
Jim Sturgess           1
John C. Reilly         1
Josh Gad               1
Josh Lai               1
João Miguel            1
Kate Bosworth          1
Kevin Spacey           1
Laurence Fishburne     1
Lawrence Koh           1
Liza Lapira            1
Mark Lee               1
Martin Landau          1
Mel Fronckowiak        1
Michel Gomes           1
Octavio Michel         1
Oscar Serrano          1
Rafael Lozano          1
Rodolfo Valente        1
Sam Golzari            1
Sergio Mamberti        1
Stella Chung           1
Susan Leong            1
Tedd Chan              1
Tom Kane               1
Tommy Kuan             1
Vaneza Oliveira        1
Viviane Porto          1
Zezé Motta             1
Name: title, dtype: int64

最新更新