以下是我正在处理的数据集:
{'type': {0: 'TV Show', 1: 'Movie', 2: 'Movie', 3: 'Movie', 4: 'Movie'},
'title': {0: '3%', 1: '7:19', 2: '23:59', 3: '9', 4: '21'},
'director': {0: nan,
1: 'Jorge Michel Grau',
2: 'Gilbert Chan',
3: 'Shane Acker',
4: 'Robert Luketic'},
'cast': {0: 'João Miguel, Bianca Comparato, Michel Gomes, Rodolfo Valente, Vaneza Oliveira, Rafael Lozano, Viviane Porto, Mel Fronckowiak, Sergio Mamberti, Zezé Motta, Celso Frateschi',
1: 'Demián Bichir, Héctor Bonilla, Oscar Serrano, Azalia Ortiz, Octavio Michel, Carmen Beato',
2: 'Tedd Chan, Stella Chung, Henley Hii, Lawrence Koh, Tommy Kuan, Josh Lai, Mark Lee, Susan Leong, Benjamin Lim',
3: 'Elijah Wood, John C. Reilly, Jennifer Connelly, Christopher Plummer, Crispin Glover, Martin Landau, Fred Tatasciore, Alan Oppenheimer, Tom Kane',
4: 'Jim Sturgess, Kevin Spacey, Kate Bosworth, Aaron Yoo, Liza Lapira, Jacob Pitts, Laurence Fishburne, Jack McGee, Josh Gad, Sam Golzari, Helen Carey, Jack Gilpin'}}
我想做的是提取每个演员在电影/电视节目中的参与次数。我创建了几个函数来从数据中分离参与者的名称:
def quick_extract(x) :
z = re.split(',', x)
for n, i in enumerate(z):
z[n] = z[n].strip(' ')
return z
def unique(data, column) :
x = data[column].fillna('missing')
s = []
for l in x.apply(lambda z: quick_extract(z)) :
s+=list(l)
s = [i for i in s if len(i)>0]
return set(s)
unique(data, 'cast') # there are 32882 distinct actors in the dataset
为了计算参与者的参与次数,我使用了如下的for循环:
for l in unique(data, 'cast') :
print(l, data['cast'].fillna('missing').apply(lambda z: l in z).sum())
但是由于它的迭代次数为32882次,所以它花费了太多时间。迭代方法通常需要花费大量时间,在这种情况下,有其他有效的方法可以花费更少的时间吗?
您的整个方法效率非常低。只需使用collections.Counter
对象。假设你有一个类似的数据帧
In [8]: df
Out[8]:
type title director cast
0 TV Show 3% NaN João Miguel, Bianca Comparato, Michel Gomes, R...
1 Movie 7:19 Jorge Michel Grau Demián Bichir, Héctor Bonilla, Oscar Serrano, ...
2 Movie 23:59 Gilbert Chan Tedd Chan, Stella Chung, Henley Hii, Lawrence ...
3 Movie 9 Shane Acker Elijah Wood, John C. Reilly, Jennifer Connelly...
4 Movie 21 Robert Luketic Jim Sturgess, Kevin Spacey, Kate Bosworth, Aar...
然后你可以做一些类似的事情:
In [9]: from collections import Counter
In [10]: Counter(
...: actor.strip()
...: for cast in df['cast'].fillna('missing')
...: for actor in cast.split(',')
...: )
Out[10]:
Counter({'João Miguel': 1,
'Bianca Comparato': 1,
'Michel Gomes': 1,
'Rodolfo Valente': 1,
'Vaneza Oliveira': 1,
'Rafael Lozano': 1,
'Viviane Porto': 1,
'Mel Fronckowiak': 1,
'Sergio Mamberti': 1,
'Zezé Motta': 1,
'Celso Frateschi': 1,
'Demián Bichir': 1,
'Héctor Bonilla': 1,
'Oscar Serrano': 1,
'Azalia Ortiz': 1,
'Octavio Michel': 1,
'Carmen Beato': 1,
'Tedd Chan': 1,
'Stella Chung': 1,
'Henley Hii': 1,
'Lawrence Koh': 1,
'Tommy Kuan': 1,
'Josh Lai': 1,
'Mark Lee': 1,
'Susan Leong': 1,
'Benjamin Lim': 1,
'Elijah Wood': 1,
'John C. Reilly': 1,
'Jennifer Connelly': 1,
'Christopher Plummer': 1,
'Crispin Glover': 1,
'Martin Landau': 1,
'Fred Tatasciore': 1,
'Alan Oppenheimer': 1,
'Tom Kane': 1,
'Jim Sturgess': 1,
'Kevin Spacey': 1,
'Kate Bosworth': 1,
'Aaron Yoo': 1,
'Liza Lapira': 1,
'Jacob Pitts': 1,
'Laurence Fishburne': 1,
'Jack McGee': 1,
'Josh Gad': 1,
'Sam Golzari': 1,
'Helen Carey': 1,
'Jack Gilpin': 1})
只是从时间上比较两种不同的方法:
In [17]: %timeit df['cast'].str.split(', ').explode('cast').value_counts()
667 µs ± 16.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [18]: %%timeit
...: Counter(
...: actor.strip()
...: for cast in df['cast'].fillna('missing')
...: for actor in cast.split(',')
...: )
...:
...:
146 µs ± 2.76 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
或者也许是一个更公平的比较,没有fillna,并假设您可以在逗号和空格', '
:上干净地拆分
In [28]: %%timeit
...: Counter(
...: actor
...: for cast in df['cast']
...: for actor in cast.split(', ')
...: )
...:
...:
...:
12.9 µs ± 543 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
您可以通过分解和分组来获取计数。
df['cast'] = df['cast'].apply(lambda x: x.split(', '))
df = df.explode('cast')
print (df.groupby('cast')['title'].count())
正如@juanpa.arrivilaga所说,您可以将其合并为一行:
print (df['cast'].str.split(', ').explode('cast').value_counts())
完整代码:
import pandas as pd
import numpy as np
d = {'type': {0: 'TV Show', 1: 'Movie', 2: 'Movie', 3: 'Movie', 4: 'Movie'},
'title': {0: '3%', 1: '7:19', 2: '23:59', 3: '9', 4: '21'},
'director': {0: np.nan,
1: 'Jorge Michel Grau',
2: 'Gilbert Chan',
3: 'Shane Acker',
4: 'Robert Luketic'},
'cast': {0: 'João Miguel, Bianca Comparato, Michel Gomes, Rodolfo Valente, Vaneza Oliveira, Rafael Lozano, Viviane Porto, Mel Fronckowiak, Sergio Mamberti, Zezé Motta, Celso Frateschi',
1: 'Demián Bichir, Héctor Bonilla, Oscar Serrano, Azalia Ortiz, Octavio Michel, Carmen Beato',
2: 'Tedd Chan, Stella Chung, Henley Hii, Lawrence Koh, Tommy Kuan, Josh Lai, Mark Lee, Susan Leong, Benjamin Lim',
3: 'Elijah Wood, John C. Reilly, Jennifer Connelly, Christopher Plummer, Crispin Glover, Martin Landau, Fred Tatasciore, Alan Oppenheimer, Tom Kane',
4: 'Jim Sturgess, Kevin Spacey, Kate Bosworth, Aaron Yoo, Liza Lapira, Jacob Pitts, Laurence Fishburne, Jack McGee, Josh Gad, Sam Golzari, Helen Carey, Jack Gilpin'}}
df = pd.DataFrame(d)
df['cast'] = df['cast'].apply(lambda x: x.split(', '))
df = df.explode('cast')
print (df.groupby('cast')['title'].count())
其输出为:
cast
Aaron Yoo 1
Alan Oppenheimer 1
Azalia Ortiz 1
Benjamin Lim 1
Bianca Comparato 1
Carmen Beato 1
Celso Frateschi 1
Christopher Plummer 1
Crispin Glover 1
Demián Bichir 1
Elijah Wood 1
Fred Tatasciore 1
Helen Carey 1
Henley Hii 1
Héctor Bonilla 1
Jack Gilpin 1
Jack McGee 1
Jacob Pitts 1
Jennifer Connelly 1
Jim Sturgess 1
John C. Reilly 1
Josh Gad 1
Josh Lai 1
João Miguel 1
Kate Bosworth 1
Kevin Spacey 1
Laurence Fishburne 1
Lawrence Koh 1
Liza Lapira 1
Mark Lee 1
Martin Landau 1
Mel Fronckowiak 1
Michel Gomes 1
Octavio Michel 1
Oscar Serrano 1
Rafael Lozano 1
Rodolfo Valente 1
Sam Golzari 1
Sergio Mamberti 1
Stella Chung 1
Susan Leong 1
Tedd Chan 1
Tom Kane 1
Tommy Kuan 1
Vaneza Oliveira 1
Viviane Porto 1
Zezé Motta 1
Name: title, dtype: int64