Make a DataFrame of all unique words with their counts and sums



I have a DataFrame df1 like this:

id    text                              c1
1     Hello world how are you people    1 
2     Hello people I am fine  people    1
3     Good Morning people               -1
4     Good Evening                      -1
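
For anyone who wants to reproduce this, here is a minimal sketch that builds the sample frame from the table above. The integer dtypes for `id` and `c1` are an assumption, since the question does not state them. The code in the answer further down refers to this same frame as `df`.

```python
import pandas as pd

# Sample data copied from the table above.
# The double space in row 2 is kept on purpose: str.split() with no
# arguments treats any run of whitespace as a single separator.
df1 = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "text": [
        "Hello world how are you people",
        "Hello people I am fine  people",
        "Good Morning people",
        "Good Evening",
    ],
    "c1": [1, 1, -1, -1],
})
print(df1)
```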

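I want to build df2 so that it contains every word of df1 exactly once, together with its count (total number of occurrences).

I also want to sum the c1 column into a new column in df2, adding a row's c1 value to a word's sum only if that word appears in the row.

Expected output: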

Word      Totalcount     Points  
hello        2             2             
world        1             1              
how          1             1              
are          1             1              
you          1             1              
people       3             1              
I            1             1             
am           1             1              
fine         1             1             
Good         2             -2            
Morning      1             -1            
Evening      1             -1       

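First extract the text column with DataFrame.pop, split it into words with Series.str.split, and reshape the result with DataFrame.stack to get a Series with one word per row. Then DataFrame.join that Series back to the original frame, remove duplicates with DataFrame.drop_duplicates (so a word is counted at most once per id), and aggregate with GroupBy.agg using size for the count and sum for the points: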

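import pandas as pd

# df is the input frame from the question.
# Note: pop() removes the 'text' column from df in place.
s = (df.pop('text')
       .str.split(expand=True)
       .stack()
       .reset_index(level=1, drop=True)
       .rename('text'))

# Count each word once per id, then aggregate c1 per word.
df1 = (df.join(s)
         .reset_index(drop=True)
         .drop_duplicates(['id','text'])
         .groupby('text', sort=False)['c1']
         .agg([('Totalcount','size'),('Points','sum')])
         .reset_index()
         .rename(columns={'text':'Word'}))
print(df1)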
       Word  Totalcount  Points
0     Hello           2       2
1     world           1       1
2       how           1       1
3       are           1       1
4       you           1       1
5    people           3       1
6         I           1       1
7        am           1       1
8      fine           1       1
9      Good           2      -2
10  Morning           1      -1
11  Evening           1      -1
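
As an alternative sketch (not part of the original answer), the same reshaping can probably be written more compactly with `DataFrame.explode` and named aggregation, both of which need pandas 0.25 or later. The sample frame is rebuilt here so the block runs on its own, and the names `words` and `result` are only illustrative.

```python
import pandas as pd

# Sample frame from the question, rebuilt so this block is self-contained.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "text": [
        "Hello world how are you people",
        "Hello people I am fine  people",
        "Good Morning people",
        "Good Evening",
    ],
    "c1": [1, 1, -1, -1],
})

words = (
    df.assign(Word=df["text"].str.split())  # list of words per row
      .explode("Word")                       # one row per word
      .drop_duplicates(["id", "Word"])       # count a word once per id
)

result = (
    words.groupby("Word", sort=False)["c1"]
         .agg(Totalcount="size", Points="sum")  # named aggregation
         .reset_index()
)
print(result)
```

Unlike the `pop`-based version, this leaves `df` unchanged, which helps if the frame is reused afterwards.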

Edit:

For better performance, flatten the split words with chain.from_iterable and repeat the id and c1 values with numpy.repeat:

from itertools import chain

import pandas as pd

# Assumes df still has its 'text' column (i.e. pop() above was not run on it).
splitted = [x.split() for x in df['text']]
lens = [len(x) for x in splitted]

# Flatten the words and repeat id/c1 once per word of each row.
df = pd.DataFrame({
    'Word': list(chain.from_iterable(splitted)),
    'id': df['id'].values.repeat(lens),
    'c1': df['c1'].values.repeat(lens)
})
df1 = (df.drop_duplicates(['id','Word'])
         .groupby('Word', sort=False)['c1']
         .agg([('Totalcount','size'),('Points','sum')])
         .reset_index())
print(df1)
       Word  Totalcount  Points
0     Hello           2       2
1     world           1       1
2       how           1       1
3       are           1       1
4       you           1       1
5    people           3       1
6         I           1       1
7        am           1       1
8      fine           1       1
9      Good           2      -2
10  Morning           1      -1
11  Evening           1      -1
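
To see why repeating by the per-row word counts keeps `id` and `c1` aligned with the flattened word list, here is a small sketch on toy data. The inputs `rows` and `ids` are hypothetical and exist only for illustration.

```python
from itertools import chain

import numpy as np

rows = ["a b c", "d", "e f"]   # hypothetical toy input, one string per row
ids = np.array([10, 20, 30])   # hypothetical ids, one per row

split_rows = [r.split() for r in rows]
lens = [len(words) for words in split_rows]   # words per row: [3, 1, 2]

# Flatten the nested word lists into a single list.
flat_words = list(chain.from_iterable(split_rows))

# Repeat each id as many times as its row has words.
flat_ids = ids.repeat(lens)

print(flat_words)          # ['a', 'b', 'c', 'd', 'e', 'f']
print(flat_ids.tolist())   # [10, 10, 10, 20, 30, 30]
```

Both flattened sequences have the same length, so they can be placed side by side as columns of a new DataFrame.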
