How to calculate the average word length of long strings in a Pandas Series

I have a Series taken from a Pandas DataFrame:

19607    uhmm i guess i start wit my name.. trung<br />...
6205     you could say my interests revolve around tech...
57858    i always find it difficult to sum myself up wi...
29471    loyal, witty, silly, understanding, dedicated,...
47277    so basically, i hate these "fill in your own w...
25535    i am ending a relationship with a woman right ...
51731    i work and live in san francisco. i enjoy what...
19106    i love being outside when the sun is out. i <a...
18594    i've met someone and am in a long-term relatio...
7326     humanitarian, teamplayer, great work ethic, re...

I want to calculate the average word length for each row. How can I do that?

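Let's use str.split to split each sentence into words, then explode to get one row per word, str.len for the length of each word, and finally the mean per original index label: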

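# Series.mean(level=0) was removed in pandas 2.0; groupby(level=0) is the equivalent.
# sort=False keeps the rows in their original order.
s.str.split().explode().str.len().groupby(level=0, sort=False).mean()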

You will get something like this:

0
19607    4.000000
6205     5.250000
57858    4.000000
29471    9.000000
47277    4.000000
25535    4.000000
51731    4.000000
19106    3.545455
18594    4.555556
7326     7.333333
Name: 1, dtype: float64
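As a minimal, self-contained sketch of the same chain, the snippet below builds a small hypothetical Series (the three strings are made up for illustration and only the index labels are borrowed from the question) and runs each step separately. Note that punctuation is still counted here, so "loyal," has length 6.

```python
import pandas as pd

# Hypothetical sample data standing in for the series in the question.
s = pd.Series(
    ["i love being outside", "loyal, witty, silly", "so basically"],
    index=[19106, 29471, 47277],
)

# One row per word; every word keeps the index label of its source row.
words = s.str.split().explode()

# Length of each word, then the mean per original row.
# sort=False keeps the rows in their original order.
avg_len = words.str.len().groupby(level=0, sort=False).mean()
print(avg_len)
```

Because explode repeats the index label for every word, grouping on level 0 is what brings the per-word lengths back to one value per original row.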

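In my answer, I have:

removed punctuation (but kept spaces), since it should not be part of the count
split on whitespace
calculated the mean with a list comprehension
joined the result to the original series, so you can view the results side by side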

import re
import numpy as np
import pandas as pd
# s = pd.Series(d[1])  # I have called your pandas Series "s", following your Stack Overflow question. If it has a different name, change s.apply to your_series.apply.
# Keep only lowercase letters and whitespace, split on runs of whitespace,
# then average the word lengths in each row.
s1 = (s.apply(lambda x: re.sub(r'[^a-z\s]', '', x))
       .str.split(r'\s+')
       .apply(lambda x: np.mean([len(y) for y in x])))
df = pd.concat([s, s1], axis=1)
df
Out[1]: 
1         1
0                                                                 
19607  uhmm i guess i start wit my name.. trung<br />...  3.200000
6205   you could say my interests revolve around tech...  4.875000
57858  i always find it difficult to sum myself up wi...  3.700000
29471  loyal, witty, silly, understanding, dedicated,...  7.400000
47277  so basically, i hate these "fill in your own w...  3.500000
25535  i am ending a relationship with a woman right ...  3.700000
51731  i work and live in san francisco. i enjoy what...  3.600000
19106  i love being outside when the sun is out. i <a...  3.090909
18594  i've met someone and am in a long-term relatio...  4.000000
7326   humanitarian, teamplayer, great work ethic, re...  6.333333
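For readers who want to try the punctuation-stripping approach without the original data, here is a rough sketch on a small hypothetical Series. The sample strings and the names "text" and "avg_word_len" are assumptions made for illustration. It also swaps the re.sub call for the vectorized str.replace and uses str.split() with no pattern, which splits on runs of whitespace and drops empty strings at the ends.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, not the series from the question.
s = pd.Series(
    ["loyal, witty, silly", "i work. i enjoy it!"],
    index=[29471, 51731],
    name="text",
)

# Drop everything except lowercase letters and whitespace.
cleaned = s.str.replace(r"[^a-z\s]", "", regex=True)

# Split on runs of whitespace and average the word lengths per row.
avg_len = (
    cleaned.str.split()
    .apply(lambda ws: np.mean([len(w) for w in ws]))
    .rename("avg_word_len")
)

# Side-by-side view of the text and its average word length.
df = pd.concat([s, avg_len], axis=1)
print(df)
```

With the punctuation removed, "loyal, witty, silly" averages 5.0 rather than the higher value it gets when the commas are counted. A row that contains no letters at all would yield NaN, since the mean of an empty list is undefined.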
