I have a Series taken from a Pandas DataFrame:
19607 uhmm i guess i start wit my name.. trung<br />...
6205 you could say my interests revolve around tech...
57858 i always find it difficult to sum myself up wi...
29471 loyal, witty, silly, understanding, dedicated,...
47277 so basically, i hate these "fill in your own w...
25535 i am ending a relationship with a woman right ...
51731 i work and live in san francisco. i enjoy what...
19106 i love being outside when the sun is out. i <a...
18594 i've met someone and am in a long-term relatio...
7326 humanitarian, teamplayer, great work ethic, re...
I want to compute the average word length for each row. How can I do that?
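Use str.split to break each sentence into words, then explode to get one word per row, and str.len to measure each word. Finally, average the lengths per original index value:
# Series.mean(level=0) was removed in pandas 2.0; group by the index level instead
s.str.split().explode().str.len().groupby(level=0, sort=False).mean()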
You will get something like this:
0
19607 4.000000
6205 5.250000
57858 4.000000
29471 9.000000
47277 4.000000
25535 4.000000
51731 4.000000
19106 3.545455
18594 4.555556
7326 7.333333
Name: 1, dtype: float64
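To make the pipeline easier to follow, here is a minimal self-contained sketch on a tiny made-up Series. The sample sentences and the variable name avg_len are assumptions for illustration only; the index values merely mimic those in the question. It groups by the index level rather than calling mean(level=0), which newer pandas versions no longer accept.

```python
import pandas as pd

# Hypothetical sample data: the text is invented, the index mimics the question.
s = pd.Series(
    ["i love tech", "loyal, witty, silly", "so basically"],
    index=[19607, 6205, 57858],
)

avg_len = (
    s.str.split()                      # split each row on whitespace -> list of words
     .explode()                        # one word per row, original index repeated
     .str.len()                        # length of each word (punctuation included)
     .groupby(level=0, sort=False)     # regroup by original index, keep row order
     .mean()                           # average word length per original row
)
print(avg_len)
```

Note that punctuation attached to a word (for example the comma in "loyal,") is counted as part of that word here.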
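In my answer, I have:
- Removed punctuation (but kept whitespace), since punctuation should not be part of the count
- Split on whitespace
- Computed the mean with a list comprehension
- Joined the result to the original Series so you can view the results side by side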
import re
import numpy as np
import pandas as pd
# s = pd.Series(d[1])  # I have called your pandas Series "s", as in your Stack Overflow question. If it has a different name, change s.apply to your_series.apply.
s1 = (s.apply(lambda x: re.sub(r'[^a-z\s]', '', x))   # keep only lowercase letters and whitespace
      .str.split(r'\s+')                               # split on runs of whitespace
      .apply(lambda x: np.mean([len(y) for y in x])))  # mean word length per row
df = pd.concat([s, s1], axis=1)  # original text and average side by side
df
Out[1]:
1 1
0
19607 uhmm i guess i start wit my name.. trung<br />... 3.200000
6205 you could say my interests revolve around tech... 4.875000
57858 i always find it difficult to sum myself up wi... 3.700000
29471 loyal, witty, silly, understanding, dedicated,... 7.400000
47277 so basically, i hate these "fill in your own w... 3.500000
25535 i am ending a relationship with a woman right ... 3.700000
51731 i work and live in san francisco. i enjoy what... 3.600000
19106 i love being outside when the sun is out. i <a... 3.090909
18594 i've met someone and am in a long-term relatio... 4.000000
7326 humanitarian, teamplayer, great work ethic, re... 6.333333
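The same idea (strip punctuation, split on whitespace, average the word lengths) can also be sketched with vectorized string methods instead of apply. This is only an illustrative variant under stated assumptions, not the code from the answer above: the sample sentences and the column names text and avg_word_len are made up, and the text is lowercased first so that capital letters are not stripped along with the punctuation.

```python
import pandas as pd

# Hypothetical sample data: the text is invented, the index mimics the question.
s = pd.Series(
    ["loyal, witty, silly", "i work. i play!"],
    index=[29471, 51731],
)

# Lowercase, then drop every character that is not a-z or whitespace.
cleaned = s.str.lower().str.replace(r"[^a-z\s]", "", regex=True)

# One word per row, then average the word lengths per original index value.
words = cleaned.str.split().explode()
s1 = words.str.len().groupby(level=0, sort=False).mean()

# Place the original text and the average side by side (aligned on the index).
result = pd.DataFrame({"text": s, "avg_word_len": s1})
print(result)
```

Using str.split() with no pattern ignores leading and trailing whitespace, so rows that start or end with a space do not contribute empty "words" to the average.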