collections.Counter() is counting letters instead of words



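I need to find the most frequently occurring words in the df['messages'] column of a DataFrame. The DataFrame has many columns, so I formatted all the rows and stored them in one variable, all_words, as a single string with the words joined by spaces. But when I try to count the most common words, it shows me the most common letters instead. My data looks like this: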

0    abc de fghi klm
1    qwe sd fd s dsdd sswd??
3    ded fsf sfsdc wfecew wcw.

Here is my code snippet:

from collections import Counter
all_words = ' '
for msg in df['messages'].values:
    words = str(msg).lower()
    all_words = all_words + str(words) + ' '

count = Counter(all_words)
count.most_common(3)

This is its output:

[(' ', 5260), ('a', 2919), ('h', 1557)]

I also tried df['messages'].value_counts(), but it returns the most frequent rows (whole sentences) rather than words. Something like:

asad adas asda     10
asaa as awe        3
wedxew dqwed       1

Please tell me where I went wrong, or suggest another approach that works.

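Counter iterates over whatever you pass to it. If you pass a string, it iterates over its characters, so characters are what it counts. If you give it a list in which each element is a word, it counts by word.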

from collections import Counter
text = "spam and more spam"
c = Counter()
c.update(text)  # text is a str, count chars
c
# Counter({'s': 2, 'p': 2, 'a': 3, 'm': 3, [...], 'e': 1})
c = Counter()
c.update(text.split())  # now is a list like: ['spam', 'and', 'more', 'spam']
c
# Counter({'spam': 2, 'and': 1, 'more': 1})

So you should do something like this:

from collections import Counter
all_words = []
for msg in df['messages'].values:
    words = str(msg).lower().split()  # split each message into a list of words
    all_words.extend(words)  # add the individual words, not the whole message
count = Counter(all_words)
count.most_common(3)
# the same, but with a generator expression
count = Counter(word for msg in df['messages'].values for word in str(msg).lower().split())

Alternatively, split each message on spaces and extend the list with the resulting words:

from collections import Counter
all_words = []
for msg in df['messages'].values:
    words = str(msg).lower().strip().split(' ')
    all_words.extend(words)

count = Counter(all_words)
count.most_common(3)
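
The sample rows in the question contain trailing punctuation ("sswd??", "wcw."), which a plain whitespace split would keep attached to the word. One possible way to handle that is sketched below. It is only an illustration: the `messages` list is a hypothetical stand-in for `df['messages'].values`, `count_words` is a made-up helper name, and the `\w+` pattern is an assumption about what should count as a word.

```python
import re
from collections import Counter

# Hypothetical stand-in for df['messages'].values (sample rows from the question).
messages = [
    "abc de fghi klm",
    "qwe sd fd s dsdd sswd??",
    "ded fsf sfsdc wfecew wcw.",
]

def count_words(msgs):
    """Count lowercase word tokens across all messages, ignoring punctuation."""
    counts = Counter()
    for msg in msgs:
        # \w+ matches runs of letters, digits and underscores,
        # so "sswd??" contributes the token "sswd".
        counts.update(re.findall(r"\w+", str(msg).lower()))
    return counts

word_counts = count_words(messages)
print(word_counts.most_common(3))
```

With the real data, passing `df['messages'].values` to the helper in place of `messages` should work the same way, since each value is converted with `str()` before tokenizing.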
