Word count in Python

I have a list of strings in Python.

list = [ "Sentence1. Sentence2...", "Sentence1. Sentence2...",...]

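I want to remove stop words and count the occurrences of each word across all the strings. Is there a simple way to do this?

I am currently thinking of using CountVectorizer() from scikit-learn, then iterating over each word and combining the results.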

If you don't mind installing a new Python library, I suggest gensim. Its first tutorial does exactly what you are asking for:

# remove common words and tokenize
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]

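Then you need to create a dictionary for your document corpus and build the bag-of-words representation.

from gensim import corpora

dictionary = corpora.Dictionary(texts)  # maps each distinct token to an integer id
dictionary.save('/tmp/deerwester.dict')  # store the dictionary, for future reference
print(dictionary)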
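
To make the bag-of-words idea concrete, here is a rough plain-Python sketch of what that step produces. It is not gensim's implementation (in gensim you would call dictionary.doc2bow(text) on each document), and the sample texts and the names token2id and bows are made up for illustration.

```python
from collections import Counter

# Hypothetical tokenized documents, standing in for `texts` above.
texts = [["human", "machine", "interface"], ["machine", "survey", "machine"]]

# Assign each distinct token an integer id in order of first appearance.
# (gensim assigns its own ids; this ordering is only an assumption for the sketch.)
token2id = {}
for text in texts:
    for token in text:
        token2id.setdefault(token, len(token2id))

# Represent each document as sorted (token_id, count) pairs.
bows = [sorted(Counter(token2id[t] for t in text).items()) for text in texts]
print(token2id)
print(bows)
```

Each document becomes a short list of (id, count) pairs, and that is the form tf-idf and LDA models work on.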

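You can then weight the result with tf-idf and the like, and running LDA on it is easy.

Take a look at Tutorial 1 here.

You haven't fully explained what you have in mind, but this may be what you want: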

import collections

counts = collections.Counter(' '.join(your_list).split())
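
The one-liner above counts every whitespace-separated token, so stop words stay in and "sat." is counted separately from "sat". If that matters, something along these lines might work. It is only a sketch: the sample your_list is invented, the stop list is the short one from the gensim answer, and stripping punctuation with string.punctuation is my assumption about what you want.

```python
import collections
import string

# Hypothetical sample data and stop list, for illustration only.
your_list = ["The cat sat. The dog sat.", "A cat and a dog."]
stoplist = set("for a of the and to in".split())

counts = collections.Counter(
    word
    for sentence in your_list
    # Lowercase, split on whitespace, and trim leading/trailing punctuation.
    for word in (token.strip(string.punctuation) for token in sentence.lower().split())
    # Skip tokens that were pure punctuation, and skip stop words.
    if word and word not in stoplist
)
print(counts.most_common(3))
```

counts is a regular Counter, so counts["cat"] gives the total for one word across all strings and most_common(n) gives the n most frequent words.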
