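Why does WordCloud for Python count the letter "S" as a frequent word in the cloud?

I'm learning Python's WordCloud package and testing it with NLTK's Moby Dick text. A snippet of the text is shown below: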

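(Image: example string)

As the highlighted part of the image shows, every possessive apostrophe has been escaped as "/'S", and WordCloud seems to include this in the frequency count as "S":

(Image: word frequency distribution)

This of course causes a problem: "S" is treated as a high-frequency word, so the frequencies of all the other words in the cloud are skewed:

(Image: example of my skewed cloud)

In the tutorial I'm following, which uses the same Moby Dick string, WordCloud does not seem to count the "S". Am I missing an attribute somewhere, or do I have to remove "/'S" from my string manually?

Here is a summary of my code: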

import nltk
import matplotlib.pyplot as plt
from wordcloud import WordCloud

example_corpus = nltk.corpus.gutenberg.words("melville-moby_dick.txt")
word_list = ["".join(word) for word in example_corpus]
novel_as_string = " ".join(word_list)
wordcloud = WordCloud().generate(novel_as_string)
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

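In this kind of application, it is common to filter the word list with stopwords first, because you don't want simple words such as a, an, the, it, ... to dominate your result.

I modified your code slightly; I hope it helps. You can also inspect the contents of the stopwords list.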

import nltk
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
example_corpus = nltk.corpus.gutenberg.words("melville-moby_dick.txt")
# word_list = ["".join(word) for word in example_corpus]  # this statement seems to change nothing
# build the stopword set once, then use it to filter the words
stop_words = set(stopwords.words('english'))
word_list = [word for word in example_corpus if word not in stop_words]
novel_as_string = " ".join(word_list)
wordcloud = WordCloud().generate(novel_as_string)
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

Output: see the word cloud on Imgur
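One detail worth keeping in mind with the filter above: the entries of NLTK's English stopword list are lowercase, so a case-sensitive `not in` test lets capitalized tokens such as "The" or "S" through. The following is a minimal sketch of a case-insensitive variant, using only the standard library. The small `stop_words` set and the `tokens` sample are made up for illustration; they are assumptions, not the real NLTK data.

```python
# Illustrative stopword set (an assumption for this sketch; in real code
# it would be set(stopwords.words('english')), whose entries are lowercase).
stop_words = {"a", "an", "the", "it", "s"}

# Small hand-written token sample with mixed case.
tokens = ["The", "whale", "'", "s", "jaw", "and", "THE", "Whale", "'", "S", "tail"]

# Lowercase each token before the lookup, and keep only alphabetic tokens
# so that punctuation such as "'" is dropped as well.
filtered = [t for t in tokens if t.isalpha() and t.lower() not in stop_words]

print(filtered)
```

With this sample, both "s" and "S" are removed, along with the apostrophes and both spellings of "the".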

It looks like your input is part of the problem. If you inspect it like this,

import nltk

corpus = nltk.corpus.gutenberg.words("melville-moby_dick.txt")
words = [word for word in corpus]
print(words[215:230])

you get

['RICHARDSON', "'", 'S', 'DICTIONARY', 'KETOS', ',', 'GREEK', '.', 'CETUS', ',', 'LATIN', '.', 'WHOEL', ',', 'ANGLO']
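To make the effect concrete, here is a minimal sketch using only the standard library. It takes the first few tokens from the output above, joins them with spaces as the question's code does, and counts words with a simple regex. The `count_words` helper is hypothetical and only a stand-in: WordCloud's own tokenization may differ, and may vary between versions.

```python
import re
from collections import Counter

# Tokens as returned by the corpus reader: the possessive "'S" has
# already been split into two separate tokens, "'" and "S".
tokens = ['RICHARDSON', "'", 'S', 'DICTIONARY', 'KETOS', ',', 'GREEK', '.']

# Joining with spaces keeps "S" as a standalone word in the string.
text = " ".join(tokens)


def count_words(s):
    """Hypothetical stand-in counter: counts runs of ASCII letters."""
    return Counter(re.findall(r"[A-Za-z]+", s))


counts = count_words(text)
print(text)
print(counts["S"])
```

Because the apostrophe and the "S" are separate list items, the joined string contains "S" surrounded by spaces, and any word counter that accepts one-letter words will count it.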

There are a few things you can do to work around this. You can keep only strings longer than one character,

words = [word for word in corpus if len(word) > 1]

You could also try one of the other files that nltk provides, or read the raw input yourself and decode it properly.
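If one-letter words such as "I" need to survive, the length filter is too blunt. Another possible approach, offered here only as a sketch and not as part of the suggestions above, is to glue each possessive back onto its word before joining the tokens. The `rejoin_possessives` helper is hypothetical, and it assumes possessives always appear as the three-token pattern word, "'", "s" (or "S") seen in the output above.

```python
def rejoin_possessives(tokens):
    """Merge the pattern word, "'", "s"/"S" back into a single token."""
    merged = []
    i = 0
    while i < len(tokens):
        is_possessive = (
            i + 2 < len(tokens)
            and tokens[i].isalpha()
            and tokens[i + 1] == "'"
            and tokens[i + 2] in ("s", "S")
        )
        if is_possessive:
            # Combine the three tokens and skip past all of them.
            merged.append(tokens[i] + "'" + tokens[i + 2])
            i += 3
        else:
            merged.append(tokens[i])
            i += 1
    return merged


tokens = ['RICHARDSON', "'", 'S', 'DICTIONARY', 'KETOS', ',', 'GREEK', '.']
print(rejoin_possessives(tokens))
```

The merged list can then be passed through `" ".join(...)` as before; punctuation tokens are left untouched, so they may still need separate filtering.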
