
Adding words to the stop list of scikit-learn's CountVectorizer

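Scikit-learn's CountVectorizer class lets you pass the string 'english' to the stop_words argument. I want to add some words of my own to this predefined list. Can anyone tell me how to do this?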

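According to the source code of sklearn.feature_extraction.text, the full list ENGLISH_STOP_WORDS (actually a frozenset, taken from stop_words) is exposed through __all__. So if you want to use that list plus some additional items, you can do something like this: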

from sklearn.feature_extraction import text
# union() returns a new frozenset; ENGLISH_STOP_WORDS itself is not modified
stop_words = text.ENGLISH_STOP_WORDS.union(my_additional_stop_words)

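(where my_additional_stop_words is any sequence of strings) and use the result as the stop_words argument. The value given to CountVectorizer.__init__ is parsed by _check_stop_list, which passes the new frozenset straight through.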
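As a minimal end-to-end sketch of the approach above, the following builds the extended stop list and hands it to a CountVectorizer. The extra words and the two sample documents are hypothetical, chosen only for illustration. One assumption to flag: recent scikit-learn releases appear to validate that stop_words is either 'english', a list, or None, so the frozenset is converted to a list here; on older releases the frozenset itself should also be accepted.

```python
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical extra stop words, used only for this illustration.
my_additional_stop_words = ["scikit", "learn"]

# union() accepts any iterable of strings and returns a new frozenset.
stop_words = text.ENGLISH_STOP_WORDS.union(my_additional_stop_words)

# Assumption: newer scikit-learn versions expect a list rather than a
# frozenset, so convert before passing it as the stop_words argument.
vectorizer = CountVectorizer(stop_words=list(stop_words))

# Hypothetical sample documents.
docs = [
    "the scikit learn library counts words",
    "the library counts tokens",
]
counts = vectorizer.fit_transform(docs)

# Both the built-in and the added stop words are absent from the vocabulary.
vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)
```

Converting with list(...) keeps the snippet usable across versions, since a plain list is accepted by both older and newer releases.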
