TfidfVectorizer does not remove words that appear in multiple documents

I have a dataset that I am trying to cluster. Although I set min_df and max_df in the TfidfVectorizer, the output that MiniBatchKMeans returns contains words that, according to the documentation, the vectorizer should have eliminated, because they occur in at least one other document (max_df=1.).

Setup:

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

min_df = 5
max_df = 1.
vectorizer = TfidfVectorizer(stop_words='english', min_df=min_df,
                             max_df=max_df, max_features=100000)  # corpus is in English
c_vectorizer = CountVectorizer(stop_words='english', min_df=min_df,
                               max_df=max_df, max_features=100000)  # corpus is in English
X = vectorizer.fit_transform(dataset)
C_X = c_vectorizer.fit_transform(dataset)

Output of MiniBatchKMeans:

Topic0: information book history read good great lot author write useful use recommend need time make know provide like easy excellent just learn look work want help reference buy guide interested
Topic1: book read good great use make write buy time work like just recommend know look year need author want think help new life way love people really excellent easy say
Topic2: story novel character book life read love time write make like reader great end woman world good man work plot way people just family know come young author think year

As you can see, "book" appears in all 3 topics, but with max_df=1. shouldn't it have been removed?

From the TfidfVectorizer documentation:

max_df : float or int, default=1.0

When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float in range [0.0, 1.0], the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.

So max_df in the question is set to the default: the trailing dot in max_df=1. makes it the float 1.0, meaning "ignore terms that occur in more than 100% of documents", which filters out nothing. To prune by an absolute document count, you must pass an integer.
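To make the float/int distinction concrete, here is a minimal sketch on a made-up two-document corpus (illustrative data, not the asker's):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["books cats", "books dogs"]
# Float 1.0 (the default): keep terms occurring in up to 100% of documents.
print(TfidfVectorizer(max_df=1.).fit(docs).get_feature_names_out())
# -> ['books' 'cats' 'dogs']
# Integer 1: drop any term that occurs in more than 1 document.
print(TfidfVectorizer(max_df=1).fit(docs).get_feature_names_out())
# -> ['cats' 'dogs']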

You probably want a setting like "remove words that appear in more than 99% of the documents":

from sklearn.feature_extraction.text import TfidfVectorizer

raw_data = [
    "books cats coffee",
    "books cats",
    "books and coffee and coffee",
    "books and words and coffee",
]
tfidf = TfidfVectorizer(stop_words="english", max_df=0.99)
X = tfidf.fit_transform(raw_data)
print(tfidf.get_feature_names_out())
print(X.todense())

Output:

['cats' 'coffee' 'words']
[[0.77722116 0.62922751 0.        ]
 [1.         0.         0.        ]
 [0.         1.         0.        ]
 [0.         0.53802897 0.84292635]]

"books" occurs in all four documents (100% > 99%), so it is pruned, and "and" is removed by the English stop-word list.

If you really do want to remove any word that occurs in at least one other document, CountVectorizer is the better approach:

from sklearn.feature_extraction.text import CountVectorizer

raw_data = [
    "unique books cats coffee",
    "case books cats",
    "for books and words coffee and coffee",
    "each books and words and coffee",
]
cv = CountVectorizer(max_df=1)  # integer: drop terms occurring in more than 1 document
X = cv.fit_transform(raw_data)
print(cv.get_feature_names_out())
print(X.todense())

Output:

['case' 'each' 'for' 'unique']
[[0 0 0 1]
 [1 0 0 0]
 [0 0 1 0]
 [0 1 0 0]]

Here max_df=1 is the integer 1, so every term with a document frequency above 1 (books, cats, coffee, words, and) is pruned, leaving only the words unique to a single document.
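To tie this back to the question, the corrected vectorizer can be fed straight into MiniBatchKMeans. The sketch below is illustrative only: the toy corpus, n_clusters=2, and the top-term printing are assumptions, not the asker's actual setup.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import MiniBatchKMeans

docs = [
    "books cats coffee",
    "books cats",
    "books and coffee and coffee",
    "books and words and coffee",
]
vectorizer = TfidfVectorizer(stop_words="english", max_df=0.99)
X = vectorizer.fit_transform(docs)

km = MiniBatchKMeans(n_clusters=2, n_init=3, random_state=0)
km.fit(X)

# Print the highest-weighted terms of each cluster centroid, mirroring
# the Topic0/Topic1/Topic2 listings above.
terms = vectorizer.get_feature_names_out()
order = km.cluster_centers_.argsort()[:, ::-1]
for i in range(km.n_clusters):
    print(f"Topic{i}:", " ".join(terms[j] for j in order[i, :3]))

Because "books" is filtered out by max_df=0.99, it can no longer dominate every topic the way "book" did in the question's output.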
