How to compute TF-IDF on a specific dataset



I have a dataset of articles. The online examples I've found usually hard-code the corpus. If I want to compute TF-IDF on my own dataset, what should I do?

Note: I have created a DataFrame to store the data. Here is my code:

# Run once in a terminal: pip install scikit-learn
from sklearn.feature_extraction.text import CountVectorizer 
corpus = merged_df['title']
vectorizer = CountVectorizer()
wordFrequency = vectorizer.fit_transform(corpus)
word = vectorizer.get_feature_names_out()  # get_feature_names() was removed in newer scikit-learn
print(word)

#-----------------------
from sklearn.feature_extraction.text import TfidfTransformer 
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(wordFrequency)
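
For reference, here is a minimal sketch of how the resulting matrix could be inspected; it assumes pandas is installed and that tfidf, vectorizer, and merged_df are the objects from the snippet above:

import pandas as pd

# Rows are documents (titles), columns are vocabulary terms.
# tfidf is a sparse matrix, so convert it to a dense array for display.
tfidf_df = pd.DataFrame(tfidf.toarray(),
                        columns=vectorizer.get_feature_names_out(),
                        index=merged_df.index)
print(tfidf_df.head())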

You might try TfidfVectorizer instead of CountVectorizer:

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(min_df=5)  # min_df is the minimum document frequency for a term to be kept; optional
wordFrequency = vectorizer.fit_transform(corpus)
word = vectorizer.get_feature_names_out()  # get_feature_names() was removed in newer scikit-learn
print(word)

Basically, just change the vectorizer you use. Cheers.
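
If you then want to see the highest-weighted terms for each title, a rough sketch along these lines could work; it assumes the TfidfVectorizer fit above, where wordFrequency (despite its name) now holds TF-IDF weights, and that numpy is available:

import numpy as np

terms = vectorizer.get_feature_names_out()
weights = wordFrequency.toarray()              # TF-IDF weights, one row per title
top_n = 5
for i, row in enumerate(weights[:3]):          # first three titles as a demo
    top_idx = np.argsort(row)[::-1][:top_n]    # indices of the largest weights
    print(corpus.iloc[i], '->', [terms[j] for j in top_idx])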
