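I have a dataset of articles. The examples I find online usually hardcode the corpus. How do I compute TF-IDF for my own dataset?
Note: I created a DataFrame to store the data. Here is my code: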
pip install scikit-learn
from sklearn.feature_extraction.text import CountVectorizer
corpus = merged_df['title']
vectorizer = CountVectorizer()
wordFrequency = vectorizer.fit_transform(corpus)
word = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
print(word)
#-----------------------
from sklearn.feature_extraction.text import TfidfTransformer
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(wordFrequency)
You could try TfidfVectorizer instead of CountVectorizer:
from sklearn.feature_extraction.text import TfidfVectorizer
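vectorizer = TfidfVectorizer(min_df=5)  # optional: min_df=5 ignores terms that appear in fewer than 5 documents
wordFrequency = vectorizer.fit_transform(corpus)  # holds TF-IDF weights here, not raw counts
word = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2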
print(word)
Basically, you only need to change the vectorizer you use. Cheers.