How to get the frequency of the terms generated by tfidf.get_feature_names_out()

After fitting with tfidf, I am looking at the generated features:

from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
'This is the first document.',
'This document is the second document.',
'And this is the third one.',
'Is this the first document?',
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())

But I would also like to get the frequency of each term.

One way to count the number of sentences in which a particular word appears is to use sklearn.feature_extraction.text.CountVectorizer.

corpus = [
'This is the first document.',
'This document is the second document.',
'And this is the third one.',
'Is this the first document?',
]
from sklearn.feature_extraction.text import CountVectorizer
# since we're counting sentences and not words, use binary=True
cv = CountVectorizer(binary=True)
X = cv.fit_transform(corpus)
print(cv.vocabulary_)  # all the words in the corpus with their column index
# {'this': 8, 'is': 3, 'the': 6, 'first': 2, 'document': 1, 'second': 5, 'and': 0, 'third': 7, 'one': 4}
# show occurrences (not count) of vocabulary words in sentences (each line/row) in corpus
print(X.toarray())
# [[0 1 1 1 0 0 1 0 1]
#  [0 1 0 1 0 1 1 0 1]
#  [1 0 0 1 1 0 1 1 1]
#  [0 1 1 1 0 0 1 0 1]]
# So, for example the word "this" is at column index 8 in the matrix above
# How many sentences in the corpus have the word "this"?
print(X[:, cv.vocabulary_["this"]].sum())
# 4
# How many sentences in the corpus have the word "document"?
print(X[:, cv.vocabulary_["document"]].sum())
# 3
