Is there a way to find sentences with TF-IDF in Python?


x = ["hello there", "hello world", "my name is john"]


This is the TF-IDF output:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["hello there", "hello world", "my name is john"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
# X is a sparse matrix; convert it to a dense array to inspect the weights.
X.toarray()

array([[0.60534851, 0.        , 0.        , 0.        , 0.        ,
        0.79596054, 0.        ],
       [0.60534851, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.79596054],
       [0.        , 0.5       , 0.5       , 0.5       , 0.5       ,
        0.        , 0.        ]])



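I believe that with TF-IDF you can only compute the weights of individual words within a sentence (or document), which means you cannot use it to compute the weight of a sentence within other sentences or documents.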
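To make the per-word nature of these weights concrete, here is a minimal sketch that rebuilds the matrix from the question by hand, using only the standard library. It assumes the default TfidfVectorizer settings as I understand them: smoothed IDF, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row. The whitespace tokenization and the helper name tfidf_row are my own simplifications, not part of scikit-learn.

```python
import math

corpus = ["hello there", "hello world", "my name is john"]

# Tokenize on whitespace; enough for this lowercase, punctuation-free corpus.
docs = [sentence.split() for sentence in corpus]
vocabulary = sorted({word for doc in docs for word in doc})
n_docs = len(docs)

# Document frequency: in how many sentences each word appears.
df = {word: sum(word in doc for doc in docs) for word in vocabulary}

# Smoothed IDF (assumed to match scikit-learn's default smooth_idf=True).
idf = {word: math.log((1 + n_docs) / (1 + df[word])) + 1 for word in vocabulary}


def tfidf_row(doc):
    # Raw term count times IDF, then L2-normalize the row.
    raw = [doc.count(word) * idf[word] for word in vocabulary]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]


matrix = [tfidf_row(doc) for doc in docs]

print(vocabulary)
for sentence, row in zip(corpus, matrix):
    print(sentence, [round(value, 8) for value in row])
```

Each column corresponds to one word of the sorted vocabulary (hello, is, john, my, name, there, world), so every number in the matrix is the weight of a single word in a single sentence. There is no column that stands for a whole sentence.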


import math

corpus = ["hello there", "hello world"]

# Read the document to search; "your_document.txt" is a placeholder path.
with open("your_document.txt", "r") as file:
    text = file.read()


def computeTF(sentences, document):
    tf = {s: 0 for s in sentences}
    # Number of whitespace-separated words in the document.
    doc_len = len(document.split())
    for s in sentences:
        # Since we're counting a whole sentence (containing >= 1 words), we need to
        # count that whole sentence as a single word.
        s_length = len(s.split())
        tf[s] = document.count(s)
        # Once the number of occurrences of sentence s is known, recalculate the
        # number of words in the document (treating s as a single word).
        doc_len = doc_len - tf[s] * (s_length - 1)
    for s in sentences:
        # The word count is only final after the previous loop, so a separate loop
        # computes the actual weight of each sentence.
        tf[s] = tf[s] / doc_len if doc_len > 0 else 0
    return tf


def computeIDF(tf, sentences):
    idfDict = {s: tf[s] for s in sentences}
    N = len(tf)
    for s in sentences:
        # Sentences that never occur in the document get a weight of 0.
        if idfDict[s] > 0:
            idfDict[s] = math.log10(N)
        else:
            idfDict[s] = 0
    return idfDict


tf = computeTF(corpus, text)
idfDict = computeIDF(tf, corpus)
for s in corpus:
    # The last value printed is the one returned by computeIDF.
    print("Sentence: {}, TF: {}, TF-idf: {}".format(s, tf[s], idfDict[s]))
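To check the sentence-as-a-single-word counting without a file on disk, here is a small self-contained sketch of the same idea run on an in-memory string. The function name sentence_tf and the sample text are hypothetical, chosen only for illustration.

```python
def sentence_tf(sentences, document):
    """Term frequency of each sentence, counting a matched sentence as one word."""
    counts = {s: document.count(s) for s in sentences}
    # Start from the plain word count, then collapse every matched
    # multi-word sentence into a single token.
    total = len(document.split())
    for s, count in counts.items():
        total -= count * (len(s.split()) - 1)
    return {s: (count / total if total > 0 else 0.0) for s, count in counts.items()}


# Hypothetical sample text standing in for the contents of a file.
sample = "hello there my name is john hello there again"
result = sentence_tf(["hello there", "hello world"], sample)
print(result)
```

The sample has 9 words and "hello there" occurs twice, so the adjusted length is 9 - 2 = 7 and its frequency is 2/7, while "hello world" never occurs and gets 0. Note that str.count matches raw substrings, so "hello there" would also be counted inside a longer string such as "othello thereafter"; a stricter version would match on word boundaries.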

