Suppose I fit on a single document:
text="bla agao haa"
singleTFIDF = TfidfVectorizer(analyzer='char_wb', ngram_range=(4, 6),
                              preprocessor=my_tokenizer, max_features=100).fit([text])
single=singleTFIDF.transform([text])
query = singleTFIDF.transform(["new coming document"])
If I understand correctly, transform only uses the weights learned during fit. So for a new document, query contains the weight of each feature of that document. It looks like [[0, 0, 0.13, 0.4, 0]].
Since I am using n-grams, I would also like to get the features of this new document, so that I know the weight of each feature in the new document.
Edit:
In my case, I get the following arrays for single and query:
single
[[0.10721125 0.10721125 0.10721125 0.10721125 0.10721125 0.10721125
0.10721125 0.10721125 0.10721125 0.10721125 0.10721125 0.10721125
0.10721125 0.10721125 0.10721125 0.10721125 0.10721125 0.10721125
0.10721125 0.10721125 0.10721125 0.10721125 0.10721125 0.10721125
0.10721125 0.10721125 0.10721125 0.10721125 0.10721125 0.10721125
0.10721125 0.10721125 0.10721125 0.10721125 0.10721125 0.10721125
0.10721125 0.10721125 0.10721125 0.10721125 0.10721125 0.10721125
0.10721125 0.10721125 0.10721125 0.10721125 0.10721125 0.10721125
0.10721125 0.10721125 0.10721125 0.10721125 0.10721125 0.10721125
0.10721125 0.10721125 0.10721125 0.10721125 0.10721125 0.10721125
0.10721125 0.10721125 0.10721125 0.10721125 0.10721125 0.10721125
0.10721125 0.10721125 0.10721125 0.10721125 0.10721125 0.10721125
0.10721125 0.10721125 0.10721125 0.10721125 0.10721125 0.10721125
0.10721125 0.10721125 0.10721125 0.10721125 0.10721125 0.10721125
0.10721125 0.10721125 0.10721125]]
query
[[0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0.57735027 0.57735027 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0.57735027 0. 0.
0. 0. 0. ]]
But this is strange: in the fitted corpus (single), every feature has a weight of 0.10721125. How can a feature of the new document then have a weight of 0.57735027?
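For reference, both values are consistent with l2-normalized vectors whose non-zero components are all equal (a quick check, assuming the vectorizer's default norm='l2'):

```python
import math

# single: all 87 vocabulary features occur in the one training document with
# the same tf-idf score, so l2 normalization maps each one to 1/sqrt(87)
print(1 / math.sqrt(87))  # 0.10721125...

# query: only 3 features of the new document are in the vocabulary, so the
# three equal scores are normalized to 1/sqrt(3)
print(1 / math.sqrt(3))   # 0.57735026...
```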
Details of how scikit-learn computes tf-idf can be found here, and below is an example implementing it with word n-grams.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
# Train the vectorizer
text="this is a simple example"
singleTFIDF = TfidfVectorizer(ngram_range=(1,2)).fit([text])
singleTFIDF.vocabulary_ # show the word-matrix position pairs
# Analyse the training string - text
single=singleTFIDF.transform([text])
single.toarray() # displays the resulting matrix - all values are equal because all terms are present
# Analyse three new strings with the trained vectorizer
doc_1 = ['is this example working', 'hopefully it is a good example', 'no matching words here']
query = singleTFIDF.transform(doc_1)
query.toarray() # displays the resulting matrix - only matched terms have non-zero values
# Compute the cosine similarity between text and each string in doc_1 - the second string has only two matching terms, so its similarity is lower; the third has none and scores 0
cos_similarity = cosine_similarity(single.A, query.A)
Output:
singleTFIDF.vocabulary_
Out[297]:
{'this': 5,
'is': 1,
'simple': 3,
'example': 0,
'this is': 6,
'is simple': 2,
'simple example': 4}
single.toarray()
Out[299]:
array([[0.37796447, 0.37796447, 0.37796447, 0.37796447, 0.37796447,
0.37796447, 0.37796447]])
query.toarray()
Out[311]:
array([[0.57735027, 0.57735027, 0. , 0. , 0. ,
0.57735027, 0. ],
[0.70710678, 0.70710678, 0. , 0. , 0. ,
0. , 0. ],
[0. , 0. , 0. , 0. , 0. ,
0. , 0. ]])
np.sum(np.square(query.toarray()), axis=1) # note how all rows with non-zero scores have been normalised to 1.
Out[3]: array([1., 1., 0.])
cos_similarity
Out[313]: array([[0.65465367, 0.53452248, 0. ]])
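These similarity values can be verified by hand: since both vectors are l2-normalized, cosine similarity reduces to a plain dot product of the matching components.

```python
import math

# single has 7 equal components of 1/sqrt(7); the first query row has 3
# matching terms at 1/sqrt(3), the second has 2 at 1/sqrt(2)
print(3 / math.sqrt(7 * 3))  # 0.65465367...
print(2 / math.sqrt(7 * 2))  # 0.53452248...
```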
The new document has different weights because TfidfVectorizer normalizes the weights. To see the raw scores, set the parameter norm to None; the default value of norm is 'l2'.
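With norm=None on the same toy corpus, the unnormalized tf-idf products are returned directly instead of being scaled to unit length:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

text = "this is a simple example"
# norm=None keeps the raw tf * idf scores; the default norm='l2' would scale
# each row to unit length instead
raw = TfidfVectorizer(ngram_range=(1, 2), norm=None).fit([text])
print(raw.transform([text]).toarray())
# every entry is 1.0 here: tf = 1 and, with a single training document, idf = 1
```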
To learn more about the effect of norm, I suggest you have a look at my answer to this question.