Python gensim (TfidfModel): How is Tf-Idf calculated?

1. For the following test text,

test=['test test', 'test toy']

the tf-idf scores [without normalization (smartirs: 'ntn')] are

[['test', 1.17]]  
[['test', 0.58], ['toy', 1.58]]

This does not seem to agree with what I get from a direct calculation using

tfidf (w, d) = tf x idf  
where idf(term)=log (total number of documents / number of documents containing term)   
tf = number of instances of word in d document / total number of words of d document

For example,

doc 1: 'test test'  
for "test" word  
tf= 1  
idf= log(2/2) = 0  
tf-idf = 0

Could someone show me the calculation using my test text above?

2. When I switch to cosine normalization (smartirs: 'ntc'), I get

[['test', 1.0]]  
[['test', 0.35], ['toy', 0.94]]

Could someone show me that calculation as well?

Thanks

import gensim
from gensim import corpora
from gensim import models
import numpy as np
from gensim.utils import simple_preprocess

test = ['test test', 'test toy']

# Tokenize each document into a list of lowercase tokens.
texts = [simple_preprocess(doc) for doc in test]

# Map tokens to integer ids, then convert each document to (id, count) pairs.
mydict = corpora.Dictionary(texts)
mycorpus = [mydict.doc2bow(doc, allow_update=True) for doc in texts]

# 'ntn': raw term frequency, idf weighting, no normalization.
tfidf = models.TfidfModel(mycorpus, smartirs='ntn')

for doc in tfidf[mycorpus]:
    print([[mydict[id], np.around(freq, decimals=2)] for id, freq in doc])

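If you want to know the details of the models.TfidfModel implementation, you can read it directly in the gensim GitHub repository. The specific calculation scheme that corresponds to smartirs='ntn' is described on the Wikipedia page for the SMART Information Retrieval System. The exact calculations differ from the ones you used, which is why the results differ.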

For example, the specific discrepancy you point to:

idf= log(2/2) = 0

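should actually be log2((N + 1) / N_k), where N is the number of documents and N_k is the number of documents containing the term:

idf = log2((2 + 1) / 2) = log2(1.5) ≈ 0.58

With tf taken as the raw count (2 for "test" in doc 1, not the count divided by the document length), this gives 2 x 0.58 ≈ 1.17, which matches the output above.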
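
As a rough check, the unnormalized scores reported in the question can be reproduced by hand. This is only a sketch, assuming tf is the raw term count and idf is log2((N + 1) / df); the function name ntn_weights is made up for illustration and is not part of gensim.

```python
import math
from collections import Counter

docs = [["test", "test"], ["test", "toy"]]

def ntn_weights(docs):
    """Raw term count times log2((N + 1) / df), with no normalization."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw counts, not divided by document length
        weights.append(
            {t: c * math.log2((n_docs + 1) / df[t]) for t, c in tf.items()}
        )
    return weights

for doc_weights in ntn_weights(docs):
    print({t: round(w, 2) for t, w in doc_weights.items()})
# {'test': 1.17}
# {'test': 0.58, 'toy': 1.58}
```

Under these assumptions the printed values agree with the 'ntn' output shown in the question.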

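I suggest you review both the implementation and the page mentioned above, so that your manual check follows what the implementation does for the chosen smartirs flag.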
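
The 'ntc' figures from the question appear to follow from the same unnormalized weights, each divided by the Euclidean (L2) length of its document vector. A minimal sketch, assuming that reading of the trailing 'c' and an idf of log2((N + 1) / df); cosine_normalize is an illustrative helper, not a gensim function.

```python
import math

def cosine_normalize(weights):
    """Divide each weight by the Euclidean (L2) length of the document vector."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return dict(weights)  # leave an all-zero vector unchanged
    return {t: w / norm for t, w in weights.items()}

# Unnormalized weights for the two test documents (N = 2 documents).
idf_test = math.log2(3 / 2)  # "test" occurs in both documents
idf_toy = math.log2(3 / 1)   # "toy" occurs in one document
doc1 = {"test": 2 * idf_test}
doc2 = {"test": 1 * idf_test, "toy": 1 * idf_toy}

for doc in (doc1, doc2):
    print({t: round(w, 2) for t, w in cosine_normalize(doc).items()})
# {'test': 1.0}
# {'test': 0.35, 'toy': 0.94}
```

A document with a single term always normalizes to 1.0, which is why doc 1 loses all information about its raw score under 'ntc'.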
