import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
df = pd.DataFrame({"Final Tweet": ["bad delivery", "bad product"]})
tf_idf = TfidfVectorizer(binary=True)
tfidf_mat = tf_idf.fit_transform(df["Final Tweet"]).toarray()
tfidf = pd.DataFrame(tfidf_mat, columns=tf_idf.get_feature_names_out())
tfidf.head()
bad delivery product
0 0.579739 0.814802 0.000000
1 0.579739 0.000000 0.814802
df.head()
Final Tweet
0 bad delivery
1 bad product
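I have tried computing this by hand with various formulas, but my results differ from the Jupyter notebook output above. Can you help me work through the calculation by hand so that the resulting values match those above?
Following the nomenclature of the Wikipedia TF-IDF article:
Term frequency in each document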
tf(t, d) = f_{t,d} / (sum_{t' in d} (f_{t',d}))
- tf("bad", "bad delivery") = 1/(1 + 1) = 1/2 = 0.5
- tf("delivery", "bad delivery") = 1/(1 + 1) = 1/2 = 0.5
- tf("product", "bad delivery") = 0/(1 + 1) = 0/2 = 0.0
- tf("bad", "bad product") = 1/(1 + 1) = 1/2 = 0.5
- tf("delivery", "bad product") = 0/(1 + 1) = 0/2 = 0.0
- tf("product", "bad product") = 1/(1 + 1) = 1/2 = 0.5
Note: the more often a word appears in a document, the larger its TF.
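As a rough check of the term-frequency step, here is a minimal pure-Python sketch. The names `docs`, `vocab`, and `term_frequency` are illustrative and not part of the original post; whitespace splitting stands in for the vectorizer's tokenizer, which is an assumption that happens to hold for these two documents.

```python
from collections import Counter

# The two documents from the question.
docs = ["bad delivery", "bad product"]
# Alphabetical vocabulary, matching the column order in the notebook output.
vocab = sorted({word for doc in docs for word in doc.split()})

def term_frequency(term, doc):
    """tf(t, d) = f_{t,d} / sum of all term counts in d."""
    counts = Counter(doc.split())
    return counts[term] / sum(counts.values())

tf = {doc: {term: term_frequency(term, doc) for term in vocab} for doc in docs}
for doc in docs:
    print(doc, tf[doc])
```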
Inverse document frequency
idf(t) = log [ (1 + n) / (1 + df(t)) ] + 1
- idf("bad") = ln((2 + 1)/(2 + 1)) + 1 = ln(3/3) + 1 = 1.0
- idf("delivery") = ln((2 + 1)/(1 + 1)) + 1 = ln(3/2) + 1 = 1.405465
- idf("product") = ln((2 + 1)/(1 + 1)) + 1 = ln(3/2) + 1 = 1.405465
Note: the rarer a word is in the corpus, the larger its IDF.
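The smoothed IDF values can be reproduced with a short sketch like the one below. It assumes the natural logarithm, as in the worked lines above; `smoothed_idf` is a hypothetical helper name.

```python
import math

docs = ["bad delivery", "bad product"]
vocab = sorted({word for doc in docs for word in doc.split()})
n = len(docs)  # number of documents in the corpus

def smoothed_idf(term):
    """idf(t) = ln((1 + n) / (1 + df(t))) + 1, with df(t) = documents containing t."""
    df = sum(term in doc.split() for doc in docs)
    return math.log((1 + n) / (1 + df)) + 1

idf = {term: smoothed_idf(term) for term in vocab}
for term, value in idf.items():
    print(term, round(value, 6))
```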
TF*IDF (unnormalized)
tfidf(t, d) = tf(t, d) * idf(t)
- tfidf("bad", "bad delivery") = 0.5 * 1.0 = 0.5
- tfidf("delivery", "bad delivery") = 0.5 * 1.405465 = 0.702733
- tfidf("product", "bad delivery") = 0.0 * 1.405465 = 0.0
- tfidf("bad", "bad product") = 0.5 * 1.0 = 0.5
- tfidf("delivery", "bad product") = 0.0 * 1.405465 = 0.0
- tfidf("product", "bad product") = 0.5 * 1.405465 = 0.702733
Note: the product balances how rare a term is across the whole corpus against how frequent it is in each particular document.
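To illustrate the unnormalized product, the sketch below multiplies the two quantities for every (term, document) pair. The helper name `raw_tfidf` is illustrative only, and whitespace tokenization is assumed.

```python
import math
from collections import Counter

docs = ["bad delivery", "bad product"]
vocab = sorted({word for doc in docs for word in doc.split()})
n = len(docs)

def raw_tfidf(term, doc):
    """tfidf(t, d) = tf(t, d) * idf(t), before any normalization."""
    counts = Counter(doc.split())
    tf = counts[term] / sum(counts.values())
    df = sum(term in d.split() for d in docs)
    idf = math.log((1 + n) / (1 + df)) + 1
    return tf * idf

weights = {doc: [raw_tfidf(term, doc) for term in vocab] for doc in docs}
for doc in docs:
    print(doc, [round(w, 6) for w in weights[doc]])
```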
TF*IDF (normalized so that each document's L2 norm equals 1.0)
Divide each coordinate of a document vector by the vector's L2 norm to obtain a unit vector:
- tfidf("bad", "bad delivery") = 0.5/√(0.5² + 0.702733² + 0.0²) = 0.579739
- tfidf("delivery", "bad delivery") = 0.702733/√(0.5² + 0.702733² + 0.0²) = 0.814802
- tfidf("product", "bad delivery") = 0.0/√(0.5² + 0.702733² + 0.0²) = 0.0
- tfidf("bad", "bad product") = 0.5/√(0.5² + 0.0² + 0.702733²) = 0.579739
- tfidf("delivery", "bad product") = 0.0/√(0.5² + 0.0² + 0.702733²) = 0.0
- tfidf("product", "bad product") = 0.702733/√(0.5² + 0.0² + 0.702733²) = 0.814802
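Putting the steps together, a self-contained sketch of the whole hand calculation might look like this. The function names are hypothetical and the tokenization is plain whitespace splitting. Note that scikit-learn itself starts from raw counts (0/1 with `binary=True`) rather than length-normalized term frequencies; the L2 normalization cancels that constant per-document factor, so both routes should land on the same unit vectors.

```python
import math
from collections import Counter

docs = ["bad delivery", "bad product"]
vocab = sorted({word for doc in docs for word in doc.split()})
n = len(docs)

def tfidf_row(doc):
    """Unnormalized tf * idf for every vocabulary term in one document."""
    counts = Counter(doc.split())
    total = sum(counts.values())
    row = []
    for term in vocab:
        df = sum(term in d.split() for d in docs)
        idf = math.log((1 + n) / (1 + df)) + 1
        row.append(counts[term] / total * idf)
    return row

def l2_normalize(row):
    """Scale a vector so that its Euclidean length is 1.0."""
    norm = math.sqrt(sum(x * x for x in row))
    return [x / norm for x in row]

normalized = [l2_normalize(tfidf_row(doc)) for doc in docs]
print(vocab)
for doc, row in zip(docs, normalized):
    print(doc, [round(x, 6) for x in row])
```

The printed rows should agree, up to rounding, with the `tfidf.head()` table from the notebook.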
You can also look at the relevant code in the scikit-learn source.