A while back I got help building tf-idf for one of my documents, and I got the output I wanted (see below).
TagSet <- data.frame(emoticon = c("🤔","🍺","💪","🥓","😃"),
                     stringsAsFactors = FALSE)
TextSet <- data.frame(tweet = c("🤔Sharp, adversarial⚔️~pro choice💪~ban Pit Bulls☠️~BSL🕊️~aberant psychology😈~common sense🤔~the Piper will lead us to reason🎵~sealskin woman🐺",
"Blocked by Owen, Adonis. Abbott & many #FBPE😃 Love seaside, historic houses & gardens, family & pets. RTs & likes/ Follows may=interest not agreement 🇬🇧",
"🇺🇸🇺🇸🇺🇸🇺🇸 #healthy #vegetarian #beatchronicillness fix infrastructure",
"LIBERTY-IDENTITARIAN. My bio, photo at Site Info. And kindly add my site to your Daily Favorites bar. Thank you, Eric",
"💙🖤I #BackTheBlue for my son!🖤💙 Facts Over Feelings. Border Security saves lives! #ThankYouICE",
"🤔🇺🇸🇺🇸 I play Pedal Steel @CooderGraw & #CharlieShafter🇺🇸🇺🇸 #GoStars #LiberalismIsAMentalDisorder",
"#Englishman #Londoner @Chelseafc 🕵️♂️ 🥓🚁 🍺 🏴🇬🇧🇨🇿",
"F*** the Anti-White Agenda #Christian #Traditional #TradThot #TradGirl #European #MAGA #AltRight #Folk #Family #WhitePride",
"🌸🐦❄️Do not dwell in tbaconhe past, do not dream of the future, concentrate the mind on the present moment.🌸🐿️❄️",
"Ordinary girl in a messed up World | Christian | Anti-War | Anti-Zionist | Pro-Life | Pro 🇸🇪 | 👋🏼Hello intro on the Minds Link |"),
                      stringsAsFactors = FALSE)
library(dplyr)
library(quanteda)
tweets_dfm <- dfm(tokens(TextSet$tweet)) # convert to a document-feature matrix (quanteda >= 3 requires tokenizing first)
tweets_dfm %>%
  dfm_select(TagSet$emoticon) %>% # only leave emoticons in the dfm
  dfm_tfidf() %>%                 # weight with tf-idf
  convert("data.frame")           # turn into data.frame to display more easily
# document 🤔 🍺 💪 🥓 😃
# 1 text1 1.39794 1 0 0 0
# 2 text2 0.00000 0 1 0 0
# 3 text3 0.00000 0 0 0 0
# 4 text4 0.00000 0 0 0 0
# 5 text5 0.00000 0 0 0 0
# 6 text6 0.69897 0 0 0 0
# 7 text7 0.00000 0 0 1 1
# 8 text8 0.00000 0 0 0 0
# 9 text9 0.00000 0 0 0 0
# 10 text10 0.00000 0 0 0 0
But I need a bit of help computing the tf-idf for each individual term. That is: how do I get a single, exact tf-idf value per term out of this matrix?
# terms tfidf
# 🤔     # its tf-idf, computed the correct way
# 🍺     # its tf-idf, computed the correct way
# 💪     # its tf-idf, computed the correct way
# 🥓     # its tf-idf, computed the correct way
# 😃     # its tf-idf, computed the correct way
I'm fairly sure it's not as simple as summing all the tf-idf values in a term's column and dividing by the number of documents it appears in, and calling that the term's value.
I've looked at some sources, e.g. https://stats.stackexchange.com/questions/422750/how-to-calculate-tf-idf-for-a-single-term, but as far as I can tell the author there is asking something entirely different.
My text-mining/analysis terminology is currently weak.
In short, you cannot compute a tf-idf value for each feature in isolation from its document context, because each feature's tf-idf value is specific to a document.
More specifically:
- (inverse) document frequency is one value per feature, so it is indexed by $j$
- term frequency is one value per term per document, so it is indexed by $i,j$
- therefore, tf-idf is indexed by $i,j$
You can see this in your example:
> tweets_dfm %>%
+ dfm_tfidf() %>%
+ dfm_select(TagSet$emoticon) %>% # only leave emoticons in the dfm
+ as.matrix()
features
docs U0001f914 U0001f4aa U0001f603 U0001f953 U0001f37a
text1 1.39794 1 0 0 0
text2 0.00000 0 1 0 0
text3 0.00000 0 0 0 0
text4 0.00000 0 0 0 0
text5 0.00000 0 0 0 0
text6 0.69897 0 0 0 0
text7 0.00000 0 0 1 1
text8 0.00000 0 0 0 0
text9 0.00000 0 0 0 0
text10 0.00000 0 0 0 0
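If you want those per-document, per-term values as a tidy table rather than a matrix, one option is to reshape the converted data frame to long format, which makes the $i,j$ indexing explicit. A sketch using tidyr (the doc_id column name assumes quanteda >= 2):

```r
library(dplyr)
library(tidyr)
library(quanteda)

# One row per (document, term) pair, i.e. one row per i,j index.
tfidf_long <- tweets_dfm %>%
  dfm_tfidf() %>%
  dfm_select(TagSet$emoticon) %>%
  convert("data.frame") %>%
  pivot_longer(-doc_id, names_to = "term", values_to = "tfidf")
```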
Two more things:
Averaging by feature is not really a meaningful thing to do, since a feature's inverse document frequency is already a kind of average, or at least the inverse of the proportion of documents in which the term occurs. It is also usually logged, so some transformation would be needed before averaging.
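If what you are after is a single number per term, the only quantity that is genuinely per-feature is the idf itself, which you can compute from the document frequencies. A sketch, assuming the default "inverse" scheme of dfm_tfidf(), i.e. log10(N/df):

```r
library(quanteda)

# idf_j = log10(N / df_j): one value per feature, indexed by j only.
N   <- ndoc(tweets_dfm)
df  <- docfreq(dfm_select(tweets_dfm, TagSet$emoticon))
idf <- log10(N / df)
# e.g. 🤔 occurs in 2 of the 10 tweets, so its idf is log10(5) = 0.69897,
# matching the per-occurrence weight in the matrix above.
```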
Above, I computed the tf-idf before removing the other features, because if you used relative ("normalized") term frequencies, removing features first would redefine the term frequencies. By default,
dfm_tfidf()
uses term counts, so the result here is unaffected.
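To see why the order matters under relative term frequencies, compare the two pipelines below (scheme_tf = "prop" is a documented option of dfm_tfidf() that divides counts by document length):

```r
library(quanteda)

dfm_counts <- dfm(tokens(TextSet$tweet))

# Weight first, then select: proportions are computed over *all*
# features in each tweet, and only then are the emoticon columns kept.
weight_then_select <- dfm_counts %>%
  dfm_tfidf(scheme_tf = "prop") %>%
  dfm_select(TagSet$emoticon)

# Select first, then weight: each tweet now contains only emoticons,
# so the proportions (and hence the tf-idf values) come out larger.
select_then_weight <- dfm_counts %>%
  dfm_select(TagSet$emoticon) %>%
  dfm_tfidf(scheme_tf = "prop")
```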