R quanteda: compute the percentage of co-occurrence for a given keyword



Hello, I have the following dataset:

df <- data.frame(text = c("House Sky Blue",
                          "House Sky Green",
                          "House Sky Red",
                          "House Sky Yellow",
                          "House Sky Green",
                          "House Sky Glue",
                          "House Sky Green"))

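I want to find the percentage of documents in which certain terms co-occur. For example, among all documents that contain the token "House", how many also contain the word "Green"?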

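In our data, 7 documents contain the term House, and 3 out of those 7 (p = 100 * 3/7) also contain the word Green. It would also be nice to see which other terms or tokens appear together with the token "House".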

I have already tried these two functions:

> textstat_collocations(tokens)
  collocation count count_nested length   lambda        z
1   house sky     7            0      2 5.416100 2.622058
2   sky green     3            0      2 2.456736 1.511653

And the function textstat_simil:

textstat_simil(dfm(tokens),margin="features")
textstat_simil object; method = "correlation"
       house sky   blue  green    red yellow   glue
house    NaN NaN    NaN    NaN    NaN    NaN    NaN
sky      NaN NaN    NaN    NaN    NaN    NaN    NaN
blue     NaN NaN  1.000 -0.354 -0.167 -0.167 -0.167
green    NaN NaN -0.354  1.000 -0.354 -0.354 -0.354
red      NaN NaN -0.167 -0.354  1.000 -0.167 -0.167
yellow   NaN NaN -0.167 -0.354 -0.167  1.000 -0.167
glue     NaN NaN -0.167 -0.354 -0.167 -0.167  1.000

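However, neither of them seems to give me the output I want. I would also like to know why the correlation between green and house is NaN in the textstat_simil function.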

The output I want would show the following information:

feature="House"
percentage of co-occurrence 
Green = 3/7
Blue = 1/7
Red = 1/7
Yellow = 1/7
Glue = 1/7

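I am a very attentive user, and I will upvote and accept the best answer. Thank you all so much for your help. I can't seem to find a function in the quanteda documentation that gives me the output I want, although I know there must be a way, because this library is so fast and complete! I am looking for a solution that uses only the quanteda library. Thanks again.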

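One way to do this is to use fcm() to get document-level co-occurrences for the target feature. Below, I show how to use fcm() and fcm_remove() to remove the target feature, and then a loop to produce the desired printed output.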

library("quanteda")
#> Package version: 3.2.4
#> Unicode version: 14.0
#> ICU version: 70.1
#> Parallel computing: 10 of 10 threads used.
#> See https://quanteda.io for tutorials and examples.
df <- data.frame(text = c("House Sky Blue",
                          "House Sky Green",
                          "House Sky Red",
                          "House Sky Yellow",
                          "House Sky Green",
                          "House Sky Glue",
                          "House Sky Green"))
corp <- corpus(df)

coocc_fract <- function(corp, feature) {
  # create a document-level co-occurrence matrix
  fcmat <- fcm(dfm(tokens(corp), tolower = FALSE), context = "document")
  # drop the target feature from the matrix
  fcmat <- fcm_remove(fcmat, feature)
  cat("feature=\"", feature, "\"\n", sep = "")
  cat(" percentage of co-occurrence\n\n")
  # NOTE: row 1 is now the first remaining feature ("Sky" in this example);
  # its counts match those of "House" only because both occur in every document
  for (f in featnames(fcmat)) {
    # skip zeroes
    freq <- as.character(fcmat[1, f])
    if (freq != "0") {
      cat(f, " = ", freq, "/", ndoc(corp), "\n", sep = "")
    }
  }
}

This produces the following output:

coocc_fract(corp, feature = "House")
#> feature="House"
#>  percentage of co-occurrence
#> 
#> Blue = 1/7
#> Green = 3/7
#> Red = 1/7
#> Yellow = 1/7
#> Glue = 1/7

Created on 2023-01-02 with reprex v2.0.2
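On the side question of why textstat_simil() shows NaN for house and green: the Pearson correlation divides by the standard deviations of both count vectors, and "house" (like "sky") occurs exactly once in every document, so its standard deviation is zero and the ratio is undefined. The sketch below illustrates this with base R's cor() on count vectors typed in by hand from the example data (an assumption for illustration, not values extracted from the dfm); note that base R reports NA with a warning where quanteda prints NaN.

```r
# Per-document counts, entered by hand from the seven example documents
house <- c(1, 1, 1, 1, 1, 1, 1)  # "house" occurs once in every document
green <- c(0, 1, 0, 0, 1, 0, 1)  # "green" occurs in documents 2, 5 and 7
blue  <- c(1, 0, 0, 0, 0, 0, 0)  # "blue" occurs in document 1 only

# A constant vector has zero standard deviation
sd_house <- sd(house)

# The correlation with a constant vector is undefined (division by zero);
# base R warns and returns NA, so the warning is suppressed here
r_house_green <- suppressWarnings(cor(house, green))

# Two non-constant vectors give an ordinary correlation, which can be
# compared with the blue/green cell of the textstat_simil() output
r_blue_green <- cor(blue, green)

print(sd_house)
print(r_house_green)
print(round(r_blue_green, 3))
```

Correlation therefore cannot say anything about a term that appears in every document, which is why a count-based share is the more useful measure for this question.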

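I couldn't find anything built into quanteda, so I cobbled something together: one function that creates a list object containing the chosen word and a frequency table, and a print method that prints the desired output. You can adjust the function to return only what you need and add more checks to validate the input.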

The code:

dat <- data.frame(text = c("House Sky Blue",
                           "House Sky Green",
                           "House Sky Red",
                           "House Sky Yellow",
                           "House Sky Green",
                           "House Sky Glue",
                           "House Sky Green"))

library(quanteda)
library(quanteda.textstats)
my_dfm <- dfm(tokens(corpus(dat)))
freqs <- textstat_frequency(my_dfm)
# create function to return a list with the chosen word and a frequency table
create_co_occurrence <- function(x, word) {

  if (!inherits(x, "frequency")) {
    stop("x must be a frequency table generated by textstat_frequency.",
         call. = FALSE)
  }

  # check that word is a single character string
  if (!is.character(word) || length(word) != 1) {
    stop("word must be a single character string.", call. = FALSE)
  }

  input <- x

  # overall frequency of the chosen word
  word_frequency <- input$frequency[input$feature == word]

  # frequency of every other feature relative to the chosen word
  out <- input[input$feature != word, ]
  out$percentage <- out$frequency / word_frequency
  out <- out[, c("feature", "percentage")]
  # reset row.names
  row.names(out) <- NULL
  out_list <- list(word = word,
                   co_occurrence = out)

  class(out_list) <- c("co_occurrence", "list")
  out_list
}

# create print function.
print.co_occurrence <- function(x, ...) {

  writeLines(sprintf("feature = %s", x$word))
  writeLines("percentage of co-occurrence\n")
  print.data.frame(x$co_occurrence)
}
Output:

test <- create_co_occurrence(freqs, "house")
# calling test will activate the print.co_occurrence function and format the results
test
feature = house
percentage of co-occurrence

feature percentage
2     sky  1.0000000
3   green  0.4285714
4    blue  0.1428571
5     red  0.1428571
6  yellow  0.1428571
7    glue  0.1428571
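Note that the ratio above divides overall term frequencies. It equals the share of co-occurring documents here only because every term appears at most once per document and "house" appears in all seven. As a hedged sketch of a document-level variant (not part of either answer; the helper name cooccurrence_share is hypothetical), one could subset the dfm to the documents that contain the target and then take document frequencies:

```r
library(quanteda)

txt <- c("House Sky Blue", "House Sky Green", "House Sky Red",
         "House Sky Yellow", "House Sky Green", "House Sky Glue",
         "House Sky Green")
dfmat <- dfm(tokens(txt))  # dfm() lowercases features by default

# Hypothetical helper: among the documents that contain `target`,
# the share that also contain each other feature
cooccurrence_share <- function(x, target) {
  stopifnot(is.dfm(x), target %in% featnames(x))
  # TRUE for every document in which `target` occurs at least once
  has_target <- as.matrix(x)[, target] > 0
  sub <- x[has_target, ]
  # docfreq() counts the documents in which each feature occurs
  share <- docfreq(sub) / ndoc(sub)
  sort(share[names(share) != target], decreasing = TRUE)
}

res <- cooccurrence_share(dfmat, "house")
print(res)
```

Because it counts documents rather than tokens, this stays a proportion between 0 and 1 even when a term is repeated within a document or the target is missing from some documents.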
