r - quanteda: remove empty documents to compute tf-idf but keep them in the final dfm



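I am trying to compute tf-idf on a dataset that contains many empty documents. I want the tf-idf to be computed without the empty documents, but the output should still be a dfm object with the original number of documents.

Here is an example: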

texts = c("", "Bonjour!", "Hello, how are you", "", "Good", "", "", "")
a = texts %>%
tokens(tolower=T, remove_punct=T) %>%
dfm() %>%
dfm_wordstem() %>%
dfm_remove(stopwords("en")) %>%
dfm_tfidf()
print(a, max_ndoc=10)
Document-feature matrix of: 8 documents, 3 features (87.50% sparse) and 0 docvars.
features
docs    bonjour   hello    good
text1 0       0       0      
text2 0.90309 0       0      
text3 0       0.90309 0      
text4 0       0       0      
text5 0       0       0.90309
text6 0       0       0      
text7 0       0       0      
text8 0       0       0    

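However, the IDF is affected by the number of empty documents, which I do not want. I therefore compute tf-idf on the subset of non-empty documents, as follows: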

a2 = texts %>%
tokens(tolower=T, remove_punct=T) %>%
dfm() %>%
dfm_subset(ntoken(.) > 0) %>%
dfm_wordstem() %>%
dfm_remove(stopwords("en")) %>%
dfm_tfidf()
print(a2, max_ndoc=10)
Document-feature matrix of: 3 documents, 3 features (66.67% sparse) and 0 docvars.
features
docs      bonjour     hello      good
text2 0.4771213 0         0        
text3 0         0.4771213 0        
text5 0         0         0.4771213
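
The gap between 0.90309 and 0.4771213 in the two outputs can be checked by hand. The sketch below assumes dfm_tfidf() is run with its default settings (raw term counts and a base-10 inverse document frequency, log10(N / df)); the helper idf() is a hypothetical name introduced only for this illustration. Each of the three terms occurs once, in exactly one document, so the tf-idf value equals the IDF.

```r
# Inverse document frequency as log10(N / df); idf() is an illustrative helper,
# not a quanteda function.
idf <- function(n_docs, doc_freq) log10(n_docs / doc_freq)

# Each term appears in exactly one document (df = 1).
idf_with_empty    <- idf(8, 1)  # all 8 documents counted, including the 5 empty ones
idf_without_empty <- idf(3, 1)  # only the 3 non-empty documents counted

round(idf_with_empty, 5)     # 0.90309
round(idf_without_empty, 7)  # 0.4771213
```

So the empty documents inflate N, and with it every non-zero weight, without contributing any terms.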

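I now want a sparse matrix in the same format as the first one, but holding the values computed on the non-empty texts. I found this code on Stack Overflow: https://stackoverflow.com/a/65635722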

add_rows_2 <- function(M, v) {
  oldind <- unique(M@i)
  ## new row indices
  newind <- oldind + as.integer(rowSums(outer(oldind, v, ">=")))
  ## modify dimensions
  M@Dim <- M@Dim + c(length(v), 0L)
  M@i <- newind[match(M@i, oldind)]
  M
}
empty_texts_idx = which(texts=="")
position_after_insertion = empty_texts_idx - 1:(length(empty_texts_idx))
a3 = add_rows_2(a2, position_after_insertion)
print(a3, max_ndoc=10)
Document-feature matrix of: 8 documents, 3 features (87.50% sparse) and 0 docvars.
features
docs        bonjour     hello      good
text2.1 0         0         0        
text3.1 0.4771213 0         0        
text5.1 0         0.4771213 0        
NA.NA   0         0         0        
NA.NA   0         0         0.4771213
NA.NA   0         0         0        
NA.NA   0         0         0        
NA.NA   0         0         0        

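This is what I want: the empty texts have been added at the appropriate rows of the matrix.

Question 1: I would like to know whether there is a more efficient way to do this directly with the quanteda package...

Question 2: ...or at least a way that does not alter the structure of the dfm object, since a3 and a do not have the same docvars attribute.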
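
To see why the rows land in the right place, the index arithmetic in add_rows_2 can be traced in base R without any sparse matrix. This is only a sketch: old_rows stands in for unique(M@i), on the assumption that the three non-empty documents occupy zero-based rows 0, 1 and 2 of a2, and the variable names are introduced here for illustration.

```r
texts <- c("", "Bonjour!", "Hello, how are you", "", "Good", "", "", "")

# Positions of the empty texts in the original vector (one-based).
empty_idx <- which(texts == "")

# Same positions expressed relative to the matrix of non-empty documents:
# each empty text is shifted back by the number of empty texts up to and including it.
v <- empty_idx - seq_along(empty_idx)

# Zero-based rows of the non-empty documents before insertion (assumed 0, 1, 2).
old_rows <- 0:2

# Each existing row moves down by the number of insertions at or before it.
new_rows <- old_rows + as.integer(rowSums(outer(old_rows, v, ">=")))

# Back to one-based positions: these match which(texts != "").
new_rows + 1L
```

The non-empty rows therefore end up at positions 2, 3 and 5, which is where "Bonjour!", "Hello, how are you" and "Good" sit in texts.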

print(a3@docvars)
docname_ docid_ segid_
1    text2  text2      1
2    text3  text3      1
3    text5  text5      1
print(docnames(a3))
[1] "text2" "text3" "text5"
print(a@docvars)
docname_ docid_ segid_
1    text1  text1      1
2    text2  text2      1
3    text3  text3      1
4    text4  text4      1
5    text5  text5      1
6    text6  text6      1
7    text7  text7      1
8    text8  text8      1

I can get a "correct" format for a3 by running the following lines of code:

# necessary to print proper names in 'docs' column
# note: the original post used an undefined vector `textes3` here; `texts` is assumed to be the intended one
doc_ids = paste0("text", seq_along(texts))
new_docvars = data.frame(docname_=as.factor(doc_ids), docid_=as.factor(doc_ids), segid_=rep(1, length(texts)))
a3@docvars = new_docvars
# The following line is necessary for cv.glmnet to run using a3 as covariates
docnames(a3) <- doc_ids
# seems equivalent to a3@Dimnames$docs <- doc_ids
print(a3, max_ndoc=10)
Document-feature matrix of: 8 documents, 3 features (87.50% sparse) and 0 docvars.
features
docs      bonjour     hello      good
text1 0         0         0        
text2 0.4771213 0         0        
text3 0         0.4771213 0        
text4 0         0         0        
text5 0         0         0.4771213
text6 0         0         0        
text7 0         0         0        
text8 0         0         0
print(a3@docvars) # this is now as expected
docname_ docid_ segid_
1    text1  text1      1
2    text2  text2      1
3    text3  text3      1
4    text4  text4      1
5    text5  text5      1
6    text6  text6      1
7    text7  text7      1
8    text8  text8      1
print(docnames(a3)) # this is now as expected
[1] "text1" "text2" "text3" "text4" "text5" "text6" "text7" "text8"

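I need to change docnames(a3) because I want to use a3 as the covariates of a model that I train with cv.glmnet, and I get an error if I do not change the document names of a3. Again, is this the right way to proceed with quanteda? Changing the docvars by hand does not feel like the right approach, and I could not find anything about it online. Any insights would be appreciated.

Thanks!

I am not sure whether removing empty documents before computing tf-idf is a good idea, but the removed documents are easy to restore with drop_docid = FALSE and fill = TRUE, because quanteda keeps track of them.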

require(quanteda)
#> Loading required package: quanteda
#> Package version: 3.2.1
#> Unicode version: 13.0
#> ICU version: 66.1
#> Parallel computing: 10 of 10 threads used.
#> See https://quanteda.io for tutorials and examples.
txt <- c("", "Bonjour!", "Hello, how are you", "", "Good", "", "", "")
corp <- corpus(txt)
dfmt <- dfm(tokens(corp))
dfmt
#> Document-feature matrix of: 8 documents, 8 features (87.50% sparse) and 0 docvars.
#>        features
#> docs    bonjour ! hello , how are you good
#>   text1       0 0     0 0   0   0   0    0
#>   text2       1 1     0 0   0   0   0    0
#>   text3       0 0     1 1   1   1   1    0
#>   text4       0 0     0 0   0   0   0    0
#>   text5       0 0     0 0   0   0   0    1
#>   text6       0 0     0 0   0   0   0    0
#> [ reached max_ndoc ... 2 more documents ]

dfmt2 <- dfm_subset(dfmt, ntoken(dfmt) > 0, drop_docid = FALSE) %>% 
dfm_tfidf()
dfmt2
#> Document-feature matrix of: 3 documents, 8 features (66.67% sparse) and 0 docvars.
#>        features
#> docs      bonjour         !     hello         ,       how       are       you
#>   text2 0.4771213 0.4771213 0         0         0         0         0        
#>   text3 0         0         0.4771213 0.4771213 0.4771213 0.4771213 0.4771213
#>   text5 0         0         0         0         0         0         0        
#>        features
#> docs         good
#>   text2 0        
#>   text3 0        
#>   text5 0.4771213
dfmt3 <- dfm_group(dfmt2, fill = TRUE, force = TRUE)
dfmt3
#> Document-feature matrix of: 8 documents, 8 features (87.50% sparse) and 0 docvars.
#>        features
#> docs      bonjour         !     hello         ,       how       are       you
#>   text1 0         0         0         0         0         0         0        
#>   text2 0.4771213 0.4771213 0         0         0         0         0        
#>   text3 0         0         0.4771213 0.4771213 0.4771213 0.4771213 0.4771213
#>   text4 0         0         0         0         0         0         0        
#>   text5 0         0         0         0         0         0         0        
#>   text6 0         0         0         0         0         0         0        
#>        features
#> docs         good
#>   text1 0        
#>   text2 0        
#>   text3 0        
#>   text4 0        
#>   text5 0.4771213
#>   text6 0        
#> [ reached max_ndoc ... 2 more documents ]

Created on 2022-06-16 by the reprex package (v2.0.1)
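
For completeness, here is a sketch of how the same two arguments might be combined with the preprocessing steps from the question (punctuation removal, stemming, English stopword removal). It assumes quanteda 3.x as in the reprex above; the intermediate names toks, dfm_all, dfm_nonempty and dfm_full are introduced here for illustration only.

```r
library(quanteda)

texts <- c("", "Bonjour!", "Hello, how are you", "", "Good", "", "", "")

# Preprocess as in the question; dfm() lowercases by default.
toks <- tokens(corpus(texts), remove_punct = TRUE)
dfm_all <- dfm_remove(dfm_wordstem(dfm(toks)), stopwords("en"))

# Drop empty documents but keep their ids, then weight by tf-idf.
dfm_nonempty <- dfm_tfidf(
  dfm_subset(dfm_all, ntoken(dfm_all) > 0, drop_docid = FALSE)
)

# Restore the dropped documents as all-zero rows.
# force = TRUE is needed because the dfm is already weighted.
dfm_full <- dfm_group(dfm_nonempty, fill = TRUE, force = TRUE)

ndoc(dfm_full)
docnames(dfm_full)
```

One design choice to be aware of: the subset is taken after stopword removal here, so a document made up only of stopwords would also be treated as empty and excluded from the IDF. Subsetting before dfm_remove(), as in the question, would keep such a document in the count.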
