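I'm trying to compute tf-idf on a dataset that contains many empty documents. I want the weights to be computed without the empty documents, but I still want the output to be a dfm object with the original number of documents.
Here is an example: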
library(quanteda)

texts = c("", "Bonjour!", "Hello, how are you", "", "Good", "", "", "")
a = texts %>%
  tokens(remove_punct=TRUE) %>%  # tokens() has no tolower argument; dfm() lowercases by default
  dfm() %>%
  dfm_wordstem() %>%
  dfm_remove(stopwords("en")) %>%
  dfm_tfidf()
print(a, max_ndoc=10)
Document-feature matrix of: 8 documents, 3 features (87.50% sparse) and 0 docvars.
       features
docs    bonjour hello   good
  text1 0       0       0
  text2 0.90309 0       0
  text3 0       0.90309 0
  text4 0       0       0
  text5 0       0       0.90309
  text6 0       0       0
  text7 0       0       0
  text8 0       0       0
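However, the IDF is affected by the number of empty documents, which I don't want. So I compute tf-idf on the subset of non-empty documents, as follows: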
a2 = texts %>%
  tokens(remove_punct=TRUE) %>%  # tokens() has no tolower argument; dfm() lowercases by default
  dfm() %>%
  dfm_subset(ntoken(.) > 0) %>%
  dfm_wordstem() %>%
  dfm_remove(stopwords("en")) %>%
  dfm_tfidf()
print(a2, max_ndoc=10)
Document-feature matrix of: 3 documents, 3 features (66.67% sparse) and 0 docvars.
       features
docs    bonjour   hello     good
  text2 0.4771213 0         0
  text3 0         0.4771213 0
  text5 0         0         0.4771213
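The gap between the two results can be traced to the document count alone. The sketch below assumes dfm_tfidf() is used with its default schemes (raw counts for term frequency and log10(N / df) for the inverse document frequency); `tfidf_cell` is a hypothetical helper written for illustration, not a quanteda function. Each of the three terms occurs once in exactly one document, so only N changes between the two runs:

```r
# Hypothetical helper (not part of quanteda): tf-idf for one cell,
# assuming raw-count tf and log10(N / df) idf.
tfidf_cell <- function(tf, n_docs, doc_freq) {
  tf * log10(n_docs / doc_freq)
}

# All 8 documents counted, including the 5 empty ones
with_empty <- tfidf_cell(tf = 1, n_docs = 8, doc_freq = 1)
# Only the 3 non-empty documents counted
without_empty <- tfidf_cell(tf = 1, n_docs = 3, doc_freq = 1)

print(round(with_empty, 5))     # 0.90309
print(round(without_empty, 7))  # 0.4771213
```

These match the values printed for `a` and `a2` respectively.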
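I now want a sparse matrix with the same format as the first one, but holding the values computed on the non-empty texts. I found this code on Stack Overflow: https://stackoverflow.com/a/65635722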
add_rows_2 <- function(M,v) {
  oldind <- unique(M@i)
  ## new row indices
  newind <- oldind + as.integer(rowSums(outer(oldind,v,">=")))
  ## modify dimensions
  M@Dim <- M@Dim + c(length(v),0L)
  M@i <- newind[match(M@i,oldind)]
  M
}
empty_texts_idx = which(texts=="")
position_after_insertion = empty_texts_idx - 1:(length(empty_texts_idx))
a3 = add_rows_2(a2, position_after_insertion)
print(a3, max_ndoc=10)
Document-feature matrix of: 8 documents, 3 features (87.50% sparse) and 0 docvars.
         features
docs      bonjour   hello     good
  text2.1 0         0         0
  text3.1 0.4771213 0         0
  text5.1 0         0.4771213 0
  NA.NA   0         0         0
  NA.NA   0         0         0.4771213
  NA.NA   0         0         0
  NA.NA   0         0         0
  NA.NA   0         0         0
This is what I want: the empty texts have been added at the appropriate rows of the matrix.
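To make the index arithmetic explicit, here is a minimal base R sketch of what add_rows_2() does to the row indices. It works on plain integer vectors rather than on the dfm itself, and the names `empty_idx`, `v`, `old_rows` and `new_rows` are illustrative. Each element of `v` is the number of non-empty rows that precede an inserted empty row, and the row slot of the sparse matrix is 0-based:

```r
texts <- c("", "Bonjour!", "Hello, how are you", "", "Good", "", "", "")

empty_idx <- which(texts == "")        # 1, 4, 6, 7, 8 in the original data
v <- empty_idx - seq_along(empty_idx)  # 0, 2, 3, 3, 3

# 0-based row indices of the three non-empty rows before insertion
old_rows <- 0:2
# Shift each row down by the number of empty rows inserted at or before it
new_rows <- old_rows + as.integer(rowSums(outer(old_rows, v, ">=")))

print(new_rows + 1L)  # 1-based positions after insertion: 2 3 5
```

The shifted positions coincide with `which(texts != "")`, which is why the values land on the right rows even though the document names are not updated.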
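Question 1: I wonder whether there is a more efficient way to do this directly with the quanteda package...
Question 2: ...or at least a way that does not change the structure of the dfm object, since a3 and a do not have the same docvars attribute.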
print(a3@docvars)
docname_ docid_ segid_
1 text2 text2 1
2 text3 text3 1
3 text5 text5 1
print(docnames(a3))
[1] "text2" "text3" "text5"
print(a@docvars)
docname_ docid_ segid_
1 text1 text1 1
2 text2 text2 1
3 text3 text3 1
4 text4 text4 1
5 text5 text5 1
6 text6 text6 1
7 text7 text7 1
8 text8 text8 1
I can get a "correct" format for a3 by running the following lines:
# necessary to print proper names in the 'docs' column
new_names = paste0("text", seq_along(texts))
new_docvars = data.frame(docname_ = as.factor(new_names),
                         docid_ = as.factor(new_names),
                         segid_ = rep(1, length(texts)))
a3@docvars = new_docvars
# The following line is necessary for cv.glmnet to run using a3 as covariates
docnames(a3) <- new_names
# seems equivalent to a3@Dimnames$docs <- new_names
print(a3, max_ndoc=10)
Document-feature matrix of: 8 documents, 3 features (87.50% sparse) and 0 docvars.
       features
docs    bonjour   hello     good
  text1 0         0         0
  text2 0.4771213 0         0
  text3 0         0.4771213 0
  text4 0         0         0
  text5 0         0         0.4771213
  text6 0         0         0
  text7 0         0         0
  text8 0         0         0
print(a3@docvars) # this is now as expected
docname_ docid_ segid_
1 text1 text1 1
2 text2 text2 1
3 text3 text3 1
4 text4 text4 1
5 text5 text5 1
6 text6 text6 1
7 text7 text7 1
8 text8 text8 1
print(docnames(a3)) # this is now as expected
[1] "text1" "text2" "text3" "text4" "text5" "text6" "text7" "text8"
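I need to change docnames(a3) because I want to use a3 as the covariates of a model trained with cv.glmnet, and I get an error if I leave the document names of a3 unchanged. Is this the correct way to proceed with quanteda? Changing the docvars by hand doesn't feel like the proper way to do it, and I couldn't find anything about this online. Any insights would be appreciated.
Thanks!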
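I don't know whether removing empty documents before computing tf-idf is a good idea, but it is easy to restore the removed documents with drop_docid = FALSE and fill = TRUE, because quanteda keeps track of them.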
require(quanteda)
#> Loading required package: quanteda
#> Package version: 3.2.1
#> Unicode version: 13.0
#> ICU version: 66.1
#> Parallel computing: 10 of 10 threads used.
#> See https://quanteda.io for tutorials and examples.
txt <- c("", "Bonjour!", "Hello, how are you", "", "Good", "", "", "")
corp <- corpus(txt)
dfmt <- dfm(tokens(corp))
dfmt
#> Document-feature matrix of: 8 documents, 8 features (87.50% sparse) and 0 docvars.
#>        features
#> docs    bonjour ! hello , how are you good
#>   text1       0 0     0 0   0   0   0    0
#>   text2       1 1     0 0   0   0   0    0
#>   text3       0 0     1 1   1   1   1    0
#>   text4       0 0     0 0   0   0   0    0
#>   text5       0 0     0 0   0   0   0    1
#>   text6       0 0     0 0   0   0   0    0
#> [ reached max_ndoc ... 2 more documents ]
dfmt2 <- dfm_subset(dfmt, ntoken(dfmt) > 0, drop_docid = FALSE) %>%
    dfm_tfidf()
dfmt2
#> Document-feature matrix of: 3 documents, 8 features (66.67% sparse) and 0 docvars.
#>        features
#> docs    bonjour   !         hello     ,         how       are       you
#>   text2 0.4771213 0.4771213 0         0         0         0         0
#>   text3 0         0         0.4771213 0.4771213 0.4771213 0.4771213 0.4771213
#>   text5 0         0         0         0         0         0         0
#>        features
#> docs    good
#>   text2 0
#>   text3 0
#>   text5 0.4771213
dfmt3 <- dfm_group(dfmt2, fill = TRUE, force = TRUE)
dfmt3
#> Document-feature matrix of: 8 documents, 8 features (87.50% sparse) and 0 docvars.
#>        features
#> docs    bonjour   !         hello     ,         how       are       you
#>   text1 0         0         0         0         0         0         0
#>   text2 0.4771213 0.4771213 0         0         0         0         0
#>   text3 0         0         0.4771213 0.4771213 0.4771213 0.4771213 0.4771213
#>   text4 0         0         0         0         0         0         0
#>   text5 0         0         0         0         0         0         0
#>   text6 0         0         0         0         0         0         0
#>        features
#> docs    good
#>   text1 0
#>   text2 0
#>   text3 0
#>   text4 0
#>   text5 0.4771213
#>   text6 0
#> [ reached max_ndoc ... 2 more documents ]
Created on 2022-06-16 by the reprex package (v2.0.1)