Building Term-Document Matrix with specific tokens (and all the rest)

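I am trying to build a term-document matrix that lists every unigram in the corpus but also extracts a specific list of bigrams. For example, for the sentence "use your turn signal", it would list "use", "your", and "turn signal".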

In the documentation, the sample tokenizer they provide is:

strsplit_space_tokenizer <- function(x) unlist(strsplit(as.character(x), "[[:space:]]+"))

Any ideas on how to write a tokenizer that finds a given vector of bigrams and returns everything else as unigrams?

Thanks!

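Here is one possible strategy. Basically, you go through the text, find the bigrams, and replace each one with something that will not split on whitespace (here I use "{0}", where the actual number is the index of the bigram in the list). Then you split the string, go through the tokens, and replace each "{0}" placeholder with the corresponding bigram. For example, here is a function that builds a tokenizer from a vector of bigrams: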

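getBigramTokenizer <- function(bigrams = character(0)) {
    force(bigrams)
    return(function(x) {
        # Swap each bigram for an index placeholder such as "{1}"
        x <- Reduce(function(a, b)
            gsub(bigrams[b], paste0("{", b, "}"), a, fixed = TRUE),
            seq_along(bigrams), x)
        # Split on whitespace; each placeholder survives as a single token
        x <- unlist(strsplit(as.character(x), "[[:space:]]+"))
        # Locate placeholder tokens and capture the index inside the braces
        m <- regexec("\\{(\\d+)\\}", x)
        i <- which(sapply(m, '[', 1) != -1)
        mi <- sapply(regmatches(x, m)[i], '[', 2)
        # Put the original bigram text back
        x[i] <- bigrams[as.numeric(mi)]
        x
    })
}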

Now we can use it like this:

bigrams <- c("turn signal", "back seat", "buckle up")
tk <- getBigramTokenizer(bigrams)
tk("use your turn signal")
# [1] "use"         "your"        "turn signal"
tk("please buckle up in the back seat")
# [1] "please"    "buckle up" "in"        "the"       "back seat"
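
To go from tokens to an actual term-document matrix without extra packages, the tokenizer output can be tabulated per document. What follows is a minimal base R sketch and not part of the original answer: the three sample documents and the names `docs`, `tokens`, `terms`, and `tdm` are assumptions chosen for illustration, and the tokenizer is restated in a compact form (with an anchored placeholder match) only so that the block runs on its own.

```r
# Compact restatement of the placeholder-based tokenizer (illustrative variant).
getBigramTokenizer <- function(bigrams = character(0)) {
    force(bigrams)
    function(x) {
        x <- as.character(x)
        # Replace each bigram with an index placeholder such as "{1}"
        for (b in seq_along(bigrams)) {
            x <- gsub(bigrams[b], paste0("{", b, "}"), x, fixed = TRUE)
        }
        x <- unlist(strsplit(x, "[[:space:]]+"))
        # Tokens that consist solely of a placeholder get their bigram back
        i <- grepl("^\\{[0-9]+\\}$", x)
        x[i] <- bigrams[as.numeric(gsub("[{}]", "", x[i]))]
        x
    }
}

# Hypothetical sample documents, named so the names become column labels
docs <- c(doc1 = "use your turn signal",
          doc2 = "please buckle up in the back seat",
          doc3 = "buckle up for safety")

tk <- getBigramTokenizer(c("turn signal", "back seat", "buckle up"))

# Tokenize each document, then count every term of the shared vocabulary
tokens <- lapply(docs, tk)
terms <- sort(unique(unlist(tokens)))
tdm <- vapply(tokens,
              function(tok) as.integer(table(factor(tok, levels = terms))),
              integer(length(terms)))
rownames(tdm) <- terms
tdm
```

The result is a plain integer matrix with terms as rows and documents as columns. For a large corpus, a sparse representation would probably be preferable.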

If I understand correctly, qdap version 2.1.1 can also help here:

library(tm)
library(qdap)
## the bigrams
bigrams <- c("turn signal", "back seat", "buckle up")
## fake data (MWE)
dat <- data.frame(docs=paste0("doc", 1:5), 
    state=c("use your turn signal",
        "please buckle up in the back seat",
        "buckle up for safety",
        "Sit in the back seat",
        "here it is"
    )
)
## make data into a Corpus
myCorp <- as.Corpus(dat$state, dat$docs)
myDF <- as.data.frame(myCorp)
f <- sub_holder(bigrams, myDF$text)
tdm <- as.tdm(f$output, myDF$docs)
rownames(tdm) <- f$unhold(rownames(tdm))
inspect(tdm)
##              Docs
## Terms         doc1 doc2 doc3 doc4 doc5
##   for            0    0    1    0    0
##   here           0    0    0    0    1
##   in             0    1    0    1    0
##   is             0    0    0    0    1
##   it             0    0    0    0    1
##   please         0    1    0    0    0
##   turn signal    1    0    0    0    0
##   back seat      0    1    0    1    0
##   buckle up      0    1    1    0    0
##   safety         0    0    1    0    0
##   sit            0    0    0    1    0
##   the            0    1    0    1    0
##   use            1    0    0    0    0
##   your           1    0    0    0    0
