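I'm trying to build a term-document matrix that lists all the unigrams in a corpus but also extracts a specific list of bigrams. For example, for the sentence "use your turn signal", it would list "use", "your", and "turn signal".
In the documentation, the example tokenizer they give is: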
strsplit_space_tokenizer <- function(x) unlist(strsplit(as.character(x), "[[:space:]]+"))
Any ideas on how to write a tokenizer that finds a given vector of bigrams and returns everything else as unigrams?
Thanks!
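Here is one possible strategy. Basically, you go through the text, find the bigrams, and replace each one with something that won't be split on whitespace (here I use "{0}"-style placeholders, where the actual number is the index of the bigram in the list). Then you split the string, go through the pieces, and replace the "{0}" values with the corresponding bigrams. For example, here is a function that builds a tokenizer from a vector of bigrams:
getBigramTokenizer <- function(bigrams=character(0)) {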
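  force(bigrams)
  return(function(x) {
    # replace each bigram with a placeholder such as "{1}" that contains no space
    x <- Reduce(function(a,b)
      gsub(bigrams[b], paste0("{",b,"}"), a, fixed=TRUE),
      seq_along(bigrams), x)
    # split on whitespace
    x <- unlist(strsplit(as.character(x), "[[:space:]]+"))
    # find the placeholders and swap the original bigrams back in
    m <- regexec("\\{(\\d+)\\}", x)
    i <- which(sapply(m, '[', 1) != -1)
    mi <- sapply(regmatches(x,m)[i], '[', 2)
    x[i] <- bigrams[as.numeric(mi)]
    x
  })
}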
Now we can use it like this:
bigrams <- c("turn signal", "back seat", "buckle up")
tk <- getBigramTokenizer(bigrams)
tk("use your turn signal")
# [1] "use" "your" "turn signal"
tk("please buckle up in the back seat")
# [1] "please" "buckle up" "in" "the" "back seat"
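To get from the tokenizer to an actual term-by-document count matrix, something like the following base R sketch might work. It does not use tm; the sample documents and the names docs, tokens, terms and counts are made up for illustration, and the tokenizer is repeated only so the snippet runs on its own.

```r
# Tokenizer from the answer above, repeated so this sketch is self-contained.
getBigramTokenizer <- function(bigrams = character(0)) {
  force(bigrams)
  function(x) {
    # Swap each bigram for a placeholder such as "{1}" that contains no space.
    x <- Reduce(function(a, b) gsub(bigrams[b], paste0("{", b, "}"), a, fixed = TRUE),
                seq_along(bigrams), x)
    x <- unlist(strsplit(as.character(x), "[[:space:]]+"))
    # Map the placeholders back to the bigrams they stand for.
    m <- regexec("\\{(\\d+)\\}", x)
    i <- which(sapply(m, `[`, 1) != -1)
    mi <- sapply(regmatches(x, m)[i], `[`, 2)
    x[i] <- bigrams[as.numeric(mi)]
    x
  }
}

tk <- getBigramTokenizer(c("turn signal", "back seat", "buckle up"))

# Hypothetical sample documents; the names become the column labels.
docs <- c(doc1 = "use your turn signal",
          doc2 = "please buckle up in the back seat",
          doc3 = "buckle up for safety")

tokens <- lapply(docs, tk)              # one character vector of terms per document
terms  <- sort(unique(unlist(tokens)))  # vocabulary: unigrams plus the listed bigrams

# Count each term per document; rows are terms, columns are documents.
counts <- vapply(tokens,
                 function(tok) tabulate(match(tok, terms), nbins = length(terms)),
                 integer(length(terms)))
rownames(counts) <- terms
counts
```

This gives a plain integer matrix rather than a tm TermDocumentMatrix object, so it is only a quick way to check that the bigrams stay together as single terms.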
If I understand correctly, qdap version 2.1.1 can also help here:
library(tm)
library(qdap)
## the bigrams
bigrams <- c("turn signal", "back seat", "buckle up")
## fake data (MWE)
dat <- data.frame(docs=paste0("doc", 1:5),
    state=c("use your turn signal",
        "please buckle up in the back seat",
        "buckle up for safety",
        "Sit in the back seat",
        "here it is"
    )
)
## make data into a Corpus
myCorp <- as.Corpus(dat$state, dat$docs)
myDF <- as.data.frame(myCorp)
f <- sub_holder(bigrams, myDF$text)
tdm <- as.tdm(f$output, myDF$docs)
rownames(tdm) <- f$unhold(rownames(tdm))
inspect(tdm)
##              Docs
## Terms         doc1 doc2 doc3 doc4 doc5
##   for            0    0    1    0    0
##   here           0    0    0    0    1
##   in             0    1    0    1    0
##   is             0    0    0    0    1
##   it             0    0    0    0    1
##   please         0    1    0    0    0
##   turn signal    1    0    0    0    0
##   back seat      0    1    0    1    0
##   buckle up      0    1    1    0    0
##   safety         0    0    1    0    0
##   sit            0    0    0    1    0
##   the            0    1    0    1    0
##   use            1    0    0    0    0
##   your           1    0    0    0    0