I have a simple data frame of short strings, each assigned to a specific class:
datadb <- data.frame (
Class = c('Class1', 'Class2', 'Class3'),
Document = c('This is test', 'Yet another test', 'A last test')
)
datadb$Document <- tolower(datadb$Document)
datadb$Tokens <- strsplit(datadb$Document, " ")
From this, I want to build another data frame that keeps the original Class column but adds a new column for each unique token, like this:
all_tokens <- unlist(datadb$Tokens)
all_tokens <- unique(all_tokens)
number_of_columns <- length(all_tokens)
number_of_rows <- NROW(datadb)
tokenDB <- data.frame( matrix(ncol=(1 + number_of_columns), nrow=number_of_rows) )
names(tokenDB) <- c("Classification", all_tokens)
tokenDB$Classification <- datadb$Class
tokenDB will then look like this:
  Classification this is test yet another  a last
1         Class1   NA NA   NA  NA      NA NA   NA
2         Class2   NA NA   NA  NA      NA NA   NA
3         Class3   NA NA   NA  NA      NA NA   NA
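How do I loop over the original data frame and fill the new tokenDB with a value for each token found in each document? The output should look like this: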
  Classification this is test yet another a last
1         Class1    1  1    1   0       0 0    0
2         Class2    0  0    1   1       1 0    0
3         Class3    0  0    1   0       0 1    1
Ideally the output would be a data.frame, but a matrix is also fine.
Use the tm package, or any other text-mining package, to do the work. I prefer tm. What you are building is a document-term matrix.
library(tm)
datadb <- data.frame (
Class = c('Class1', 'Class2', 'Class3'),
Document = c('This is test', 'Yet another test', 'A last test')
)
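# Build a corpus with one document per row of datadb
corpus <- Corpus(VectorSource(datadb$Document))
# tm's defaults lowercase terms and drop those shorter than three
# characters, which is why "is" and "a" are missing from the output below
dtm <- DocumentTermMatrix(corpus)
# cbind() returns a matrix; if Class is a factor (the default before R 4.0),
# its integer codes 1, 2, 3 are stored instead of the labels
dtm2 <- cbind(datadb$Class, as.matrix(dtm))
colnames(dtm2) <- c("Classification", colnames(dtm))
dtm2
#   Classification test this another yet last
# 1              1    1    1       0   0    0
# 2              2    1    0       1   1    0
# 3              3    1    0       0   0    1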
Here is another approach that uses only base R:
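# Split each document into tokens (case is preserved here)
txt <- strsplit(as.character(datadb$Document), " ")
# Attach a count of 1 to every token, then sum the counts per token
txt <- lapply(txt, function(x) data.frame(x, count = 1))
txt <- lapply(txt, function(x) data.frame(count = tapply(x$count, x$x, sum)))
# Outer-join the per-document counts on the token column x
tdm <- Reduce(function(...) merge(..., all=TRUE, by="x"),
              lapply(txt, function(x) data.frame(x=rownames(x), count=x$count)))
rownames(tdm) <- tdm[, 1]
# Transpose so that documents are rows; tokens absent from a document become 0
dtm3 <- t(tdm[, -1])
dtm3[is.na(dtm3)] <- 0
rownames(dtm3) <- paste("Doc", 1:3)
dtm3 <- cbind(Classification=datadb$Class, dtm3)
dtm3
#       Classification is test This another Yet A last
# Doc 1              1  1    1    1       0   0 0    0
# Doc 2              2  0    1    0       1   1 0    0
# Doc 3              3  0    1    0       0   0 1    1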
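The merge-based construction above can probably be written more compactly with table(). The following is only a minimal sketch, not part of the answer above: the names tokens, doc_id, counts, and dtm4 are chosen here for illustration, and the documents are lowercased as in the question. Note that it stores token counts rather than pure presence flags; for these three documents the two coincide.

```r
datadb <- data.frame(
  Class = c("Class1", "Class2", "Class3"),
  Document = c("This is test", "Yet another test", "A last test"),
  stringsAsFactors = FALSE
)

# Lowercase and split each document into tokens
tokens <- strsplit(tolower(datadb$Document), " ")
all_tokens <- unique(unlist(tokens))

# Document index for every token, in the same order as unlist(tokens)
doc_id <- rep(seq_along(tokens), lengths(tokens))

# Cross-tabulate documents against tokens; the factor levels fix the column order
counts <- table(doc_id, factor(unlist(tokens), levels = all_tokens))

# Convert the table to a data frame and prepend the class labels
dtm4 <- data.frame(
  Classification = datadb$Class,
  as.data.frame.matrix(counts),
  check.names = FALSE,
  stringsAsFactors = FALSE
)
dtm4
```

Because the class labels go into a data frame rather than through cbind() on a matrix, they stay as "Class1", "Class2", "Class3" instead of being coerced.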
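Another option reuses datadb$Tokens, all_tokens, and tokenDB from the question and matches each document's tokens against all_tokens:
# Positions within all_tokens of the tokens in each document
k <- lapply(datadb$Tokens, match, all_tokens)
# For each document, set the matched positions to 1 and all others to 0
tokenDB[, -1] <- t(mapply(function(x, y) { y[x] <- 1; y[-x] <- 0; y },
                          k, data.frame(t(tokenDB[, -1]))))
tokenDB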
  Classification this is test yet another a last
1         Class1    1  1    1   0       0 0    0
2         Class2    0  0    1   1       1 0    0
3         Class3    0  0    1   0       0 1    1
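The same presence table can also be built in one pass, without pre-allocating tokenDB, by testing each token with %in%. This is a minimal sketch under the question's setup rather than a definitive implementation; the name presence is introduced here for illustration only.

```r
datadb <- data.frame(
  Class = c("Class1", "Class2", "Class3"),
  Document = c("This is test", "Yet another test", "A last test"),
  stringsAsFactors = FALSE
)

# Lowercase and split each document into tokens, as in the question
tokens <- strsplit(tolower(datadb$Document), " ")
all_tokens <- unique(unlist(tokens))

# For each document, flag every known token as present (1) or absent (0).
# sapply() returns one column per document, so transpose to one row per document.
presence <- t(sapply(tokens, function(tok) as.integer(all_tokens %in% tok)))
colnames(presence) <- all_tokens

# check.names = FALSE keeps the token strings as column names unchanged
tokenDB <- data.frame(
  Classification = datadb$Class,
  presence,
  check.names = FALSE,
  stringsAsFactors = FALSE
)
tokenDB
```

This should reproduce the 0/1 layout requested in the question, with the result as a data.frame. Unlike a count-based matrix, a token that occurs twice in one document is still recorded as 1.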